Discreet Log #6: A Tour of the Cwtch Integration Test
30 Apr 2021
I previously wrote about how we built our automated build and test system and the quality benefits it gave us. Today I’d like to zoom in on one of my favourite pieces of our quality assurance infrastructure: the Cwtch integration test. This test is important because we wrote it to exercise as much of the Cwtch code base as possible in a single run. It also sits on top of several other crucial components, such as Tapir and our Connectivity package, and so gives them extra coverage and a workout too.
Broad integration test coverage helps make sure that code we don’t use in our daily development doesn’t get silently broken by accident. For Cwtch, our core library, covering the protocol, engine, storage, peer logic, and app interfaces is of the utmost importance. So without further ado, welcome to the tour of the core Cwtch integration test.
Setting up the Network
func TestCwtchPeerIntegration(t *testing.T) {
numGoRoutinesStart := runtime.NumGoroutine()
...
tor.NewTorrc().WithSocksPort(socksPort).WithOnionTrafficOnly().WithHashedPassword(base64.StdEncoding.EncodeToString(key)).WithControlPort(controlPort).Build("tordir/tor/torrc")
acn, err := tor.NewTorACNWithAuth("./tordir", path.Join("..", "tor"), controlPort, tor.HashedPasswordAuthenticator{Password: base64.StdEncoding.EncodeToString(key)})
if err != nil {
t.Fatalf("Could not start Tor: %v", err)
}
...
The test starts with some boilerplate setup that any app would need to do, such as checking for and creating the directories used to store app data, assigning port numbers, and generating a Tor password. This all culminates in a call to NewTorACNWithAuth, which comes from our separate connectivity package that supplies Anonymous Communication Network (ACN) primitives (currently only Tor is supported). If this succeeds, we have a properly set up ACN using Tor and the directories in place to start some Cwtch apps.
Also of note: throughout the run of the test, starting at the very beginning, we record the number of goroutines so that at the end we can account for all of them and catch any goroutine leak, in addition to providing metrics on use. This tracking has caught a lot of leaks over the years, never more so than when we first wrote it and immediately had to track down several leaks it discovered.
Setting up the Server
var server *cwtchserver.Server
var serverAddr string
var serverKeyBundle []byte
// launch app with new key
server = new(cwtchserver.Server)
fmt.Println("Starting cwtch server...")
os.Remove("server-test.json")
config := cwtchserver.LoadConfig(".", "server-test.json")
identity := config.Identity()
serverAddr = identity.Hostname()
server.Setup(config)
serverKeyBundle, _ = json.Marshal(server.KeyBundle())
log.Debugf("server key bundle %s", serverKeyBundle)
go server.Run(acn)
// let tor get established
fmt.Printf("Establishing Tor hidden service: %v...\n", serverAddr)
numGoRoutinesPostServer := runtime.NumGoroutine()
We set up a Cwtch server to host a group. By using a group instead of just a quicker P2P message we cover a whole host of group and server functionality with this test.
Setting up the App
app := app2.NewApp(acn, "./storage")
...
bridgeClient := bridge.NewPipeBridgeClient(path.Join(cwtchDir, "testing/clientPipe"), path.Join(cwtchDir, "testing/servicePipe"))
bridgeService := bridge.NewPipeBridgeService(path.Join(cwtchDir, "testing/servicePipe"), path.Join(cwtchDir, "testing/clientPipe"))
appClient := app2.NewAppClient("./storage", bridgeClient)
appService := app2.NewAppService(acn, "./storage", bridgeService)
numGoRoutinesPostAppStart := runtime.NumGoroutine()
// ***** cwtchPeer setup *****
fmt.Println("Creating Alice...")
app.CreatePeer("alice", "asdfasdf")
fmt.Println("Creating Bob...")
app.CreatePeer("bob", "asdfasdf")
fmt.Println("Creating Carol...")
appClient.CreatePeer("carol", "asdfasdf")
Here we create a Cwtch app instance and the first two profiles, Alice and Bob. We then create an app client and an app service, and use the client to create a third profile, Carol.
That last step requires some explanation; Desktop Cwtch uses the all-in-one app, but for Android we require a split between parts of the application so that only a small part of Cwtch can run in the background to conserve battery and allow Cwtch to resume as seamlessly as possible.
Having Carol test out the split application architecture has been very valuable, as we don’t often run this app mode during desktop-based development. On many occasions the integration test has been the sole early indicator that some logic might fail on Android, without the overhead of constant Android testing.
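As a rough picture of what a split like this looks like, here is a toy sketch in which a client and a service talk only through a bridge. Everything in it is an assumption made for illustration: the request, event, bridge, service, and client types are hypothetical, the bridge is a pair of Go channels rather than the pipes the test uses, and none of it reflects Cwtch’s actual message format or API.

```go
package main

import "fmt"

// request is a hypothetical message from the UI-facing client to the
// background service.
type request struct {
	op   string
	name string
}

// event is a hypothetical reply from the service back to the client.
type event struct {
	kind string
	name string
}

// bridge is a toy in-process stand-in for a client/service transport.
type bridge struct {
	toService chan request
	toClient  chan event
}

func newBridge() *bridge {
	return &bridge{toService: make(chan request), toClient: make(chan event)}
}

// service owns the profile state; in a split app it is the part that
// would keep running in the background.
type service struct {
	profiles map[string]bool
}

// run handles requests until the client closes its side of the bridge.
func (s *service) run(b *bridge) {
	for req := range b.toService {
		switch req.op {
		case "create-peer":
			s.profiles[req.name] = true
			b.toClient <- event{kind: "peer-created", name: req.name}
		default:
			b.toClient <- event{kind: "error", name: req.name}
		}
	}
	close(b.toClient)
}

// client holds no profile state of its own; it only forwards requests.
type client struct {
	b *bridge
}

func (c *client) createPeer(name string) event {
	c.b.toService <- request{op: "create-peer", name: name}
	return <-c.b.toClient
}

func main() {
	b := newBridge()
	svc := &service{profiles: map[string]bool{}}
	go svc.run(b)

	c := &client{b: b}
	ev := c.createPeer("carol")
	fmt.Println(ev.kind, ev.name)

	close(b.toService) // lets the service loop finish
}
```

The point of the sketch is that every operation has to survive a round trip across the bridge, so logic that works when called directly can still fail in split mode. That is the class of problem a split-mode profile in an integration test is positioned to catch.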
Launching the Peers
app.LaunchPeers()
appClient.LaunchPeers()
...
numGoRoutinesPostPeerStart := runtime.NumGoroutine()
fmt.Println("Alice joining server...")
if err := alice.AddServer(string(serverKeyBundle)); err != nil {
t.Fatalf("Failed to Add Server Bundle %v", err)
}
alice.JoinServer(serverAddr)
fmt.Println("Alice peering with Bob...")
alice.PeerWithOnion(bob.GetOnion())
fmt.Println("Alice peering with Carol...")
alice.PeerWithOnion(carol.GetOnion())
Alice connects to the server and peers with Bob and Carol.
Starting a Group
groupID, _, err := alice.StartGroup(serverAddr)
fmt.Printf("Created group: %v!\n", groupID)
if err != nil {
t.Errorf("Failed to init group: %v", err)
return
}
bob.AddContact("alice?", alice.GetOnion(), model.AuthApproved)
bob.AddServer(string(serverKeyBundle))
bob.SetContactAuthorization(alice.GetOnion(), model.AuthApproved)
waitForPeerPeerConnection(t, alice, carol)
carol.AddContact("alice?", alice.GetOnion(), model.AuthApproved)
carol.AddServer(string(serverKeyBundle))
carol.SetContactAuthorization(alice.GetOnion(), model.AuthApproved)
alice.SendGetValToPeer(bob.GetOnion(), attr.PublicScope, "name")
bob.SendGetValToPeer(alice.GetOnion(), attr.PublicScope, "name")
alice.SendGetValToPeer(carol.GetOnion(), attr.PublicScope, "name")
carol.SendGetValToPeer(alice.GetOnion(), attr.PublicScope, "name")
Alice creates a group on the server. Bob and Carol add Alice as a contact, and then we follow the same process the UI does to exchange peer attributes, here each peer’s public name. Following this, we test that each peer received the other’s attributes, which gives us coverage of sending peer-to-peer messages and of the whole attribute processing system.
Sending out the Invites
fmt.Println("Alice inviting Bob to group...")
err = alice.InviteOnionToGroup(bob.GetOnion(), groupID)
Alice now uses the peer connection to invite Bob to the group on the server. After this comes some code that manually finds the invite on the Bob peer and accepts it, mimicking a client surfacing the invite to a real person who then accepts it.
Time to Chat
_, err = alice.SendMessageToGroupTracked(groupID, aliceLines[0])
...
_, err = bob.SendMessageToGroupTracked(groupID, bobLines[0])
...
alice.SendMessageToGroupTracked(groupID, aliceLines[1])
...
bob.SendMessageToGroupTracked(groupID, bobLines[1])
Alice and Bob take turns saying pre-canned lines (so we can later check that the others saw them). We check that each was able to send their first message, and leave the remaining group timeline checks until the end, where they can catch additional errors.
Growing
err = alice.InviteOnionToGroup(carol.GetOnion(), groupID)
if err != nil {
t.Fatalf("Error for Alice inviting Carol to group: %v", err)
}
fmt.Println("Carol examining groups and accepting invites...")
for _, groupID := range carol.GetGroups() {
group := carol.GetGroup(groupID)
fmt.Printf("Carol group: %v (Accepted: %v)\n", group.GroupID, group.Accepted)
if !group.Accepted {
fmt.Printf("Carol received and accepting group invite: %v\n", group.GroupID)
carol.AcceptInvite(group.GroupID)
}
}
fmt.Println("Shutting down Alice...")
app.ShutdownPeer(alice.GetOnion())
numGoRoutinesPostAlice := runtime.NumGoroutine()
fmt.Println("Carol joining server...")
carol.JoinServer(serverAddr)
waitForPeerGroupConnection(t, carol, groupID)
numGoRotinesPostCarolConnect := runtime.NumGoroutine()
Alice, done talking with Bob, now invites Carol to the group. This sets up our chance to test group history resumption. Alice and Bob should both have this history, as they were both in the group for it, but if we can later confirm that Carol has it too, then fetching history is covered and working. Then Alice shuts down and Carol joins the group. It really shouldn’t be needed, but this also confirms that peer shutdown works: Alice should not receive any further group history while Bob and Carol talk.
Time to Verify
bob.SendMessageToGroupTracked(groupID, bobLines[2])
...
carol.SendMessageToGroupTracked(groupID, carolLines[0])
fmt.Printf("Alice's TimeLine:\n")
aliceVerified := printAndCountVerifedTimeline(t, alicesGroup.GetTimeline())
if aliceVerified != 4 {
t.Errorf("Alice did not have 4 verified messages")
}
...
fmt.Printf("Bob's TimeLine:\n")
bobVerified := printAndCountVerifedTimeline(t, bobsGroup.GetTimeline())
if bobVerified != 6 {
t.Errorf("Bob did not have 6 verified messages")
}
...
if len(alicesGroup.GetTimeline()) != 4 {
t.Errorf("Alice's timeline does not have all messages")
} else {
// check message 0,1,2,3
aliceGroupTimeline := alicesGroup.GetTimeline()
if aliceGroupTimeline[0].Message != aliceLines[0] || aliceGroupTimeline[1].Message != bobLines[0] ||
aliceGroupTimeline[2].Message != aliceLines[1] || aliceGroupTimeline[3].Message != bobLines[1] {
t.Errorf("Some of Alice's timeline messages did not have the expected content!")
}
}
if len(bobsGroup.GetTimeline()) != 6 {
t.Errorf("Bob's timeline does not have all messages")
} else {
// check message 0,1,2,3,4,5
bobGroupTimeline := bobsGroup.GetTimeline()
if bobGroupTimeline[0].Message != aliceLines[0] || bobGroupTimeline[1].Message != bobLines[0] ||
bobGroupTimeline[2].Message != aliceLines[1] || bobGroupTimeline[3].Message != bobLines[1] ||
bobGroupTimeline[4].Message != bobLines[2] || bobGroupTimeline[5].Message != carolLines[0] {
t.Errorf("Some of Bob's timeline messages did not have the expected content!")
}
}
if len(carolsGroup.GetTimeline()) != 6 {
t.Errorf("Carol's timeline does not have all messages")
} else {
// check message 0,1,2,3,4,5
carolGroupTimeline := carolsGroup.GetTimeline()
if carolGroupTimeline[0].Message != aliceLines[0] || carolGroupTimeline[1].Message != bobLines[0] ||
carolGroupTimeline[2].Message != aliceLines[1] || carolGroupTimeline[3].Message != bobLines[1] ||
carolGroupTimeline[4].Message != bobLines[2] || carolGroupTimeline[5].Message != carolLines[0] {
t.Errorf("Some of Carol's timeline messages did not have the expected content!")
}
}
We check that each peer has the correct message count in their individual timeline and then further verify that each message has the expected contents. This is where we can catch lots of errors if there are protocol or message problems.
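The three checks above share the same shape: compare the length, then compare each entry in order. As a sketch of how that could be factored into one helper, here is a small example. The message type is a hypothetical stand-in that models only the Message field used above, checkTimeline is a made-up name, and the pre-canned lines are invented for the example.

```go
package main

import "fmt"

// message is a hypothetical stand-in for a timeline entry; only the
// text of the message is modelled here.
type message struct {
	Message string
}

// checkTimeline compares a timeline against the expected lines in order.
// It returns a description of the first problem found, or "" on a match.
func checkTimeline(owner string, timeline []message, want []string) string {
	if len(timeline) != len(want) {
		return fmt.Sprintf("%s's timeline has %d messages, want %d", owner, len(timeline), len(want))
	}
	for i, line := range want {
		if timeline[i].Message != line {
			return fmt.Sprintf("%s's message %d is %q, want %q", owner, i, timeline[i].Message, line)
		}
	}
	return ""
}

func main() {
	// Invented pre-canned lines; the real test defines its own.
	aliceLines := []string{"Hello, I'm Alice", "Nice to meet you"}
	bobLines := []string{"Hi, I'm Bob", "Likewise"}

	// The order in which the lines were sent during the conversation.
	want := []string{aliceLines[0], bobLines[0], aliceLines[1], bobLines[1]}
	timeline := []message{{aliceLines[0]}, {bobLines[0]}, {aliceLines[1]}, {bobLines[1]}}

	if problem := checkTimeline("Alice", timeline, want); problem != "" {
		fmt.Println(problem)
		return
	}
	fmt.Println("Alice's timeline matches")
}
```

In a test, a non-empty result would be passed to t.Errorf. Reporting the index and both strings makes a mismatch much quicker to diagnose than a single "some messages did not match" error.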
Final Checks
app.ShutdownPeer(bob.GetOnion())
numGoRoutinesPostBob := runtime.NumGoroutine()
if server != nil {
fmt.Println("Shutting down server...")
server.Shutdown()
time.Sleep(time.Second * 3)
}
numGoRoutinesPostServerShutdown := runtime.NumGoroutine()
fmt.Println("Shutting down Carol...")
appClient.ShutdownPeer(carol.GetOnion())
numGoRoutinesPostCarol := runtime.NumGoroutine()
fmt.Println("Shutting down apps...")
fmt.Printf("app Shutdown: %v\n", runtime.NumGoroutine())
app.Shutdown()
fmt.Printf("appClientShutdown: %v\n", runtime.NumGoroutine())
appClient.Shutdown()
fmt.Printf("appServiceShutdown: %v\n", runtime.NumGoroutine())
appService.Shutdown()
fmt.Printf("bridgeClientShutdown: %v\n", runtime.NumGoroutine())
bridgeClient.Shutdown()
fmt.Printf("bridgeServiceShutdown: %v\n", runtime.NumGoroutine())
bridgeService.Shutdown()
fmt.Printf("Done shutdown: %v\n", runtime.NumGoroutine())
numGoRoutinesPostAppShutdown := runtime.NumGoroutine()
fmt.Println("Shutting down ACN...")
acn.Close()
numGoRoutinesPostACN := runtime.NumGoroutine()
We shut down Bob next, then the server, and finally Carol. When writing integration tests it’s always good to subvert the logical order (e.g. by shutting down the server in between the clients, so you can cover events such as a profile handling the server going away). After tearing down Carol we shut down the rest of the Cwtch application infrastructure: the app, the app service and client, their communication bridge, and finally the ACN.
Metrics and Resource Leaks
pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
fmt.Printf("numGoRoutinesStart: %v\nnumGoRoutinesPostServer: %v\nnumGoRoutinesPostAppStart: %v\nnumGoRoutinesPostPeerStart: %v\nnumGoRoutinesPostPeerAndServerConnect: %v\n"+
"numGoRoutinesPostAlice: %v\nnumGoRotinesPostCarolConnect: %v\nnumGoRoutinesPostBob: %v\nnumGoRoutinesPostServerShutdown: %v\nnumGoRoutinesPostCarol: %v\nnumGoRoutinesPostAppShutdown: %v\nnumGoRoutinesPostACN: %v\n",
numGoRoutinesStart, numGoRoutinesPostServer, numGoRoutinesPostAppStart, numGoRoutinesPostPeerStart, numGoRoutinesPostServerConnect,
numGoRoutinesPostAlice, numGoRotinesPostCarolConnect, numGoRoutinesPostBob, numGoRoutinesPostServerShutdown, numGoRoutinesPostCarol, numGoRoutinesPostAppShutdown, numGoRoutinesPostACN)
if numGoRoutinesStart != numGoRoutinesPostACN {
t.Errorf("Number of GoRoutines at start (%v) does not match number of goRoutines after cleanup of peers and servers (%v), clean up failed, leak detected!", numGoRoutinesStart, numGoRoutinesPostACN)
}
Lastly we print a profile of goroutines and a log of the goroutine measurements taken throughout the test’s run, and check that we are now back to the starting number of goroutines. As mentioned previously, this final check has been incredibly valuable in terms of preventing easily introduced resource leaks, and I would strongly recommend anyone working with Go give this a try in their tests as an extra layer of coverage.
Continuous Integration
Integration tests aren’t worth much if they aren’t run, and we can’t rely on ourselves to always run slow or long tests while we’re developing, so we need to make sure they are plugged into our build automation. In our case they are run automatically against every pull request on our build server, which reports back to the PR.
Continuous Investment
Integration tests are an investment, and the cost is often steepest at first, when you are introducing them to an existing code base that needs coverage. As Cwtch has grown we’ve continually revisited them to cover new features.
In a case of “not quite” Test Driven Development, the Cwtch integration test became our primary library client before full Client UI work began. As soon as we wrote it, it started paying off. It has found protocol bugs that our unit tests couldn’t catch, and bugs that could only be discovered by spinning up peers and a server and making them interact over a group. It immediately caught resource leaks.
Coupled with our automation, it’s been one of our most important pieces of quality assurance. If you work on software and don’t have integration tests with full end-to-end coverage of usage, especially if you work on a piece of networked software, I really hope this might inspire you to take a look at starting.
Quality work takes time and resources; if you’d like to help us with Cwtch, please consider donating!