Tuesday, July 10, 2012

Thunderbird and testing

Thunderbird has come a long way in its automated test suite since I started working on it 5 years ago. Back then, much of our code was untestable and it was rare that a patch added tests. Now, our code coverage results look like this. It is almost unthinkable to have a patch that doesn't have a test, and there are only a few places in our code where testing is impossible. Now I'm going to propose how to fill in these gaps.

LDAP

Ah, LDAP. The big red part of comm-central whenever I make my coverage treemaps. The problem here could be solved if we had an LDAP fakeserver; having written both IMAP and NNTP servers, this shouldn't be hard, right? Except that LDAP is not built on a text-based layer that you can emulate with telnet, but on an over-engineered notation called ASN.1 and, more specifically, one of its binary encodings (BER). The underlying fakeserver technology is built with the assumption that I'm dealing with a CRLF-based protocol, but it turns out that, with some of my patches, it's actually easy to just pass through the binary data (yay for layering).

The full LDAP specification is actually quite complicated and relies on a lot of pieces, but the underlying model for an LDAP fakeserver could rather easily be controlled by just an LDIF file with perhaps a simplified schema model. At the very least, it's a usable start, and considering that the IMAP fakeserver still isn't RFC 3501-compliant 4 years later, it's good enough for testing.
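
As a rough illustration of what "controlled by just an LDIF file" could mean, here is a minimal Python sketch. It is not fakeserver code: the function names (parse_ldif, search), the sample entries, and the single-equality-filter search are all assumptions made for the example, and it ignores schemas, change records, and URL-valued attributes entirely.

```python
import base64

# Hypothetical sample data; the DNs and attributes are made up for the sketch.
SAMPLE = """\
version: 1

dn: ou=people,dc=example,dc=org
objectClass: organizationalUnit
ou: people

dn: cn=Test User,ou=people,dc=example,dc=org
objectClass: inetOrgPerson
cn: Test User
mail: test@example.org
description:: SGVsbG8=
"""


def parse_ldif(text):
    """Parse simplified LDIF into {dn: {attribute: [values]}}.

    Handles blank-line-separated entries, '#' comments, folded lines
    (continuations start with one space) and base64 values ('attr:: ...').
    """
    # Unfold continuation lines first.
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)

    directory, dn, entry = {}, None, None
    for line in lines + [""]:            # the trailing "" flushes the last entry
        if line.startswith("#"):
            continue
        if not line.strip():
            if dn is not None:
                directory[dn] = entry
            dn, entry = None, None
            continue
        attr, _, rest = line.partition(":")
        if rest.startswith(":"):         # 'attr:: <base64>'
            value = base64.b64decode(rest[1:].strip()).decode("utf-8")
        else:
            value = rest.strip()
        if attr.lower() == "dn":
            dn, entry = value, {}
        elif entry is not None:          # skips lines such as 'version: 1'
            entry.setdefault(attr.lower(), []).append(value)
    return directory


def search(directory, base, attr, value):
    """Subtree search with one case-insensitive equality filter."""
    base = base.lower()
    hits = []
    for dn, entry in directory.items():
        in_scope = dn.lower() == base or dn.lower().endswith("," + base)
        values = [v.lower() for v in entry.get(attr.lower(), [])]
        if in_scope and value.lower() in values:
            hits.append(dn)
    return hits
```

A test could then load a canned LDIF file, point the address book at the fakeserver, and compare autocomplete results against search() on the same data.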

Here, a big issue arises: the actual protocol decoding. I started by looking for a nice library I could use for ASN.1 decoding so I don't have to do it myself. I first played with using the LDAP lber routines via ctypes, but I found myself dissatisfied with how much work it took just to parse the login of the LDAP server. I then looked into NSS's structured ASN.1 decoding, even happening upon a nice set of templates for LDAP so I didn't have to try to build them with the lack of documentation, but it still ended up not working well, especially given the nice model of genericity I was looking for. I played around with a node-based LDAP server (especially annoying given the current name feud in Debian that prevents the nodejs package from migrating to testing). It worked well enough for an initial test, but the problem of either driving the server from xpcshell or writing node shims, combined with the fact that it only processes the protocol and has no usable backend, caused me to give up that path. Desperate, I even tried to find general BER-parsing libraries in JS on the web and discovered that the ones that were there couldn't quite cope with the format as we use it.
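
To give a sense of the size of the decoding problem, here is a minimal Python sketch of a generic BER tag-length-value decoder. It is an illustration only, not any of the libraries mentioned above and not the partial code I have: the names read_tlv and decode are made up, it only splits elements apart without interpreting INTEGERs or strings, and it rejects indefinite lengths because LDAP restricts BER to the definite form. The sample bytes are a hand-assembled anonymous simple bind (message ID 1, version 3, empty name and password).

```python
def read_tlv(data, offset=0):
    """Decode one BER element starting at `offset`.

    Returns (tag_class, constructed, tag_number, value_bytes, next_offset).
    """
    first = data[offset]
    offset += 1
    tag_class = first >> 6             # 0 universal, 1 application, 2 context, 3 private
    constructed = bool(first & 0x20)
    number = first & 0x1F
    if number == 0x1F:                 # high-tag-number form: base-128 digits follow
        number = 0
        while True:
            octet = data[offset]
            offset += 1
            number = (number << 7) | (octet & 0x7F)
            if not octet & 0x80:
                break
    length = data[offset]
    offset += 1
    if length == 0x80:
        raise ValueError("indefinite length is not supported")
    if length & 0x80:                  # long form: low 7 bits give the byte count
        count = length & 0x7F
        length = int.from_bytes(data[offset:offset + count], "big")
        offset += count
    end = offset + length
    if end > len(data):
        raise ValueError("truncated BER element")
    return tag_class, constructed, number, data[offset:end], end


def decode(data):
    """Decode BER into a list of (tag_class, tag_number, value) nodes.

    Constructed values become nested lists; primitive values stay as bytes.
    """
    nodes, offset = [], 0
    while offset < len(data):
        tag_class, constructed, number, value, offset = read_tlv(data, offset)
        nodes.append((tag_class, number, decode(value) if constructed else value))
    return nodes


# Hand-assembled anonymous simple BindRequest.
ANONYMOUS_BIND = bytes.fromhex("300c020101600702010304008000")
tree = decode(ANONYMOUS_BIND)
```

The generic tree is the easy half; mapping it onto LDAP's message definitions (and encoding responses) is where a reusable library would earn its keep.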

Conclusions: it's possible. The only real hard part is writing the BER parsing library myself. If anyone decides they want to work on this, I can send them the partial pieces of the puzzle to finish. If not, I'll probably nibble on this here and there over the next year or two.

MIME

MIME: that's well-tested per our testsuite, right? Well, not really. A lot of the testing is purely incidental: hooking the MIME library up to the IMAP fakeserver did a good job of flushing out a lot of issues, but you can also find lots of small details that no one's going to notice (charsets come to mind). It turns out that MIME is one of those protocols where everybody does the same thing slightly differently, and you end up accumulating a lot of random fixes to MIME. If you want to replace the module from scratch, you become terrified of finding random regressions in real-world mail.

Perhaps unsurprisingly, there are no test suites for proper MIME parsing on the web. There is one for RFC 2231 decoding (kind of). But there's nothing that tries to determine any of the following:

  • Charset detection, especially who gets priority when everyone conflicts
  • Whether a part is inline, attached, or not shown at all
  • How attachments get detected and handled
  • Test suites for the various crap that crops up when people fail at i18n
  • Text-to-html or HTML sanitization issues
  • Identifying headers properly (malformed References headers, etc.)
  • Pseudo-MIME constructs, like TNEF, uuencode, BinHex, or yEnc
  • S/MIME or PGP
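
One shape such a suite could take, so that people writing other MIME libraries could reuse it, is a list of raw messages paired with expected outcomes, with the library under test plugged in as a function. The Python sketch here uses the standard email package purely as a stand-in library. The case format, the field names, and the choice to report a missing Content-Disposition as "inline" are all assumptions of the example, not Thunderbird behavior.

```python
import email

# Each case: a raw message plus the outcome a parser is expected to report.
CASES = [
    {
        "name": "rfc2231-filename",
        "raw": (
            "Content-Type: text/plain; charset=utf-8\r\n"
            "Content-Disposition: attachment; filename*=UTF-8''na%C3%AFve.txt\r\n"
            "\r\n"
            "body\r\n"
        ),
        "expect": {
            "disposition": "attachment",
            "filename": "na\u00efve.txt",
            "charset": "utf-8",
        },
    },
    {
        "name": "no-disposition-header",
        "raw": "Content-Type: text/plain; charset=ISO-8859-1\r\n\r\nbody\r\n",
        "expect": {"disposition": "inline", "filename": None, "charset": "iso-8859-1"},
    },
]


def describe_with_stdlib(raw):
    """Adapter for the library under test (here: Python's email package)."""
    msg = email.message_from_string(raw)
    return {
        # Assumption of this sketch: no header means the part is shown inline.
        "disposition": msg.get_content_disposition() or "inline",
        "filename": msg.get_filename(),
        "charset": msg.get_content_charset(),
    }


def run(cases, describe):
    """Return (case name, actual result) for every case that fails."""
    failures = []
    for case in cases:
        got = describe(case["raw"])
        if got != case["expect"]:
            failures.append((case["name"], got))
    return failures


failures = run(CASES, describe_with_stdlib)
```

Keeping the cases as plain data means another library only has to supply its own describe function to run the same corpus.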

Issues relating to message display could be handled with a suite of reftests. A brief test confirms that reftest specifications accept absolute URLs, including the URLs that are used to drive the message UI (this can even test it from loading the offline protocol). Reftests even allow you to set prefs before specific tests; with a bit of sugaring around the reftest list, a MIME reftest is easily doable. Attachment and header handling could also follow a MIME reftest design, but I'm not sure that is the best design. I'd also like it to be the kind of test that other people who write MIME libraries could use.

The main issue here is seeding the repository with something useful. Sampling a variety of Usenet newsgroups (especially foreign-language hierarchies) should pick up something useful for basic charset, and I can get uuencode and yEnc by trawling through some binary newsgroups. For a focus on gmail, I could probably pick up some Google Groups things (especially if I recall the magic incantations that let me at actual RFC-822 objects). Random public mailing lists might find something useful. My own private email is unlikely to provide any useful test cases, since I tend to communicate with too homogeneous an environment (i.e., I don't get enough people using Outlook). Sanitizing all of this public stuff is also going to be a pain, especially with the emails that have DKIM.
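
For the sanitizing step, here is a minimal Python sketch of one possible approach, under stated assumptions: addresses are replaced by stable pseudonyms (so Message-ID and References still line up), and DKIM-Signature headers are dropped outright because the rewrite invalidates them anyway. The regular expression is deliberately crude, the function names are made up, and a real sanitizer would also need to deal with display names, Received headers, and encoded bodies.

```python
import hashlib
import re

# Crude address matcher; good enough for a sketch, not for RFC 5322.
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonym_for(address):
    """Map an address to a stable, meaningless replacement."""
    digest = hashlib.sha256(address.lower().encode("utf-8")).hexdigest()[:8]
    return "user-%s@example.invalid" % digest


def _replace(match):
    return pseudonym_for(match.group(0))


def sanitize(raw):
    """Strip DKIM-Signature headers and pseudonymize addresses in a CRLF message."""
    header, sep, body = raw.partition("\r\n\r\n")
    kept, skipping = [], False
    for line in header.split("\r\n"):
        is_continuation = line[:1] in (" ", "\t")
        if not is_continuation:
            # A new header starts; decide whether to drop it and its folded lines.
            skipping = line.lower().startswith("dkim-signature:")
        if not skipping:
            kept.append(line)
    clean_header = ADDRESS.sub(_replace, "\r\n".join(kept))
    return clean_header + sep + ADDRESS.sub(_replace, body)
```

Because the pseudonym is derived from the lowercased address, the same sender maps to the same replacement across a whole corpus, which keeps threading-related test cases meaningful.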

OS integration

OS integration is a nice header for everything that involves the actual OS: MAPI, import from standard system mail clients, integration with system address books. Unfortunately, my main development environment is Linux, where we have none of this stuff, so I can't really claim that I have a plan for testing here. Thanks to bug 731877, at least testing Outlook Express importing is a possibility, although true tests would probably require dumping some .psts into our tree, and we have no similar story for Mail.app. MAPI could be done with a mock app that exercises the MAPI interfaces; what it really comes down to is that we need to implement these APIs in a way that lets us test them by executing in various mock environments during tests.

Performance tests

The other major hole we have is performance. Firefox measures its performance with things like Talos; Thunderbird ought to have a similar testsuite of performance benchmarks. What kind of benchmarks are useful for Thunderbird testing though? Modulo debates over where exactly to place the endpoints on the following tests, I think the following is a good list:

  • Startup and shutdown time
  • Time to open a "large" folder (maybe requiring rebuild?) and mem usage in doing so
  • Doing message operations (mark as read, delete, move, copy, etc.) on several messages in a "large" folder. Possibly memory too
  • Time to select and display a "large" message (inline parts), as well as detach/delete attachments on said message
  • Cross-folder message search (with/without gloda?)
  • Some sort of database resync operation
  • Address book queries

For the large folders, I think having a good distribution of thread sizes (so some messages not in threads, others collected in a 50+ message thread) is necessary. Slow performance in extra-large folders is something we routinely get criticized for, so being able to track regressions there would be useful. Tests that can adequately catch stupid things like "download a message fifteen times to display it" are extremely useful in my opinion, and I feel there need to be some performance tests that highlight problems in the IMAP code.
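
A synthetic "large" folder with a controlled thread distribution is easy to generate. The Python sketch here builds mbox text from a list of thread sizes, chaining replies through In-Reply-To and References. The function name, the addresses, the particular distribution, and the "From - " separator line are assumptions for illustration; no performance harness is implied.

```python
def build_mbox(thread_sizes):
    """Return mbox text containing one linear thread per entry in thread_sizes."""
    chunks = []
    for thread, size in enumerate(thread_sizes):
        ancestors = []
        for index in range(size):
            msg_id = "<t%d.m%d@test.invalid>" % (thread, index)
            headers = [
                "From - Tue Jul 10 00:00:00 2012",   # assumed separator line
                "Message-ID: " + msg_id,
                "From: tester@test.invalid",
                "Subject: " + ("Re: " if index else "") + "Thread %d" % thread,
            ]
            if ancestors:
                headers.append("In-Reply-To: " + ancestors[-1])
                # Fold the References header, one message ID per line.
                headers.append("References: " + "\n ".join(ancestors))
            body = "Message %d of thread %d\n" % (index, thread)
            chunks.append("\n".join(headers) + "\n\n" + body + "\n")
            ancestors.append(msg_id)
    return "".join(chunks)


# 200 unthreaded messages, 20 small threads, and one 60-message thread.
SIZES = [1] * 200 + [5] * 20 + [60]
folder_text = build_mbox(SIZES)
```

Scaling the list up (or skewing it toward a few huge threads) gives repeatable inputs for the folder-open and message-operation timings listed above.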

4 comments:

Anonymous said...

Joshua you should have titled that automated testing :-)

Thank you so much for giving me a state of things on the automation side.

Kent James said...

For my ExQuilla extension, I use an actual Exchange Server 2010 implementation in a virtual machine to run xpcshell tests against. Why couldn't you use an actual LDAP implementation for testing? After all, at the moment LDAP is read only, so you could manually populate the server with test cases.

Joshua Cranmer said...

Kent: The tests need to be runnable by developers on the major platforms and ideally without any external requirements (python client.py checkout should be all we need for tests). We also need the ability to drive tests; the setup of slapd (the only production LDAP server I have experience with) is not fun to configure with tests, especially given issues with process creation on Windows (I do not want to make setTimeout tests required for any new tests).

hyc said...

Eh? Look over the OpenLDAP test suite for examples, deploying a slapd with a canned config is a cinch.