Monday, July 16, 2012

Mozilla-central code coverage

I have been posting code-coverage results of comm-central off and on for several years. One common complaint I get from developers is that I don't run any of the mozilla-central testsuites. So I finally buckled down and built mozilla-central's coverage treemap (and LCOV output too).

The testsuites here are top-level check, mochitest-plain, mochitest-a11y, mochitest-ipcplugins, mochitest-chrome, reftest, crashtest, jstestbrowser, and xpcshell-tests, which should correspond to most of what Tinderbox runs (excluding Talos and some of the ipc-only-looking things). The LCOV output can break things down by testsuite, but the treemap still lacks this functionality (I only added support for multiple testsuites to the framework this afternoon while waiting for things to compile).

Caveats of course. This is an x86-64 Linux opt-without-optimizations build. This isn't my laptop, and X forwarding failed, so I had to resort to using Xvfb for the display (which managed to crash during one test run). It seems that some of the mochitests failed due to not having focus, and I have no idea how to make Xvfb give it focus, so not all mochitests ran. Some of the mozapps tests just fail generally because of recursion issues. So this isn't exactly what the tinderboxes run. Oh, and gcov consistently fails to parse jschuff.cpp's coverage data.

Lcov is also getting more painful to use—I finished running tests on Saturday night, and it took me most of Sunday and Monday to actually get the output (!!!). Fortunately, I've reimplemented most of the functionality in my own coverage analysis scripts, so the only parts missing are branch coverage data and the generated HTML index which I want to integrate with my web UI anyways.

Tuesday, July 10, 2012

Thunderbird and testing

Thunderbird has come a long way in its automated test suite since I started working on it 5 years ago. Back then, much of our code was untestable and it was rare that a patch added tests. Now, our code coverage results look like this. It is almost unthinkable to have a patch that doesn't have a test, and there are only a few places in our code where testing is impossible. Now I'm going to propose how to fill in these gaps.

LDAP

Ah, LDAP. The big red part of comm-central whenever I make my coverage treemaps. The problem here could be solved if we had an LDAP fakeserver; having written both IMAP and NNTP servers, that shouldn't be hard, right? Except that LDAP is not built on a text-based layer you can emulate with telnet, but on an over-engineered encoding standard called ASN.1, or more specifically one of its binary encodings. The underlying fakeserver technology was built on the assumption that I'm dealing with a CRLF-based protocol, but it turns out that, with some of my patches, it's actually easy to just pass the binary data straight through (yay for layering).
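
As a very rough sketch of that pass-through idea (this is standalone Node.js, not the actual fakeserver code, and the onBinaryData handler API and port number are invented for illustration), the point is simply that the server stops splitting on CRLF and hands raw buffers to a protocol handler:

    // Standalone Node.js sketch of binary pass-through (hypothetical
    // handler API, not the real fakeserver): no CRLF splitting, the
    // handler sees raw buffers and does its own framing.
    const net = require("net");

    function createBinaryServer(handler, port) {
      const server = net.createServer(function (socket) {
        socket.on("data", function (chunk) {
          // The handler is responsible for framing (e.g. finding BER
          // element boundaries) and returns the bytes to send back.
          const reply = handler.onBinaryData(chunk);
          if (reply)
            socket.write(reply);
        });
      });
      server.listen(port);
      return server;
    }

    // Trivial handler that just echoes whatever it received.
    createBinaryServer({ onBinaryData: function (buf) { return buf; } }, 10389);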

The full LDAP specification is actually quite complicated and relies on a lot of pieces, but the underlying model for an LDAP fakeserver could rather easily be controlled by just an LDIF file with perhaps a simplified schema model. At the very least, it's a usable start, and considering that the IMAP fakeserver still isn't RFC 3501-compliant 4 years later, it's good enough for testing.
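
To make that concrete, here is a sketch of the "LDIF file as the data model" idea; the LDIF text, attribute names, and the simplified parsing are purely illustrative:

    // Sketch: parse a trivial subset of LDIF (no continuation lines,
    // no base64 values) into plain objects a fakeserver could search.
    const ldif = [
      "dn: cn=Test User,dc=example,dc=com",
      "cn: Test User",
      "mail: test@example.com",
      "",
      "dn: cn=Other User,dc=example,dc=com",
      "cn: Other User",
      "mail: other@example.com",
    ].join("\n");

    function parseSimpleLDIF(text) {
      const entries = [];
      let current = null;
      text.split("\n").forEach(function (line) {
        if (!line) {
          current = null;
          return;
        }
        const colon = line.indexOf(": ");
        const attr = line.slice(0, colon);
        const value = line.slice(colon + 2);
        if (attr == "dn") {
          current = { dn: value, attributes: {} };
          entries.push(current);
        } else if (current) {
          current.attributes[attr] = value;
        }
      });
      return entries;
    }

    // parseSimpleLDIF(ldif) yields two entries that incoming search
    // requests could be matched against.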

Here, a big issue arises: the actual protocol decoding. I started by looking for a nice library I could use for ASN.1 decoding so I wouldn't have to do it myself. I first played with using the LDAP lber routines myself via ctypes, but I found myself dissatisfied with how much work it took just to parse the login of the LDAP server. I then looked into NSS's structured ASN.1 decoding, even happening upon a nice set of templates for LDAP so I didn't have to try to build them myself given the lack of documentation, but it still ended up not working well, especially given the nice model of genericity I was looking for. I played around with a node-based LDAP server (especially annoying given the current name feud in Debian that prevents the nodejs package from migrating to testing). It worked well enough for an initial test, but the problem of either driving the server from xpcshell or writing node shims, combined with the fact that it only processes the protocol and has no usable backend, caused me to give up on that path. Desperate, I even tried to find general BER-parsing libraries in JS on the web, and discovered that the ones that exist couldn't quite cope with the format as we use it.
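
For what it's worth, the core of BER is just nested tag-length-value triples, so a minimal reader is short. The sketch below is my own illustration (assuming single-byte tags and definite lengths), not any of the libraries mentioned above:

    // Minimal BER tag-length-value reader: assumes single-byte tags
    // and definite-length encoding, which covers the common cases.
    function readBER(buf, offset) {
      const tag = buf[offset++];
      let length = buf[offset++];
      if (length & 0x80) {
        // Long form: the low 7 bits give the number of length octets.
        const numOctets = length & 0x7f;
        length = 0;
        for (let i = 0; i < numOctets; i++)
          length = length * 256 + buf[offset++];
      }
      return {
        tag: tag,
        length: length,
        value: buf.slice(offset, offset + length),
        next: offset + length,
      };
    }

    // Constructed types (SEQUENCEs, and thus whole LDAP messages) just
    // nest, so their children come from re-reading the value buffer.
    function readChildren(tlv) {
      const children = [];
      for (let off = 0; off < tlv.value.length; ) {
        const child = readBER(tlv.value, off);
        children.push(child);
        off = child.next;
      }
      return children;
    }

    // Example: 0x04 0x02 0x68 0x69 is the BER encoding of the OCTET
    // STRING "hi".
    // readBER(Uint8Array.from([0x04, 0x02, 0x68, 0x69]), 0)
    //   => { tag: 4, length: 2, value: <0x68 0x69>, next: 4 }

The hard part, of course, is not the TLV walk itself but mapping tags back onto the LDAP operation definitions and coping with the messy edge cases.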

Conclusions: it's possible. The only real hard part is writing the BER parsing library myself. If anyone decides they want to work on this, I can send them the partial pieces of the puzzle to finish. If not, I'll probably nibble on this here and there over the next year or two.

MIME

MIME—that's well-tested per our testsuite, right? Well, not really. A lot of the testing is purely incidental: hooking the MIME library up to the IMAP fakeserver did a good job of flushing out a lot of issues, but there are also lots of small details that no one is going to notice (charsets come to mind). It turns out that MIME is one of those protocols where everybody does the same thing slightly differently, so you end up accumulating a lot of random fixes, and if you ever want to replace the module from scratch, you become terrified of introducing random regressions in real-world mail.

Perhaps unsurprisingly, there are no test suites for proper MIME parsing on the web. There is one for RFC 2231 decoding (kind of). But there's nothing that tries to determine any of the following:

  • Charset detection, especially who gets priority when everyone conflicts
  • Whether a part is inline, attached, or not shown at all
  • How attachments get detected and handled
  • The various crap that crops up when people fail at i18n
  • Text-to-HTML conversion or HTML sanitization issues
  • Identifying headers properly (malformed References headers, etc.)
  • Pseudo-MIME constructs, like TNEF, uuencode, BinHex, or yEnc
  • S/MIME or PGP

Issues relating to message display could be handled with a suite of reftests. A brief test confirms that reftest manifests accept absolute URLs, including the URLs that are used to drive the message UI (which even lets us test loading via the offline protocol). Reftests even allow you to set prefs before specific tests; with a bit of sugaring around the reftest list, a MIME reftest is easily doable. Attachment and header handling could also follow a MIME reftest design, but I'm not sure that is the best approach. I'd also like it to be the kind of test that other people who write MIME libraries could use.
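
To give a flavor of what such a list might look like, here is a hypothetical manifest fragment; the message URLs, reference file names, and pref choice are all made up for illustration, and the exact annotation syntax should be treated as a sketch rather than gospel:

    # Hypothetical MIME reftest manifest: every name here is illustrative.
    # Compare the rendering of a stored test message against a reference.
    == mailbox:///test/messages/inline-image?number=1  inline-image-ref.html
    # Flip a pref for a single test, e.g. forcing plaintext display.
    pref(mailnews.display.prefer_plaintext,true) == mailbox:///test/messages/alternative?number=2  plaintext-ref.html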

The main issue here is seeding the repository with something useful. Sampling a variety of Usenet newsgroups (especially foreign-language hierarchies) should pick up something useful for basic charset coverage, and I can get uuencode and yEnc by trawling through some binary newsgroups. For a focus on Gmail, I could probably pick up some Google Groups messages (especially if I recall the magic incantations that let me at actual RFC-822 objects). Random public mailing lists might turn up something useful too. My own private email is unlikely to provide any useful test cases, since I tend to communicate with too homogeneous an environment (i.e., I don't get enough people using Outlook). Sanitizing all of this public stuff is also going to be a pain, especially with the emails that have DKIM.

OS integration

OS integration is a nice header for everything that involves the actual OS: MAPI, import from standard system mail clients, and integration with system address books. Unfortunately, my main development environment is Linux, where we have none of this stuff, so I can't really claim that I have a plan for testing here. Thanks to bug 731877, at least testing Outlook Express importing is a possibility; true tests would probably require dumping some .psts into our tree, and we have no similar story for Mail.app. MAPI could be done with a mock app that exercises the MAPI interfaces; what it really comes down to is that we need to implement these APIs in a way that lets us exercise them in various mock environments during tests.

Performance tests

The other major hole we have is performance. Firefox measures its performance with things like Talos; Thunderbird ought to have a similar suite of performance benchmarks. What kinds of benchmarks are useful for Thunderbird, though? Modulo debates over where exactly to place the endpoints of each measurement, I think the following is a good list:

  • Startup and shutdown time
  • Time to open a "large" folder (maybe requiring a rebuild?) and memory usage in doing so
  • Doing message operations (mark as read, delete, move, copy, etc.) on several messages in a "large" folder. Possibly memory too
  • Time to select and display a "large" message (inline parts), as well as detach/delete attachments on said message
  • Cross-folder message search (with/without gloda?)
  • Some sort of database resync operation
  • Address book queries

For the large folders, I think having a good distribution of thread sizes (so some messages not in threads, others collected in a 50+ message thread) is necessary. Slow performance in extra-large folders is something we routinely get criticized for, so being able to track regressions there is something I think is useful. Tests that can adequately catch stupid things like "download a message fifteen times just to display it" are extremely valuable in my opinion, and some sort of performance test that highlights problems in the IMAP code would also be useful.
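
As a sketch of how lightweight such a harness could be (openLargeFolder below is a hypothetical stand-in for whichever operation from the list above is being measured), a Talos-style number is essentially a median over a few timed runs:

    // Minimal timing-harness sketch: run an operation several times and
    // report the median, the kind of number a perf suite would track.
    function timeOperation(name, operation, runs) {
      const samples = [];
      for (let i = 0; i < runs; i++) {
        const start = Date.now();
        operation();
        samples.push(Date.now() - start);
      }
      samples.sort(function (a, b) { return a - b; });
      const median = samples[Math.floor(samples.length / 2)];
      console.log(name + ": median " + median + "ms over " + runs + " runs");
      return median;
    }

    // Hypothetical usage, where openLargeFolder() would open a folder
    // with tens of thousands of messages:
    // timeOperation("open-large-folder", openLargeFolder, 5);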

Monday, July 9, 2012

Mozilla and Thunderbird

Let me start by saying that I have been contributing to Thunderbird for nearly 5 years. I don't have any secret knowledge; what I know that isn't public generally comes from talking to people in private messages on IRC. All the points I make here are my own thoughts and beliefs, and do not necessarily reflect those of the rest of Mozilla.

To say that the recent announcement on Thunderbird's future threw people in a tizzy would be an understatement. After all, we have nothing less than apocalyptic proclamations of the death of Thunderbird. I believe that such proclamations are as exaggerated as Samuel Clemens's death notices (apologies for making a joke that is probably inscrutable to non-en-US people).

The truth is, Thunderbird has not been a priority for Mozilla since before I started working on it. There really isn't any coordination in mozilla-central to make sure that planned "featurectomies" don't impact Thunderbird; we typically get the same notice that add-on authors get, despite being arguably the largest binary user of the codebase outside of mozilla-central. Given that the Fennec and B2G codebases were subsequently merged into mozilla-central (one of the arguments I heard for the Fennec merge was that "it's too difficult to maintain the project outside of mozilla-central") while comm-central remains separate, it quickly becomes clear how much apathy for Thunderbird existed prior to this announcement.

As a consequence, the community has historically played a major role in the upkeep of Thunderbird. The massive de-RDF project was driven by a lawyer-in-training. I myself have made significant changes to the address book, NNTP, testing, and MIME codes. Our QA efforts are driven in large part by a non-paid contributor. More than half of the top-ten contributors are non-employees, according to hg churn. So the end of purely-Thunderbird-focused paid developers is by no means the end of the project.

There's a lot of invective about the decision, so let me attempt to rationalize why it was made. Mozilla's primary goal is to promote the Open Web, which means, in large part, that Mozilla needs to ensure that it remains relevant in important markets to prevent the creation of walled gardens. I believe that Mozilla has judged that it needs to focus on the mobile market, which is where the walled gardens are starting to crop up again. In the desktop world, Mozilla has a strong browser and a strong email client, and maintaining that position is good enough. In the mobile world, Mozilla has virtually no presence right now. Hence all of the effort being put into Firefox Mobile and B2G.

Now, many of the decisions as to the future of the project are uncertain; unfortunately, the email laying all of this out was prematurely leaked. But it is clear that Thunderbird suffers from massive technical debt: when I was pondering which parts the Gaia email app might be able to leverage, I first considered the IMAP protocol implementation and then ran out of things to suggest. Well, maybe Lightning or the chat backends (for calendaring and IM, respectively), but it's clear that most of the mammoth codebase is completely unsuitable for reincorporation into another project. To this end, I think the most useful thing that could happen to Thunderbird falls under the "maintenance" banner anyways: replacing these crappy components with more solid implementations that are less reliant on maybe-obsolete Gecko features and that could be shared with the Gaia email app. As a bit of a shameless plug, I have been working on a replacement MIME parser with an explicit eye towards letting Gaia's app use it. Such work would be more useful than whining about the decision, in any case.