Wednesday, November 7, 2012

Autotools, how I hate thee

When writing custom passes for a compiler, it's often a good idea to try running them on real-world programs to assess things like scalability and correctness. Taking a large project and seeing your pass work (and provide useful results!) is an exhilarating feeling. On the other hand, trying to feed your compiler options into build systems is a good route to an insane asylum.

I complained some time ago about autoconf 2.13 failing because it assumes implicit int. People rightly pointed out that newer versions of autoconf don't assume that anymore. But new versions still come with their own cornucopias of pain. Libtool, for example, believes that the best route to linking a program is to delete every compiler flag from the command line except those it knows about. Even if you explicitly specify them in LDFLAGS. Then there's this conftest program that I found while compiling gawk:

/* Define memcpy to an innocuous variant, in case <limits.h> declares memcpy.
   For example, HP-UX 11i <limits.h> declares gettimeofday.  */
#define memcpy innocuous_memcpy

/* System header to define __stub macros and hopefully few prototypes,
    which can conflict with char memcpy (); below.
    Prefer <limits.h> to <assert.h> if __STDC__ is defined, since
    <limits.h> exists even on freestanding compilers.  */

#ifdef __STDC__
# include <limits.h>
#else
# include <assert.h>
#endif
#undef memcpy

/* Override any GCC internal prototype to avoid an error.
   Use char because int might match the return type of a GCC
   builtin and then its argument prototype would still apply.  */
#ifdef __cplusplus
extern "C"
#endif
char memcpy ();
/* The GNU C library defines this for functions which it implements
    to always fail with ENOSYS.  Some functions are actually named
    something starting with __ and the normal name is an alias.  */
#if defined __stub_memcpy || defined __stub___memcpy
choke me
#endif

int
main ()
{
return memcpy ();
  ;
  return 0;
}

…I think this code speaks for itself in how broken it is as a test. One part of my compiler pass asserted because memcpy wasn't being used in the right way, crashing the compiler. Naturally, this being autoconf, it proceeded to assume that I didn't have a memcpy and thus decided to provide me one, which causes later code to break in spectacularly bad fashion once you realize that memcpy is effectively a #define in modern glibc. And heaven forbid I should try to compile with -Werror in configure scripts (nearly every test program triggers a compiler warning along the lines of "builtin function is horribly misused").

The saddest part of all is that, as bad as autoconf is, it appears to be the least broken configuration system out there…

Thursday, August 16, 2012

Updated code coverage, now on try!

My expeditions in code coverage date back several years, but I am proud to report the biggest milestone to date. Instead of running all of these tests on various clusters I have access to (and running into problems with random test failures), I am now running them on the same servers that power Mozilla's build automation. How, you might ask? Read on to find out.

I've uploaded the results from running Linux64 debug builds (both LCOV results and my treemap application). My treemap application has gone through major revisions, too. It now uses CSS transitions instead of the framework's builtin transitions (no more slow script dialogs, and the animations are snappier, although Firefox 14 still chokes on the whole view and Firefox nightly appears to suddenly thrash memory). I've also tweaked the UI a bit to fix minor things that you probably wouldn't notice if I didn't point them out. Instructions on using the view are now in place (it turns out that treemaps aren't as intuitive as I feel them to be). You can also break down results by testsuite, although that feature was very hurriedly hacked in and has very obvious pain points. Code for this portion of the site can be found on github.

The other potentially interesting part is how I get this to work on try. I don't have this in a public repository yet, but you can follow along with the changes in the patch I pushed to try. This is how it works: first, I patch the mozconfigs to build with gcov instrumentation. Then I package up all of the notes (.gcno) files and include them in the tarball that contains firefox. Now is where the fun begins: at opportune places in all of the test suites, I output (to standard out) a base64-encoded tarball of all the data files collected during the run of the program. For test suites other than make check, I also need to munge the environment to set GCOV_PREFIX to a directory where I can find all of the data files again.
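In sketch form, the per-suite munging looks something like this (this is not the actual try patch; the command, paths, and marker strings are all made up for illustration):

import base64, io, os, subprocess, tarfile, tempfile

def run_with_coverage(cmd):
    # Point GCOV_PREFIX at a scratch directory so the .gcda files land
    # somewhere we can find them again afterwards.
    gcda_dir = tempfile.mkdtemp(prefix="gcov-data-")
    env = dict(os.environ, GCOV_PREFIX=gcda_dir)
    subprocess.call(cmd, env=env)

    # Pack every data file found under the prefix into an in-memory tarball.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(gcda_dir, arcname="gcda")

    # Emit it to standard out, between recognizable markers, so it ends up
    # in the automation log.
    print("==== BEGIN COVERAGE TARBALL ====")
    print(base64.b64encode(buf.getvalue()).decode("ascii"))
    print("==== END COVERAGE TARBALL ====")

run_with_coverage(["make", "mochitest-plain"])  # or whatever launches the suite

Printing to standard out is the whole trick: the log files are the one artifact you can always get back from the build machines.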

When the magic patch is pushed to try, and after it builds, I can now find in all of the output log files of the build a tarball (or several, in the case of mochitest-other) of all the coverage data. The steps after this are done manually, by dint of only getting the try stuff working last night. I download all of the log files, and extract the tarballs of coverage data into a directory. Then I pull the gcov note files into that directory and run the LCOV tools to collect data. I use ccov (my work-in-progress to replace LCOV) to combine all the test suites into a single file and then produce the data for the nice tree view. LCOV is used to produce the detailed per-file data. After everything is produced, I gather it all up and upload it.
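The log-scraping step is just the inverse of the dumping trick (again a sketch, matching the made-up markers above):

import base64, glob, io, tarfile

def extract_coverage(log_path, outdir):
    # Collect the base64 payload between the marker lines in one log file.
    payload, capturing = [], False
    for line in open(log_path):
        line = line.strip()
        if line == "==== BEGIN COVERAGE TARBALL ====":
            capturing = True
        elif line == "==== END COVERAGE TARBALL ====":
            capturing = False
        elif capturing:
            payload.append(line)
    if payload:
        data = base64.b64decode("".join(payload))
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            tar.extractall(outdir)  # .gcda files go next to the .gcno notes

for log in glob.glob("try-logs/*.txt"):  # wherever the downloaded logs live
    extract_coverage(log, "coverage-data")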

Where to from here? I want to automate the latter part of the process, as babysitting steps that take 10-20 minutes each is surprisingly unproductive. I also want to tackle a replacement for LCOV's HTML output, since it both takes forever to run (well over an hour) and produces output that is surprisingly lossy (you don't know which lines are covered by which tests, and weird corner cases in branch and function coverage could be presented better). Eliminating LCOV altogether is another goal, but I can live with it in the first data-collection step for now because gcov does really crazy stuff when collecting coverage. I also want to expand all of these tests to run on more than just one platform—ideally, I want code coverage for all test suites run on all platforms. Assuming, of course, releng doesn't decide to kill me first.

Friday, August 10, 2012

How to test new clang versions with try

Mac OS X builds of Firefox now use clang, and the work to make that switch happen necessitated testing individual versions of clang. As a result, we have infrastructure in place that makes it (relatively) easy to run tests on the try server with even patched versions of clang. Here's how you do it:

Step 1: Build your version of clang

This should be trivial, although one wrinkle is that you need to specify where the gcc-toolchain is located explicitly (/tools/gcc-4.5-0moz3 for now). If you're like me and lazy, you can just use the build-clang.py script to make the final tarball, after tweaking it to include your patch. Note that it expects to be located in specific directories (/builds/slave/moz-toolchain in particular). If you're building by hand, be sure to make a .tar.bz2 of the resulting clang install directory.

Step 2: Place packages in appropriate location

The next script expects things to be in particular places. It wants a directory layout that looks like this:

$ pwd
/path/ending/in/clang-<SVN revision>
$ ls
clang-darwin.tar.bz2 clang-linux32.tar.bz2 clang-linux64.tar.bz2

Step 3: Make manifests

This script is create-manifest.py, which requires simplejson 2.5 (newer than what mozilla-central's virtualenv python provides, alas). If you don't have a new enough python environment, just eliminate the item_sort_key=key_sort parameter and live with the fact that your output files will change more lines when imported into mozilla-central. This step produces files like darwin.manifest; these should be copied to browser/config/tooltool-manifests/platform/clang.manifest and the respective releng.manifest. It will also produce files with long, ugly filenames that look like someone dumped out a SHA512 hash as the filename (this is in fact what happens).
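If you are curious what the script effectively does, the following sketch captures the spirit of it (this is not create-manifest.py, and the exact field names tooltool wants are an assumption on my part; trust the real script's output over this):

import hashlib, json, shutil

def manifest_entry(tarball):
    data = open(tarball, "rb").read()
    digest = hashlib.sha512(data).hexdigest()
    shutil.copyfile(tarball, digest)  # the ugly SHA512-named file to upload
    return {"filename": tarball, "size": len(data),
            "digest": digest, "algorithm": "sha512"}

for platform in ("darwin", "linux32", "linux64"):
    entry = manifest_entry("clang-%s.tar.bz2" % platform)
    json.dump([entry], open("%s.manifest" % platform, "w"), indent=2)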

Step 4: Upload the files

First, copy the SHA512-named files somewhere public, like people.mozilla.org. Next, edit the mozconfigs to use clang instead of gcc. An example is here:

export CC="$topsrcdir/clang/bin/clang -fgnu89-inline"
export CXX=$topsrcdir/clang/bin/clang++

Disabling warnings-as-errors is also probably a good idea, since it seems that Linux people can't stop those extra semicolons from sneaking in. Then, push to try and hope for the best!

Monday, July 16, 2012

Mozilla-central code coverage

I have been posting code-coverage results of comm-central off and on for several years. One common complaint I get from developers is that I don't run any of the mozilla-central testsuites. So I finally buckled down and built mozilla-central's coverage treemap (and LCOV output too).

The testsuites here correspond to running top-level check, mochitest-plain, mochitest-a11y, mochitest-ipcplugins, mochitest-chrome, reftest, crashtest, jstestbrowser, and xpcshell-tests, which should correspond to most of the test suite that Tinderbox runs (excluding Talos and some of the ipc-only-looking things). The LCOV output can break things down by test results, but the treemap still lacks this functionality (I only built in multiple test support to the framework this afternoon while waiting for things to compile).

Caveats of course. This is an x86-64 Linux opt-without-optimizations build. This isn't my laptop, and X forwarding failed, so I had to resort to using Xvfb for the display (which managed to crash during one test run). It seems that some of the mochitests failed due to not having focus, and I have no idea how to make Xvfb give it focus, so not all mochitests ran. Some of the mozapps tests just fail generally because of recursion issues. So this isn't exactly what the tinderboxes run. Oh, and gcov consistently fails to parse jschuff.cpp's coverage data.

Lcov is also getting more painful to use—I finished running tests on Saturday night, and it took me most of Sunday and Monday to actually get the output (!!!). Fortunately, I've reimplemented most of the functionality in my own coverage analysis scripts, so the only parts missing are branch coverage data and the generated HTML index which I want to integrate with my web UI anyways.

Tuesday, July 10, 2012

Thunderbird and testing

Thunderbird has come a long way in its automated test suite since I started working on it 5 years ago. Back then, much of our code was untestable and it was rare for a patch to add tests. Now, our code coverage results look like this. It is almost unthinkable to have a patch that doesn't have a test, and there are only a few places in our code where testing is impossible. Now I'm going to propose how to fill in these gaps.

LDAP

Ah, LDAP. The big red part of comm-central whenever I make my coverage treemaps. The problem here could be solved if we had an LDAP fakeserver; having written both IMAP and NNTP servers, this shouldn't be hard, right? Except that LDAP is not built on a text-based layer that you can emulate with telnet; it sits on top of an over-engineered format called ASN.1, or more specifically one of its binary encodings. The underlying fakeserver technology is built with the assumption that I'm dealing with a CRLF-based protocol, but it turns out that, with some of my patches, it's actually easy to just pass through the binary data (yay for layering).

The full LDAP specification is actually quite complicated and relies on a lot of pieces, but the underlying model for an LDAP fakeserver could rather easily be controlled by just an LDIF file with perhaps a simplified schema model. At the very least, it's a usable start, and considering that the IMAP fakeserver still isn't RFC 3501-compliant 4 years later, it's good enough for testing.

Here, a big issue arises: the actual protocol decoding. I started by looking for a nice library I could use for ASN.1 decoding so I don't have to do it myself. I first played with using the LDAP lber routines myself via ctypes, but I found myself dissatisfied with how much work it took just to parse the login of the LDAP server. I then looked into NSS's structured ASN.1 decoding, even happening upon a nice set of templates for LDAP so I didn't have to try to build them with the lack of documentation, but it still ended up not working well, especially given the nice model of genericity I was looking for. I played around with a node-based LDAP server (especially annoying given the current name feud in Debian that prevents the nodejs package from migrating to testing). It worked well enough for an initial test, but the problem of either driving the server from xpcshell or writing node shims, combined with the fact that it only processes the protocol and has no usable backend, caused me to give up that path. Desperate, I even tried to find general BER-parsing libraries in JS on the web and discovered that the ones that exist couldn't quite cope with the format as we use it.

Conclusions: it's possible. The only real hard part is writing the BER parsing library myself. If anyone decides they want to work on this, I can send them the partial pieces of the puzzle to finish. If not, I'll probably nibble on this here and there over the next year or two.
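For the curious, the crux of the missing piece is BER's tag-length-value framing, which is not much code on its own. A minimal sketch (in Python for brevity; the real thing would live in the JS fakeserver, and long-form tags and indefinite lengths are ignored here):

def read_tlv(buf, pos=0):
    tag = buf[pos]              # class/constructed bits plus the tag number
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:           # long form: the low bits say how many
        nbytes = length & 0x7f  # length octets follow
        length = int.from_bytes(buf[pos:pos + nbytes], "big")
        pos += nbytes
    return tag, buf[pos:pos + length], pos + length

# An LDAP message is a SEQUENCE (tag 0x30) wrapping a message ID and the
# operation, so a decoder is basically read_tlv applied recursively.
tag, value, end = read_tlv(bytes.fromhex("300c020101600702010304008000"))
assert tag == 0x30

The fiddly part is everything layered above this framing, which is why a proper library would still be nice.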

MIME

MIME—that's well-tested per our testsuite, right? Well, not really. A lot of the testing is purely incidental: hooking the MIME library up to the IMAP fakeserver did a good job of fleshing out a lot of issues, but you can also find lots of small details that no one is going to notice (charsets come to mind). It turns out that MIME is one of those protocols where everybody does the same thing slightly differently, and you end up accumulating a lot of random fixes to MIME. If you want to replace the module from scratch, you become terrified of finding random regressions in real-world mail.

Perhaps unsurprisingly, there are no test suites for proper MIME parsing on the web. There is one for RFC 2231 decoding (kind of). But there's nothing that tries to determine any of the following:

  • Charset detection, especially who gets priority when everyone conflicts
  • Whether a part is inline, attached, or not shown at all
  • How attachments get detected and handled
  • Test suites for the various crap that crops up when people fail at i18n
  • Text-to-html or HTML sanitization issues
  • Identifying headers properly (malformed References headers, etc.)
  • Pseudo-MIME constructs, like TNEF, uuencode, BinHex, or yEnc
  • S/MIME or PGP

Issues relating to message display could be handled with a suite of reftests. A brief test confirms that reftest specifications accept absolute URLs, including the URLs that are used to drive the message UI (this can even test it from loading the offline protocol). Reftests even allow you to set prefs before specific tests; with a bit of sugaring around the reftest list, a MIME reftest is easily doable. Attachment and header handling could also follow a MIME reftest design, but I'm not sure that is the best design. I'd also like it to be the kind of test that other people who write MIME libraries could use.

The main issue here is seeding the repository with something useful. Sampling a variety of Usenet newsgroups (especially foreign-language hierarchies) should pick up something useful for basic charset, and I can get uuencode and yEnc by trawling through some binary newsgroups. For a focus on gmail, I could probably pick up some Google Groups things (especially if I recall the magic incantations that let me at actual RFC-822 objects). Random public mailing lists might find something useful. My own private email is unlikely to provide any useful test cases, since I tend to communicate with too homogeneous an environment (i.e., I don't get enough people using Outlook). Sanitizing all of this public stuff is also going to be a pain, especially with the emails that have DKIM.

OS integration

OS integration is a nice header for everything that involves the actual OS: MAPI, import from standard system mail clients, integration with system address books. Unfortunately, my main development environment is Linux, where we have none of this stuff, so I can't really claim that I have a plan for testing here. Thanks to bug 731877, at least testing Outlook Express importing is a possibility, but true tests would probably require dumping some .psts into our tree, and we have no similar story for Mail.app. MAPI could be done with a mock app that exercises the MAPI interfaces; what it really comes down to is that we need to implement these APIs in a way that we can test them by executing in various mock environments during tests.

Performance tests

The other major hole we have is performance. Firefox measures its performance with things like Talos; Thunderbird ought to have a similar testsuite of performance benchmarks. What kind of benchmarks are useful for Thunderbird testing though? Modulo debates over where exactly to place the endpoints on the following tests, I think the following is a good list:

  • Startup and shutdown time
  • Time to open a "large" folder (maybe requiring rebuild?) and mem usage in doing so
  • Doing message operations (mark as read, delete, move, copy, etc.) on several messages in a "large" folder. Possibly memory too
  • Time to select and display a "large" message (inline parts), as well as detach/delete attachments on said message
  • Cross-folder message search (with/without gloda?)
  • Some sort of database resync operation
  • Address book queries

For the large folders, I think having a good distribution of thread sizes (some messages not in threads, others collected in a 50+ message thread) is necessary. Slow performance in extra-large folders is something we routinely get criticized for, so being able to track regressions there is useful. Tests that can adequately catch stupid things like "download a message fifteen times to display it" are extremely useful in my opinion, and I feel like some sort of performance test that highlights problems in the IMAP code would be useful as well.

Monday, July 9, 2012

Mozilla and Thunderbird

Let me start by saying that I have been contributing to Thunderbird for nearly 5 years. I don't have any secret knowledge; what I know that isn't public is generally had just by talking to people in private messages on IRC. All points I make in here are my own thoughts and beliefs, and do not necessarily reflect the rest of Mozilla.

To say that the recent announcement on Thunderbird's future threw people in a tizzy would be an understatement. After all, we have nothing less than apocalyptic proclamations of the death of Thunderbird. I believe that such proclamations are as exaggerated as Samuel Clemens's death notices (apologies for making a joke that is probably inscrutable to non-en-US people).

The truth is, Thunderbird has not been a priority for Mozilla since before I started working on it. There really isn't any coordination in mozilla-central to make sure that planned "featurectomies" don't impact Thunderbird—we typically get the same notice that add-on authors get, despite being arguably the largest binary user of the codebase outside of mozilla-central. Given that the Fennec and B2G codebases were subsequently merged into mozilla-central (one of the arguments I heard for the Fennec merge was that "it's too difficult to maintain the project outside of mozilla-central") while comm-central remains separate, it should quickly become clear how much apathy for Thunderbird existed prior to this announcement.

As a consequence, the community has historically played a major role in the upkeep of Thunderbird. The massive de-RDF project was driven by a lawyer-in-training. I myself have made significant changes to the address book, NNTP, testing, and MIME codes. Our QA efforts are driven in large part by a non-paid contributor. More than half of the top-ten contributors are non-employees, according to hg churn. So the end of purely-Thunderbird-focused paid developers is by no means the end of the project.

There's a lot of invective about the decision, so let me attempt to rationalize why it was made. Mozilla's primary goal is to promote the Open Web, which means in large part, that Mozilla needs to ensure that it remains relevant in markets to prevent the creation of walled gardens. I believe that Mozilla has judged that it needs to focus on the mobile market, which is where the walled gardens are starting to crop up again. In the desktop world, Mozilla has a strong browser and a strong email client, and maintaining that position is good enough. In the mobile world, Mozilla has virtually no presence right now. Hence all of the effort being put into Firefox Mobile and B2G right now.

Now, many of the decisions as to the future of the project are uncertain; unfortunately, the email laying all of this out was prematurely leaked. But it is clear that Thunderbird suffers from massive technical debt: when I was pondering parts that the Gaia email app might be able to leverage, I first considered the IMAP protocol implementation and then ran out of things to suggest. Well, maybe lightning or the chat backends (for calendaring and IM, respectively), but it's clear that most of the mammoth codebase is completely unsuitable for reincorporation into another project. To this end, I think the most useful thing that could happen in Thunderbird falls under the "maintenance" banner anyways: a replacement of these crappy components with more solid implementations that are less reliant on maybe-obsolete Gecko features and that could be shared with the Gaia email app. As a bit of a shameless plug, I have been working on a replacement MIME parser with an explicit eye towards letting Gaia's app use it. Such work would be more useful than whining about the decision, in any case.

Wednesday, June 27, 2012

Building mozilla with clang LTO

Ingredients:

  • 1 (one) copy of clang, preferably on the new side
  • 1 (one) source tree containing mozilla-central
  • 4 (four) GiB (that's 2³⁰, not 10⁹) of memory
  • Linux (these instructions have not been tested on OS X, so they may or may not work for you)

Instructions:

For the sake of example, I have clang installed to /usr/local; substitute as appropriate for your environment. Configure mozilla-central with the following mozconfig:

# -O4 tells clang to do LTO
ac_add_options --disable-debug --enable-optimize='-O4'
# Exclude this line, and you will use over 10x the amount of memory.
ac_add_options --disable-debug-symbols
# Build with clang
export CC=/usr/local/bin/clang
export CXX=/usr/local/bin/clang++
# These flags are necessary to get elfhack to compile
export LDFLAGS='-Wl,-z,norelro -flto'
# This flag is necessary to link properly
export RANLIB='ar -s --plugin /usr/local/lib/LLVMgold.so'

You may season with other configure options as desired. Configure and let build for a few hours. Admire the build process as most individual files get compiled much quicker, but be displeased when link times explode. Enjoy!

Friday, June 8, 2012

JS code coverage

My most recent adventure has been an extended foray into the depths of getting good JavaScript code coverage. The tool I had been pointed to a few years ago was JSCoverage, which takes the approach of "reserialize all the JS code with instrumentation added," which doesn't work so well with the ever-shifting landscape of Mozilla's JS implementation. It also used SpiderMonkey's internal-only parser API, which, from experience, is a bloody edge to ride on. I've dithered on and off about using the reflection API instead, but making sure that data can get extracted out again is a bit of a pain.

Now it's 2012 and the JS engine has better hooks that allow you to use debugging APIs without causing the code to come to a screeching halt in terms of performance (with the proliferation of setTimeout-based tests, drastically slowing down JS execution causes a surprising number of tests to fail). So I started by using a modified version of Gregory Szorc's patch for development, but that immediately ran into several bugs with the JS debugging API. It also requires major contortions due to current limitations of the API. After complaining enough in #jsapi, I was pointed to some bytecode counter hooks enabled in the JS API. After finding and temporarily working my way around more bugs (I've been told that most of this stuff isn't necessary with compartment-per-global, but no one has been ripping this stuff out yet…), I produced a patch (see the previously linked bug) which hooks into xpconnect and automatically does code coverage for all JS run on the xpconnect runtime.

Since the pc-count hooks just dump out raw hit counts, I need further processing to make them useful; this comes in the form of a post-processing script. There is also the related issue that counts can't be generated for scripts the engine never sees, so there's also logic in my ever-growing code-coverage building script to compute all functions that get run. This is the most fragile part of the process right now, as many things can cause it to break. In addition, I ended up writing a partial replacement for lcov in python, since I was getting extremely tired of its poor performance. This goes so far as actually reading the gcno and gcda files output by gcc itself, since gcov has problems reading one of them.
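The shape of that post-processing is simple enough to sketch (the dump format here is made up; only the lcov .info records at the end are real):

import collections, glob, json

totals = collections.defaultdict(collections.Counter)
for dump in glob.glob("pccounts-*.json"):
    # Each dump maps script URL -> {line number: hit count}.
    for script, lines in json.load(open(dump)).items():
        totals[script].update({int(l): c for l, c in lines.items()})

with open("js-coverage.info", "w") as out:
    for script, lines in sorted(totals.items()):
        out.write("SF:%s\n" % script)
        for line, hits in sorted(lines.items()):
            out.write("DA:%d,%d\n" % (line, hits))
        out.write("end_of_record\n")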

So the final result is posted now, and, like usual, I have a treemapified version as well. In that, I went ahead and implemented a feature which lets the URI specify the directory to start in, so you don't have to bear with several long script dialog warnings to view into a specific directory.

Tuesday, May 29, 2012

More code coverage treemap fun

Quite some time ago, I built a treemap visualization of code coverage. It was inspired by the information visualization course I was taking at the time, and I have since found a less static version more useful for getting a feel for code coverage than lcov's index pages. Back then, two years ago, the only decent toolkit I found for building treemaps was in Java, which would necessitate the use of applets if I tried to put it on the web. Since most of what I produced was just static images, using Java locally wasn't an issue. Then again, I'm also a contributor to an organization that is pushing hard on being able to do things in pure JS. So I thought I might as well try to replicate it using a JS toolkit.

Two years have made quite a difference in terms of JS toolkits. Back in 2010, the only JS toolkit I found that had treemaps was protovis, which I found fairly unusable. Now there are a few more higher-quality toolkits, but they've all had a fairly major problem of forcing you to shoehorn your data into a specific format instead of letting you customize how it is presented. The JavaScript InfoVis Toolkit is the one that made me happy with the results but annoyed trying to get it to work. Finally, I settled on D3, which is still too low-level to make me happy (don't ask me how painful getting animations working was).

The end result is here (the equivalent lcov result is also available). Every rectangle represents a single file of source code. Hover over it, and you get details of coverage in that file. Click on it, and you zoom in one level down in the source code hierarchy. If you ctrl-click, you zoom one level back. The size of a rectangle is indicative of how large the file is (you can configure it to be by number of lines or number of functions). Its color is indicative of what percentage of lines (or functions) is covered: a red color means very few are executed in tests, while a green color means most are. There is a scale that indicates what a given percentage of coverage looks like, and there is even internal code to adjust the scale (so as to make the muddy-red midpoint not 50% but more like 70%), but this is disabled since the scale is more difficult to change. Changing parameters causes the treemap to animate to its new position, but the sheer number of elements here appears to give Gecko heartburn—it's much smoother if you zoom in to smaller directories! I do realize that this also results in a lot of boxes flying around each other, but a squarified treemap is unfortunately a very unstable algorithm.

The technology that underlies the visualization isn't complicated. I used lcov to produce a summary file, and then used a python script to convert the file into a summarized JSON version. Tooltips are provided using tipsy, which I don't particularly like all that much, but I haven't found anything that is as easy to use while also servicing my needs (most importantly, placing tooltips so they don't fall off the edge of the screen). As an aside, a visualization toolkit that doesn't have any tooltip support is extremely bad—details-on-demand is a critical component of information visualization.
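The conversion is nothing fancy; a sketch of the idea (not the actual script) is to walk the lcov records and roll the per-file counts up into a nested tree:

import json

def lcov_to_tree(info_path):
    root = {"name": "", "children": {}, "lines": 0, "covered": 0}
    path, lines, covered = None, 0, 0
    for raw in open(info_path):
        line = raw.strip()
        if line.startswith("SF:"):            # start of one file's record
            path, lines, covered = line[3:], 0, 0
        elif line.startswith("DA:"):          # DA:<line number>,<hit count>
            hits = int(line.split(",")[1])
            lines += 1
            covered += 1 if hits > 0 else 0
        elif line == "end_of_record" and path:
            # Add this file's totals to every directory on its path.
            node = root
            node["lines"] += lines
            node["covered"] += covered
            for part in path.strip("/").split("/"):
                node = node["children"].setdefault(
                    part, {"name": part, "children": {}, "lines": 0, "covered": 0})
                node["lines"] += lines
                node["covered"] += covered
            path = None
    return root

json.dump(lcov_to_tree("coverage.info"), open("coverage.json", "w"))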

There are some things I don't have: I don't scrape branch coverage yet, for example. And having a query parameter that lets you zoom into a specific directory immediately would be helpful, as would being able to cut off the tree before the leaves. I would also like to see a nice view of a single file, so I can completely kick the lcov output. Being able to select individual files would also be a boon. Another neat feature would be to select from multiple sources (e.g., different test suite coverages or even monitoring how code coverage changes through the years!).

As a postscript, let me give you a tour of top-level directories in the view. The tiny sliver in the upper-left is, from top-to-bottom, the C++ source code of Thunderbird-specific code, lightning, and LDAP. Then there is a sizable block for mork, and the rest of the left column is mailnews code. Then there is a barely-noticeable column and another small but noticeable column which contain between them a lot of small top-level directories in mozilla. The next column (similar in width to the comm-central column) contains, from top-to-bottom, IPC, spellcheck, widget, security, accessibility, parser, and xpcom. The next column contains editor, toolkit, network, and DOM code. The right 40% of the diagram or so lacks definable columns. The upper-left of this portion (you'll notice it by the large red block with a lot of "S" letters peeking out) is the graphics code. To its right lies JS. Beneath both of these lies the code for layout, and beneath that and at the bottom is content.

And, no, my code is not on github yet because most of everything that matters is in that one HTML file. I also plan to integrate this view into DXR at some point in time.

Monday, May 28, 2012

A tour of libmime

Libmime (the C++ version, that is) exposes three services, not counting the tight coupling it shares with compose code. These three services are one that extracts headers (which many people probably didn't know existed), one that parses structured headers (in two places), and everything else. The first two are fairly straightforward to understand, so it's the complexity of the last one that is worth talking about here.

The first thing the astute programmer will notice is the apparent lack of a way to drive the MIME converter. This is because the converter also implements nsIStreamConverter, kind of. The MIME converter uses the fact that it derives from nsIStreamListener to receive data, and it uses the Init method to set itself up (while effectively ignoring half its parameters). There is logic to extract the URL to load from the context parameter, and this is where the fun and games begin. If no output type is manually selected, it is sniffed from the URL as follows:

  1. No URL results in quoting
  2. Finding part= (unless the desired type is application/vnd.mozilla.xul+xml) causes the output type to be raw, unless a type= parameter is present, which induces body display and uses that type for the stream's content-type.
  3. Finding emitter=js causes a special case that is used to select gloda's JS MIME emitter
  4. Finding header= does a lookup for the special mime output type; the values possible are filter, quotebody, print, only (uses header display), none (uses body display), quote, saveas, src, and raw, most of whose mappings to the appropriate value in nsMimeOutput are fairly obvious.
  5. The default, if none of the above match, is to display the body.
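Restated as a sketch (the names here are my own shorthand for the nsMimeOutput values, and the query parsing is simplified to a dict; the real logic lives in the converter's C++):

def sniff_output_type(url, query, desired_type=None):
    if url is None:
        return "quoting"                                        # rule 1
    if "part" in query and desired_type != "application/vnd.mozilla.xul+xml":
        return "body-display" if "type" in query else "raw"     # rule 2
    if query.get("emitter") == "js":
        return "gloda-js"                                       # rule 3
    if "header" in query:                                       # rule 4
        return {"only": "header-display", "none": "body-display"}.get(
            query["header"], query["header"])  # filter, quotebody, print, ...
    return "body-display"                                       # rule 5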

With the URL and output format settled, the next step is choosing the emitter. The gloda hack obviously results in the gloda JS MIME emitter, and editor templates and drafts don't use an emitter, but, otherwise, either the raw emitter (attach, decrypt, raw, and source), XML emitter (header display), or HTML emitter (everybody else) is used.

After the emitter is initialized and constructed, a bridge stream is set up to actually pump the data into libmime. The output is handled differently in the case of things that get sent to compose and things that get driven by display. The former merely passes through some more options and then decomposes attachments into temporary files, while the latter is more interesting and complicated. The URL is parsed—again—and the following parts are noticed. If headers= is present and its value is one of only, none, all, some, micro, cite, or citation, then opts->headers is set to signify the choice. The presence of part= sets opts->part_to_load, while rot13 enables rot13_p (did you know TB had that feature?). Some more hacks are enabled if emitter=js is present, and then the part number is checked to see if it might be an attempt to use the old (Netscape 2) MIME part numbering scheme. After all of this is done, we set up charset overrides for the message if they need to be set up, and then we can actually start pumping data.

The guts of libmime that actually do the parsing lie in its polymorphic object system. The text initially gets fed to an instance of MimeMessage, but once the headers of one of these containers get parsed, the body is fed through another object. Going from a content-type to an object takes several lines of code, but a brief summary of the decisions should suffice. If no Content-Type header is present, the code chooses a default (ultimately, for children of MimeMessage, the default is untyped text, which sniffs for binhex, yEnc, and uuencode). Some parts can have their content-type sniffed from a found filename instead. The extension point that allows extensions to add their own content-type handlers gets run before anyone else, but, as far as I can tell, only enigmail leverages this. There is another subtle extension hook which allows people to convert any type to HTML (Lightning uses this). Some types are simple and always go to the same class: message/rfc822 and message/news both result in a MimeMessage, while message/external-body creates a MimeExternalBody. Other types get dispatched based mostly on preferences: text/html and multipart/* are the biggest cases here. Everything else (note that text/* defaults to text/plain and multipart/* to multipart/mixed) becomes a MimeExternalObject, and a paranoia preference can also turn half of the classes into a MimeExternalObject.

After creating classes, the next phase is the actual parsing logic. And this is where I'm stopping the tour for now. I've got more precise information laid out on virtual sticky notes on my laptop, but I don't think it's worth posting what amounts to several large switch statements in a brief tour.

Sunday, May 6, 2012

A new JS MIME parser

Here is one of those guessing games whose answers will depress you: how many (partial) MIME parsers do we have in Thunderbird? If you guessed zero, ha-ha, very funny; now go sit in the corner. If you guessed one, you've obviously never worked with libmime before. The actual answer depends on how liberal you want to be. On the conservative end, there are no fewer than 5 parsers which go at least as far as parsing multipart messages (including two in a single file). If you choose to go liberally, counting everybody who attempts to split header values to look for a specific value, you gain an additional six that I am aware of… and I have no doubt that more are lurking behind them. I suppose this means that we now have more implementations of message header parsing than we do base64-decoding (although nowadays, it seems like most base64-decoding got stuffed into one method with several wrappers). The complete list as I know it:

  • nsMsgBodyHandler
  • nsMsgDBFolder::GetMsgTextFromStream
  • libmime
  • IMAP fakeserver (one for body parts, one for body structure, and another spot happily hunts down headers)
  • NNTP fakeserver (hunts down headers)
  • nsParseMailbox
  • nsNNTPProtocol (hunts down headers)
  • nsMsgLocalStoreUtils (hunts down headers)
  • Somewhere in compose code. Probably, although it's not clear how much is being funneled back to libmime and how much is not.
  • A class in necko implements the RFC 2231 and RFC 2047 decoding.

Well, it's time for me to increment that count by one. And then decrement it by at least three (two of the IMAP and the NNTP fakeserver decoders are switched over in a queued patch, and I hope to get the third IMAP fakeserver decoder switched over too). I've been working on a new JS MIME parser for Thunderbird a bit at a time over the past several months, and I have local patches that start using it in tests (and as a replacement for nsIMimeHeaders).

So, why replace libmime instead of consolidating everyone onto it? The short answer is because libmime is a festering pile of crap. It uses an internal object system that can be summarized as "reimplementing C++ in C" and has existed largely in the same form since at least Netscape 5 (the Mozilla classic version gives a date of May 15, 1996). All of that would be manageable were it not for the fact that the architecture is clearly broken. Reliably parsing multipart MIME messages is not a trivial task (indeed, not only does one of the above do it incorrectly, but there is actually a test which may rely on it being wrong. Guess which one it is.), and the fact that so many people have variants on it is a clear indication that the API fails to expose what it ought to expose. This means that the code is in need of change, and the implementation is a form which makes changing things extremely difficult.

The choice to do it in JS was motivated mostly by Andrew Sutherland. There has been a long-term ideal in Thunderbird, dating back to at least around 2008, to move more implementation of core code into JS, which would help avoid massive churn spurred on by mozilla-central; nowadays, there is the added benefit that it would aid in efforts like B2G or porting to platforms where native code is frowned upon. MIME code is also a natural candidate, being extremely self-contained (charset conversion and S/MIME encryption make up the biggest dependencies in the core parser). As of my current implementation, the only external dependency that the MIME parser has is atob, although charset conversion (via the proposed string encoding API) will be added when I get there. In other words, this code is usable by anyone who wants to write a mail client in JS, not just the internal mailnews code.

Another advantage to writing my own library is that it gives me a lot of time to lay out the specifications in clearer terms. One corner case called out in the spec is how to handle seeing boundaries for the non-leaf-most part. My own brain went further and started musing on non-leaf parts getting QP or base64 content-transfer-encodings (which RFC 6532 went ahead and allowed anyways), or multiple nested parts having the same boundary (which the specification, in my view, hints at resolving in a particular fashion). Other discoveries include the fact that most MIME parsers do not permit as much CFWS as the specification indicates could be present ("Content-Type: multipart / mixed" would be unusable in every MIME parser source I read). I also produced a list of all the specifications that the parser will need to refer to that I have found so far (13 RFCs and a handful of non-RFCs, plus 9 obsoleted RFCs that may still warrant viewing). As for size? The JS implementation is about 600-700 lines right now (including over 300 lines of comments), while the equivalent C++ portions take over a thousand lines to do more or less the same thing.
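To make the boundary corner cases concrete, here is the naive line-based splitting everyone starts from (Python standing in for the JS; this is a sketch, not the parser itself):

def split_multipart(body, boundary):
    delim, closing = "--" + boundary, "--" + boundary + "--"
    parts, current, in_part = [], [], False
    for line in body.split("\r\n"):
        if line == delim or line == closing:
            if in_part:                   # finish the previous part
                parts.append("\r\n".join(current))
                current = []
            in_part = (line == delim)     # the closing delimiter ends it all
        elif in_part:
            current.append(line)
    return parts

The corner cases above (a delimiter line that actually belongs to an enclosing part, or nested parts sharing a boundary string) are exactly the ones this naive version gets wrong, which is why they deserve to be written down.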

As I said earlier, one of the problems with libmime is its oppressive architecture. It is primarily designed to drive the message pane while living as a streaming converter. However, it encompasses roughly three steps: the production of the raw MIME tree (or a tree that combines the raw MIME data with pseudo-MIME alternatives), conversion of that tree into a more traditional body-and-attachments view, and then the final driving of the UI. Getting it to stop any earlier is somewhere between difficult and impossible; what I have now, once it gets some more testing, can satisfy people who just need the first step. Doing the second step would require me to sit down and document how libmime makes its choices, which is not a task I look forward to doing.

Wednesday, March 7, 2012

Hate speech? Really?

One of my goals in life is to participate in a true political debate. Not a shouting match, but a true debate where both sides argue their points reasonably and, crucially, are open to the fact that they may be holding the wrong ideas and willing to change their beliefs. The Internet, allowing for an easy way to facilitate discussions across a very large, very diverse group of participants ought to have made this easier; alas, it seems that the past few years has only increased the vitriol and spitting bile in these shouting matches.

When I read news articles, sometimes I think "this sounds thought-provoking", and I turn to the online comments to see if it started one. Almost invariably, most of the comments turn out to be rants that are only distantly germane to the article (e.g., any article on the Middle East sets off a firestorm with ... well, you can guess; any article on Ron Paul sets off commenters screaming about how he's being "ignored" by the media, etc.). Sometimes, the ones I think are most thought-provoking have no comments (typically any article that mentions Africa). Go figure.

It is also surprising how many people you think ought to know better end up doing nothing but shouting. I went to a well-regarded selective-admissions public high school (the mean score for the standard college-admissions test in the US is not far from perfect) (and no, I did not get a perfect score on said test), but when we had a sponsored debate between the Democratic and Republican groups in our school, it still devolved into a (literal) shouting match.

Which brings me, in a roundabout manner, to the true topic of this post. Apparently, there is a firestorm currently brewing over a post on planet.mozilla.org that dealt with a political topic of current interest to some people. I discovered, years ago, that you catch a lot more flak if you write a post that goes to an aggregator that people disagree with than if you write one they agree with (I explained why I did not like the results of the 2008 presidential election, and caught more flak than the few posts earlier celebrating those results). I don't mind political discussion—indeed, as I described earlier, I'd love one. But I can understand why people might not like sensitive political topics being collected on a site which can be interpreted as official Mozilla policy.

However, there is one thing I don't like about it. Some people have taken it upon themselves to call the viewpoint presented hate speech. I don't know exactly what hate speech is, but I don't think the term should exist, no matter how one defines it. That is because terming something hate speech is an attempt to avoid discussing it: it is a way to say "that statement is so wrong I am not going to bother to refute it," or, equivalently, "I am so right that any opposition to my idea cannot be rationally argued." It is at best an insult to one's opponent and at worst an admission of outright arrogance.

Even worse, I think, are people who want to outlaw hate speech. Speech is, fundamentally, a reflection of one's belief. Attempting to outlaw the utterance of specific beliefs is nothing more than trying to outlaw those beliefs in the first place. Sure, some of them are injurious insults (such as those who deny that the Holocaust happened), but there is, to me, a steep slippery slope you start out on. Why outlaw denials of the Holocaust but not also outlaw denials of other historical religiously-motivated genocides? Hell, why stop there? We could outlaw any false statement: think how much more rational our discussions would be if people didn't state such lies as "Sharia law is invading the US legal system" or "the New Deal did nothing to bring us out of the Great Depression." Wait, maybe that's why.

I remember, when doing a project for a US Government class that involved redebating several key freedom-of-speech cases, that I found a Supreme Court decision which said something to the effect of "free speech is at its best when it causes discomfort" (I cannot find the exact quote right now, sadly enough). At that moment, I think, I solidified my stance on free speech to the following: all speech ought to be free, no matter how insulting, radical, or slanderous it may be. It may be that your speech causes injurious actions, but it's only the intent and the action that matter (e.g., falsely yelling "Fire!" in a crowded theater is grounds for manslaughter, but that is because the action was intended to create a stampede, not because you actually said "Fire"). Attempting to censor others' opinions—whether through legal means, or through attempts to shame them by calling their words "hate speech"—is not an action I can support.

And if you disagree with me, feel free to try to discuss it with me. Who knows, I may just change my mind.

Monday, March 5, 2012

How to use bugpoint

One of the problems with working with compilers is that you tend not to find some bugs until they get used on real code. And unlike most test suites, real code can involve some very large files, which is where you tend to find bugs. As luck would have it, my debugging fun today led to two bugs being found in the same function... which is only a 200 line function that hides infinite loops in macros and is written using continuations to boot. The LLVM IR for this function, after being compiled with -O3, was 4500 lines (and one block had 87 predecessors and several ϕ instructions as well). At such a size, finding out why my code crashes on such a function is impossible, let alone figuring out which optimizations I need to blame for it.

Fortunately, LLVM has a tool called bugpoint which can reduce this IR into a more manageable size of work. Doing manual reduction via an iterative process of "this doesn't look necessary; cut it out" the first time took me about an hour to produce a pile of code small enough to actually analyze. Doing it via bugpoint on the second bug took closer to 30 minutes. Unfortunately, the hard part is figuring out how to actually use the tool in the first place: none of the manuals give an example command line invocation, and they start playing games of "look at that documentation over there". So, I am going to remedy this situation by actually giving functional documentation.

In this case, I have an assert in an LLVM pass that is being triggered. This pass isn't being run as part of opt, but rather its own tool that takes as input an LLVM IR file. So the first step is to get the IR file (clang -cc1 -emit-llvm -O3 is sufficient for my needs). After that, it's time to prepare a shell script that actually compiles the code; you can skip this step if you don't actually need to provide arguments to the program. For example, my bugpoint.sh script would look like:

#!/bin/bash
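# bugpoint invokes this script with the test file it generated as the
# argument, so forward "$@" to the tool along with its usual flags.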
/path/to/tool "$@" -arg1 -arg2 -arg3=bleargh -o /dev/null

After that, the next step is to actually invoke bugpoint. Here's the command line that's running as I write this post: bugpoint --compile-custom -compile-command ./bugpoint.sh io_lat4.ll. Since my program causes an assertion failure, bugpoint figures out that I'm trying to crash on code generation (it can also detect miscompilation errors). Hopefully, the first bit of output you get should look like the following:

Error running tool:
  ./bugpoint.sh bugpoint-test-program.bc-FB1NoU
Text indicating your program crashed
  *** Debugging code generator crash!

If you've seen this much (you may need to wait for it to crash; it can take a long time if you're doing Debug+Asserts builds), you know that it's trying to find code that makes your tool crash. After that, bugpoint first tries to reduce global initializers and then tries to eliminate as many functions as possible. After that, it tries eliminating basic blocks and then goes to work eliminating instructions. You will see lots of streaming output telling you what it's removing; the documentation says it's helpful to capture this output, but I've found it useless.

When everything is done, you should get several files of the form bugpoint-*.bc; the most useful one is bugpoint-reduced-simplified.bc, which is the most reduced testcase. What you get now is a nice, reduced testcase? Not quite. First, it gives you just a bitcode file (I'd prefer .ll files, simply because my first thought is to read them to figure out what's going on). Another problem I have with bugpoint is that it doesn't do things like attempt to eliminate unused struct entries, nor does it bother to clean up some of its names. Take a look at this struct monstrosity: %struct.su3_matrix.4.134.147.173.199.212.238.290.303.329.381.394.407.459.472.485.498.524.550.563.576.589.602.615.628.654.667.680.719.758.823.836.849.862.1044.1057.1096.1808 = type { [3 x [3 x %struct.complex.3.133.146.172.198.211.237.289.302.328.380.393.406.458.471.484.497.523.549.562.575.588.601.614.627.653.666.679.718.757.822.835.848.861.1043.1056.1095.1807]] }.

Anyways, I hope this helps anyone else who has looked at bugpoint before and wondered "How do I actually use this tool to do something useful?".

Monday, February 27, 2012

Fun and games with assembly

This is a problem that came up during my research. Suppose I have a function foo whose assembly looks like the following:

foo:
  mov global($rip), %rax
  mov (%rax), %rax
  retq

Are the following two lines of code equivalent?

  /* defines somewhere */
  int val;
  extern int foo(void);
  /* actual code */
  asm("callq foo" : "=a"(val));
  val = foo();

The answer, as you might guess by the fact that I'm writing this post, is "no." The reason why is a lot more complex. In the case that caused me endless toil and grief, the function foo is not located in the same executable as the caller; it is in an external library file. Now, the platform ABI allows functions to clobber a fair number of registers which callers must assume are unusable. If the loader needs to perform work when calling a global relocated function, then it is a good idea to use those registers where possible, since you don't need to bother saving those values. When the asm construct is invoked, we've declared that only the return register is clobbered, so the compiler is free to store some values over the call. When the function call is direct and doesn't need to go through the GOT or anything similar, this is perfectly fine and works as one would expect. But if you need to invoke the loader when calling through the GOT, then you now have registers whose values have suddenly mysteriously changed on you when calling a function which doesn't (appear to) use them. Hope you have more fun debugging those cases than I did!

Wednesday, February 15, 2012

Achievement Unlocked: Use 100GB of memory in a single process

I discovered that the compiler I was using last night managed to run out of memory while compiling some code. This was on our research group's server which has a whopping 128GB of memory. And by "ran out of memory", I literally watched in top as clang went from dozens of megabytes of memory to 120GB of memory, evict everything from the cache and thrash swap like crazy before the out-of-memory killer decided it ought to go.

What was I doing? No, I wasn't linking the source code to the universe; I was merely compiling a file in the SPEC 2006 benchmarks. And this wasn't anything large: only 3700 lines of code or so in a single C++ file. And the pass that was dying wasn't some horribly written pass that leaked memory like a leaky bucket. Indeed, LLVM tends to get you to free memory too quickly, judging by how many times I've chased down memory corruption. If you don't believe me, the offending code was this:

    Code = CloneBasicBlock(CodeOrig, ValMap, "", F, NULL);
    for (II = Code->begin(), EE = Code->end(); II != EE; ++II) {
      if (!isa(II))
        RemapInstruction(II, ValMap, RF_IgnoreMissingEntries);
    }

For those of you who don't know what the code is doing, it roughly amounts to this: make a copy of a basic block of a function, and then point all of the copied instructions at the new values instead of the originals. On some programs, though, this simple operation seems to have compelled the code to eat up gigabytes of memory.

Fortunately, I am now proud to announce that I have reduced my memory consumption by 99%… with a one-line fix. So I guess I should get the achievement for "Fix the Memory Leak" as well?

Tuesday, January 10, 2012

How bugs get fixed

Recently, I have had the misfortune (er, opportunity) to fix bug 695309. This bug is perhaps a good exemplar of why "obvious" bugs take weeks or months to fix; it is also a good example of why just reporting that a bug occurs for the filer is insufficient to get it fixed. In the hope that others might find this useful, I will explain in a fake liveblogging format how I fixed the bug (fake because I'm writing most of these entries well after they happened).

October 20

There's a bug that "Thunderbird sometimes marks entire newsgroups as unread" in my email. Time to open up the bug report… reported in version 8… I'm pretty sure I've seen this once or twice before, so I don't think it's just a "the news server borked itself" issue. Time to file it away until when I have more time.

November 12

Another comment reminded me that I have this tab open. I've definitely seen it a few times, but I need to remember to keep Thunderbird open with logging enabled to figure out what's going on. It's reported as being a regression from version 8, when I checked in a slew of the NNTP URI parsing patches, so that seems like a probable target to look at. The question is why URI parsing would be causing an intermittent problem instead of a constant issue?

December 3

Some NNTP logs. Nothing is obviously amiss (not terribly unexpected, since logging either tends to drown you in useless information or omit the things you really want to know). Note after the fact: the NNTP logs in fact contain the smoking gun; it's just that the poster trimmed it out.

December 6

Bienvenu comments that no developer has seen the problem; this isn't true, as I have seen it (but just taken to avoiding it as much as possible). At some point in time, I figured out that the issue could be avoided by shutting down Thunderbird before putting my laptop to sleep. After being prodded over IRC, I finally sat down and attempted to debug it. The working thesis is that the problem is that newsrc files are getting corrupted, so I faithfully save several copies of the newsrc for confirmation. The msf files are also generally good candidates, but since other flags are untouched, it's highly unlikely in this bug.

I successfully trigger the bug once. The reports are mixed: the newsrc didn't get trashed as expected at first, but later it did. However, there is a brief window of time after the bug happens which allows you to fix it, if you shut down Thunderbird. Knowing what I now know, this is because the newsrc file is written out on a timer, so the newly-all-unread messages wouldn't have been saved to disk to show the bug. Since I always think of what to log after the fact, I try to trigger it a few more times.

None of the later tests are quite so successful as the first one. It does start to dawn on me that the bug probably has two parts. There is a first step that puts the group into a pre-bug situation; a second step actually produces the symptoms of the bug. Omniscient addendum: the first step is where the bug happens; the second step is just the delay in getting the bug to occur. I report my findings, and subsequent comments all but confirm my hypothesis for a necessary component.

December 12

This bug seems to happen for me only when I don't want it to. I had thought of more things to test earlier, and had them ready to copy-paste into my error console to try to find a smoking gun. This test confirms that the fault originates with the newsrcLine (so something reset that); more investigation leaves only one truly likely scenario that could cause it. At this point, I am all but convinced that the bug happens in two parts. All of the debugging needs to focus on when the first part happens; the symptoms (everything being marked unread) are the result of Thunderbird attempting to repair damage that has already been done. Since most people are going to report based on when they see the symptoms, I'm probably not going to get anything useful.

December 13

Hey, I found it again. Tests confirm that the high-water mark was first set to 0, which means either we have memory corruption or we are setting the value incorrectly in the first place. Looking at the code, the value is set from a call to sscanf. A simple test indicates that I can get it set to 0 if sscanf never actually scans a number. Now all I need to do is figure out what network traffic can cause this. Time to try to trigger it with xpcshell tests. And then get frustrated because I still can't do it reliably.
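
To make that concrete, here is a minimal, hypothetical illustration (the response strings and format are mine for demonstration, not the actual Thunderbird parsing code): a conversion that sscanf never performs leaves its argument untouched, so a zero-initialized high-water mark simply stays 0 when an unexpected reply arrives.

#include <cstdio>

int main() {
  long count = 0, low = 0, high = 0;
  // A well-formed NNTP GROUP reply: "211 <count> <low> <high> <group>".
  const char *good = "211 1234 3000234 3002567 misc.test";
  // An unexpected reply, e.g. because the connection state is off by one.
  const char *bad = "480 Authentication required";

  std::sscanf(good, "%*d %ld %ld %ld", &count, &low, &high);
  std::printf("good reply: high water = %ld\n", high);   // prints 3002567

  high = 0;
  int matched = std::sscanf(bad, "%*d %ld %ld %ld", &count, &low, &high);
  std::printf("bad reply: %d fields matched, high water = %ld\n",
              matched, high);                             // 0 fields, stays 0
  return 0;
}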

December 18

With Wireshark (logging on Windows is rather annoying), I can finally track down the network traffic of interest. It turns out that we are off by one in matching responses to commands. This really should be enough to get xpcshell to report the error. However, it also means I probably should give up and go for the NNTP logs again to catch the error: it is painfully obvious that the problem is that the internal state is futzed up. It also means that various other newly reported issues (more frequent auth failures, alerts like "Not in a newsgroup", etc.) are pretty much confirmed to be the same bug.

January 9

I finally buckled down and turned my very old hackish code for NSPR log management into an extension. This means that I don't have to futz with environment variables on Windows, and it also gives me an easy way to trim down my log size (since unlike environment variables, I can stop logging). I test and finally get the NNTP log of interest.

Now that I have a log, I can try to work backwards and figure out how this bug happens. The failure is easily ascribable to the code somehow thinking the socket is already open at the point where we open one; this much I have known for almost a month (Wireshark told me that). What the NNTP log gives me is a finer-grained ability to trace the code after the fact and figure out where the bug is.

Working backwards from the log

NNTP logs helpfully record the address of the NNTP protocol object when emitting output, so the best place to start is from the first line mentioning that object. Since the addresses are hard to read, I replace them with a more obvious string like "evil". The next state was set to SEND_FIRST_NNTP_COMMAND—that clearly should be NNTP_LOGIN_RESPONSE, so let's see why it might be skipped. The latter is chosen only if m_socketIsOpen is false, so perhaps something is opening the socket (and setting that flag) before it should.
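
In other words, the decision amounts to something like the sketch below (a paraphrase with hypothetical names, not a copy of the actual nsNNTPProtocol source):

// Paraphrase of the state selection described above; hypothetical names.
enum NntpState { NNTP_LOGIN_RESPONSE, SEND_FIRST_NNTP_COMMAND };

NntpState NextStateForLoadUrl(bool socketIsOpen) {
  // A freshly opened socket has to consume the server greeting first;
  // an already-open socket is assumed to have done that long ago.
  return socketIsOpen ? SEND_FIRST_NNTP_COMMAND : NNTP_LOGIN_RESPONSE;
}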

This variable is set only by nsMsgProtocol, specifically in the LoadUrl method. So who might call that? Still nothing out of the ordinary pops out at me, so it's time to return to the logs (a particularly astute developer might be able to spot the bug without returning to the log here, but I doubt most people would have the final flash of insight yet).

To figure out the problem, it's important (to me, at least) to set out the time frames. The natural sequence of events for a connection is to be created, connect to the server, and then go through several iterations of running URLs. Is this bad connection a new connection or an old one (since I started the log after Thunderbird started up, I can't guarantee that it is truly new)? Thunderbird stupidly (but helpfully, in this case) tells me when a connection is created, since we get a log message of "creating" in the constructor. Since this message doesn't appear, the connection is being reused.

Wait—that message is there, tucked just underneath the lines that tell me LoadUrl is run. As I am often wont to do, I announce my revelation to IRC: "< jcranmer> bienvenu: please slap me". The code for the constructor is pretty simple, but there is one little function call that ends up causing the problem. If I tell you the exact revision that caused the regression (specifically, the first hunk of that patch), can you find the regression? Click me if you can't.

Reproducing the bug

Now that I know the exact sequence of events that causes the bug, I can write a test to reliably reproduce it, and I can also explain why the symptoms occur. First, a connection's socket dies, but it doesn't fail the timeout check in the mailnews code, so it remains in the cache. Next, the server schedules a URL on that connection, which promptly dies. If there are many more URLs to be run (when we're doing, say, a timed update), we end up with several elements in the queue waiting to be processed. Then we attempt to run a new URL; with no connections available in the cache, we create a new connection to run it on. Since the queue is not empty, the SetIsBusy(false) call in the constructor ends up pulling a queued URL, setting up the state (including opening the connection), and running that. Then the constructor finishes, and the new URL for which the connection was constructed is used to initialize it. Since the connection is by now open, the code moves straight to the run-my-URL part of the state machine, which misinterprets the login response as the answer to the command it just sent. This leads to all of the various badness subsequently observed.
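
To see why the ordering matters, here is a deliberately stripped-down model of that sequence; the names echo the ones above (SetIsBusy, LoadUrl), but this is a sketch, not the real Thunderbird code.

#include <iostream>
#include <queue>
#include <string>

// Simplified model of the ordering hazard described above.
struct Connection {
  std::queue<std::string> &pendingUrls;  // server-wide queue of waiting URLs
  bool socketIsOpen = false;

  explicit Connection(std::queue<std::string> &q) : pendingUrls(q) {
    // The constructor announces "I'm free": if the queue is non-empty, this
    // pulls a queued URL and opens the socket before LoadUrl() is ever
    // called for the URL this connection was actually created for.
    SetIsBusy(false);
  }

  void SetIsBusy(bool busy) {
    if (!busy && !pendingUrls.empty()) {
      std::string queued = pendingUrls.front();
      pendingUrls.pop();
      LoadUrl(queued);
    }
  }

  void LoadUrl(const std::string &url) {
    if (!socketIsOpen) {
      socketIsOpen = true;
      std::cout << "open socket, await greeting, then run " << url << "\n";
    } else {
      // Already "open": the greeting gets misread as the reply to this
      // URL's first command, which is the corruption described above.
      std::cout << "misinterpret greeting as reply for " << url << "\n";
    }
  }
};

int main() {
  std::queue<std::string> pending;
  pending.push("news://server/queued.group");  // left over from a dead connection
  Connection c(pending);                       // constructor drains the queue
  c.LoadUrl("news://server/requested.group");  // hits the buggy path
  return 0;
}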

The description of these events is pretty complicated. A condensed version is included in the test I constructed to reproduce this, which is called, fittingly enough, test_bug695309.js. I am not going to try to come up with a better name, as I think most people might agree that this is a case where naming tests after bug numbers is most desirable. Naturally, the code to actually trigger the bug is complicated: I started with code to trigger just the unread effect, and then switched to monitoring the high-water mark instead (one fewer message download). I need to fill up the connection cache and then figure out how to kill the connections (restarting the server seems to do an adequate job). After that, I need to run enough URLs to cause the connections to queue, and then trigger a load of the folder in question at precisely the right moment. All of this requires paying careful attention to what happens synchronously and what does not, and it all relies on knowing precisely what will happen as a result of each action. Any of a small number of changes I could make in the future is probably going to "break" the test, in that it will cease to test anything meaningful.

Postmortem on fixing it

I'm sure some people might look at this bug and ask several questions. Let me try to anticipate some of them and answer them.

Why did the reviewers fail to notice this change?
This bug is the result of the confluence of several changes. In the beginning, SetIsBusy was mostly a no-op, so calling it from the constructor was safe. Then the URL queue was added, and SetIsBusy was used to indicate that the connection was ready to receive a new URL from the queue. This turned it into a time bomb, since the code was correct only because the server was null in the constructor and the queue should have been empty during connection construction anyway. The final change was moving the initialization of the server, which triggered the time bomb. Even then, it triggered the bomb only in a rather abnormal circumstance. At no time would the code have failed any automated tests or any of the "does your patch work" checks that a reviewer normally goes through: robustness during connection interruption is both difficult enough to test and rare enough in practice that it generally doesn't get tested.
Why did this bug take so long to track down?
The time since December 1 is truly a result of lack of time: I have had maybe one good week to work on anything, due to outside factors. There are two, maybe three, people actively involved with Mozilla who know this code well, and of those, only one was able to somewhat reliably reproduce it (specifically, the one with the least time on his hands). I could have relied on others to gather the information I needed, but my previous efforts in QA have taught me that getting people to give you the necessary information is often very difficult, especially when the information collectable at the obvious point in time is completely useless. I realize in retrospect that there were some more technically inclined people in the bug (though by that point I was already devoting as much attention as I felt I could spare to it). The bug is also particularly pernicious in that the most valuable information is gathered before any visible symptoms occur; only after I figured out how to detect that the bug was about to occur could I track it down further.
Will the fix be backported to more stable branches?
At this stage, I am looking at two fixes: the small one-liner that fixes the specific bug, and the larger patch that fixes the broader issue of "the NNTP URL connection queue is a ticking time bomb". The former I would definitely like to see backported to aurora and beta (and possibly release, if there are plans for another one before the next uplift), and I can't see any reason a small patch for the most-highly-voted Thunderbird bug filed since January 1, 2011 would get rejected (it is the newest bug to have more than 5 votes, and the next-most-voted has a third fewer votes and is almost twice as old).
Why is this bug so hard to reproduce?
If you think back to the cause of the bug, it is that a check is being passed in a function that is always called. It therefore seems like the bug should trigger a majority of the time, not an undependable minority. As I alluded to earlier, the problem is that it also has to interact with a buggy pending-URL queue management system: most of the time, that queue is empty during the initial, buggy call; only with unlucky timing (less likely to happen on most developers' machines) is the queue non-empty, causing the problem. Indeed, when crafting my reproduction test, I discovered that letting the event loop spin in one particular location (after killing the connections) would fail to trigger the bug. In the real world, that is one of the places where the event loop is liable to spin the longest.