Tuesday, May 29, 2012

More code coverage treemap fun

Quite some time ago, I built a treemap visualization of code coverage. It was inspired by the information visualization course I was taking at the time, and I have since found a less static version more useful for getting a feel for code coverage than lcov's index pages. Back then, two years ago, the only decent toolkit I found for building treemaps was in Java, which would have necessitated applets if I tried to put it on the web. Since most of what I produced was just static images, using Java locally wasn't an issue. Then again, I'm also a contributor to an organization that is pushing hard on being able to do things in pure JS, so I thought I might as well try to replicate it with a JS toolkit.

Two years have made quite a difference in terms of JS toolkits. Back in 2010, the only JS toolkit I found that had treemaps was Protovis, which I found fairly unusable. Now there are several higher-quality toolkits, but they all share a fairly major problem: they force you to shoehorn your data into a specific format instead of letting you customize how the data is interpreted. The JavaScript InfoVis Toolkit made me happy with its results but annoyed in getting it to work. I finally settled on D3, which is still too low-level to make me entirely happy (don't ask me how painful getting animations working was).

The end result is here (the equivalent lcov result is also available). Every rectangle represents a single file of source code. Hover over it, and you get details of coverage in that file. Click on it, and you zoom one level down in the source code hierarchy; ctrl-click zooms one level back up. The size of a rectangle indicates how large the file is (configurable by number of lines or number of functions). Its color indicates what percentage of lines (or functions) is covered: red means very few are executed in tests, while green means most are. There is a scale that shows what a given percentage of coverage looks like, and there is even internal code to skew the scale (so that the muddy-red midpoint sits at roughly 70% instead of 50%), but it is disabled since the scale itself is harder to change. Changing parameters causes the treemap to animate to its new layout, but the sheer number of elements appears to give Gecko heartburn; it's much smoother if you zoom in to smaller directories first. I realize this also results in a lot of boxes flying around each other, but a squarified treemap is unfortunately a very unstable algorithm.
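For the curious, the heart of such a view is not much code. Here is a minimal sketch of the size-and-color encoding using today's D3 API (not the code the page actually uses; coverageTree and the lines/linesCovered field names are illustrative stand-ins):

```js
// Minimal sketch of a coverage treemap; assumes a d3 (v7) global and a
// nested coverageTree object with illustrative lines/linesCovered fields.
const width = 960, height = 600;

const root = d3.hierarchy(coverageTree)
    .sum(d => d.children ? 0 : d.lines)       // size leaves by line count
    .sort((a, b) => b.value - a.value);       // squarify works best sorted

d3.treemap().size([width, height]).paddingInner(1)(root);

// Red at 0% covered, muddy in the middle, green at 100%.
const color = d3.scaleLinear()
    .domain([0, 0.5, 1])
    .range(["#c00", "#880", "#0a0"]);

const svg = d3.select("body").append("svg")
    .attr("width", width).attr("height", height);

svg.selectAll("rect")
    .data(root.leaves())
  .enter().append("rect")
    .attr("x", d => d.x0).attr("y", d => d.y0)
    .attr("width", d => d.x1 - d.x0)
    .attr("height", d => d.y1 - d.y0)
    .attr("fill", d => color(d.data.linesCovered / d.data.lines))
  .append("title")                            // poor man's tooltip
    .text(d => `${d.data.name}: ${d.data.linesCovered}/${d.data.lines}`);
```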

The technology that underlies the visualization isn't complicated. I used lcov to produce a summary file, and then used a Python script to convert that file into a summarized JSON version. Tooltips are provided by tipsy, which I don't particularly like, but I haven't found anything as easy to use that also serves my needs (most importantly, placing tooltips so they don't fall off the edge of the screen). As an aside, a visualization toolkit that doesn't have any tooltip support is extremely bad: details-on-demand is a critical component of information visualization.
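To give a concrete idea of the shape of the data, the summarized JSON is a nested tree along these lines (an illustrative sketch of the sort of tree the D3 snippet above would consume; the actual field names my script emits may differ):

```js
// Illustrative only: one plausible shape for the summarized coverage tree.
const coverageTree = {
  name: "comm-central",
  children: [
    {
      name: "mailnews",
      children: [
        { name: "nsMsgDBFolder.cpp",
          lines: 5200, linesCovered: 3100,
          functions: 240, functionsCovered: 170 }
      ]
    }
  ]
};
```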

There are some things I don't have: I don't scrape branch coverage yet, for example. A query parameter that lets you zoom into a specific directory immediately would be helpful, as would being able to cut off the tree before the leaves. I would also like a nice view of a single file, so I can kick the lcov output entirely. Being able to select individual files would also be a boon. Another neat feature would be selecting from multiple sources (e.g., different test suites' coverage, or even monitoring how code coverage changes through the years!).

As a postscript, let me give you a tour of the top-level directories in the view. The tiny sliver in the upper-left is, from top to bottom, the C++ source code of Thunderbird-specific code, Lightning, and LDAP. Then there is a sizable block for mork, and the rest of the left column is mailnews code. Then there is a barely-noticeable column and another small but noticeable column which between them contain a lot of small top-level directories in mozilla. The next column (similar in width to the comm-central column) contains, from top to bottom, IPC, spellcheck, widget, security, accessibility, parser, and xpcom. The next column contains editor, toolkit, network, and DOM code. The right 40% or so of the diagram lacks definable columns. The upper-left of this portion (you'll notice it by the large red block with a lot of "S" letters peeking out) is the graphics code. To its right lies JS. Beneath both of these lies the code for layout, and beneath that and at the bottom is content.

And, no, my code is not on GitHub yet, because almost everything that matters is in that one HTML file. I also plan to integrate this view into DXR at some point.

Monday, May 28, 2012

A tour of libmime

Libmime (the C++ version, that is) exposes three services, not including the tight coupling it shares with compose code. These are a service that extracts headers (which many people probably didn't know existed), a service that parses structured headers (used in two places), and everything else. The first two are fairly straightforward to understand, so it's the complexity of the last one that is worth talking about here.

The first thing the astute programmer will notice is the apparent lack of a way to drive the MIME converter. This is because the converter also implements nsIStreamConverter, kind of. The MIME converter uses the fact that it derives from nsIStreamListener to receive data, and it uses the Init method to set itself up (while effectively ignoring half its parameters). There is logic to extract the URL to load from the context parameter, and this is where the fun and games begin. If no output type is manually selected, it is sniffed from the URL as follows (a sketch of this dispatch follows the list):

  1. No URL results in quoting.
  2. Finding part= (unless the desired type is application/vnd.mozilla.xul+xml) causes the output type to be raw, unless a type= parameter is present, which induces body display and uses that type for the stream's content-type.
  3. Finding emitter=js causes a special case that is used to select gloda's JS MIME emitter.
  4. Finding header= does a lookup for the special mime output type; the possible values are filter, quotebody, print, only (uses header display), none (uses body display), quote, saveas, src, and raw, most of whose mappings to the appropriate value in nsMimeOutput are fairly obvious.
  5. The default, if none of the above match, is to display the body.
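Transliterated into pseudo-JS (all of the names below are my own illustrative stand-ins, not the C++ identifiers, and the returned strings stand in for nsMimeOutput constants):

```js
// A rough JS transliteration of the sniffing rules above.
const headerOutputs = {
  filter: "filter", quotebody: "quote-body", print: "print",
  only: "header-display", none: "body-display", quote: "quote",
  saveas: "save-as", src: "source", raw: "raw",
};

function sniffOutputType(url, desiredType) {
  if (!url)
    return "quote";                                       // rule 1
  const params = new URL(url).searchParams;               // absolute URL
  if (params.has("part") &&
      desiredType !== "application/vnd.mozilla.xul+xml")  // rule 2
    return params.has("type") ? "body-display" : "raw";
  if (params.get("emitter") === "js")
    return "gloda-js-emitter";                            // rule 3
  if (params.has("header"))
    return headerOutputs[params.get("header")];           // rule 4
  return "body-display";                                  // rule 5
}
```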

With the URL and output format settled, the next step is choosing the emitter. The gloda hack obviously results in the gloda JS MIME emitter, and editor templates and drafts don't use an emitter at all, but, otherwise, either the raw emitter (attach, decrypt, raw, and source), the XML emitter (header display), or the HTML emitter (everybody else) is used.

After the emitter is initialized and constructed, a bridge stream is set up to actually pump the data into libmime. The output is handled differently for data that gets sent to compose and data whose display is being driven. The former merely passes through some more options and then decomposes attachments into temporary files, while the latter is more interesting and complicated. The URL is parsed (again) and the following parts are noticed: if headers= is present and is one of only, none, all, some, micro, cite, or citation, then opts->headers is set to signify the choice. The presence of part= sets opts->part_to_load, while rot13 enables rot13_p (did you know Thunderbird had that feature?). Some more hacks are enabled if emitter=js is present, and then the part number is checked to see if it might be an attempt to use the old (Netscape 2) MIME part numbering scheme. After all of this is done, we set up charset overrides for the message, if any are needed, and can then actually start pumping data.
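In the same hedged pseudo-JS as before (the opts field names mirror the C struct's; everything else is illustrative):

```js
// Sketch of the display-time option parsing described above; opts field
// names mirror the C struct, the rest is my own shorthand.
function parseDisplayOptions(url, opts) {
  const params = new URL(url).searchParams;
  const headerChoices = ["only", "none", "all", "some",
                         "micro", "cite", "citation"];
  if (headerChoices.includes(params.get("headers")))
    opts.headers = params.get("headers");
  if (params.has("part"))
    opts.part_to_load = params.get("part");
  if (params.has("rot13"))
    opts.rot13_p = true;          // yes, that feature exists
  return opts;
}
```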

The guts of the actual parsing lie in libmime's internal polymorphic object system. The text initially gets fed to an instance of MimeMessage, but once the headers of one of these containers get parsed, the body is fed through another object. Going from a content-type to an object takes several lines of code, but a brief summary of the decisions should suffice. If no Content-Type header is present, the code chooses a default (ultimately, for children of MimeMessage, the default is untyped text, which sniffs for binhex, yEnc, and uuencode). Some parts can instead have their content-type sniffed from a found filename. The extension point that allows extensions to add their own content-type handlers runs before anyone else, but, as far as I can tell, only Enigmail leverages this. There is another subtle extension hook which allows converting any type to HTML (Lightning uses this). Some types are simple and always go to the same class: message/rfc822 and message/news both result in a MimeMessage, while message/external-body creates a MimeExternalBody. Other types get dispatched mostly based on preferences: text/html and multipart/* are the biggest cases here. Everything else (note that text/* defaults to text/plain and multipart/* to multipart/mixed) becomes a MimeExternalObject, and a paranoia preference can also turn half of the classes into a MimeExternalObject.
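A compressed sketch of that dispatch, again in illustrative JS (the class-name strings correspond to libmime's real classes, but the control flow and preference handling are heavily simplified):

```js
// Compressed sketch of content-type-to-class dispatch; class names are
// libmime's, but the single pref flag and the tables here are mine.
const extensionHandlers = {};    // handlers registered by extensions
                                 // (e.g. Enigmail) win before anyone else
const multipartClasses = {
  "multipart/alternative": "MimeMultipartAlternative",
  "multipart/related":     "MimeMultipartRelated",
  "multipart/signed":      "MimeMultipartSigned",
};

function classForContentType(type, prefs) {
  if (extensionHandlers[type])
    return extensionHandlers[type];
  switch (type) {
    case "message/rfc822":
    case "message/news":          return "MimeMessage";
    case "message/external-body": return "MimeExternalBody";
  }
  if (type === "text/html")              // pref-dependent dispatch
    return prefs.sanitizeHtml ? "MimeInlineTextHTMLSanitized"
                              : "MimeInlineTextHTML";
  if (type.startsWith("multipart/"))     // unknown multipart/* -> mixed
    return multipartClasses[type] || "MimeMultipartMixed";
  if (type.startsWith("text/"))          // unknown text/* -> plain
    return "MimeInlineTextPlain";
  return "MimeExternalObject";           // everything else
}
```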

After creating classes, the next phase is the actual parsing logic. And this is where I'm stopping the tour for now. I've got more precise information laid out on virtual sticky notes on my laptop, but I don't think it's worth posting what amounts to several large switch statements in a brief tour.

Sunday, May 6, 2012

A new JS MIME parser

Here is one of those guessing games whose answer will depress you: how many (partial) MIME parsers do we have in Thunderbird? If you guessed zero, ha-ha, very funny; now go sit in the corner. If you guessed one, you've obviously never worked with libmime before. The actual answer depends on how liberally you want to count. On the conservative end, there are no fewer than five parsers which go at least as far as parsing multipart messages (including two in a single file). If you count liberally, including everybody who attempts to split header values to look for a specific one, you gain an additional six that I am aware of… and I have no doubt that more are lurking behind them. I suppose this means that we now have more implementations of message-header parsing than of base64 decoding (although nowadays, it seems like most base64 decoding got stuffed into one method with several wrappers). The complete list as I know it:

  • nsMsgBodyHandler
  • nsMsgDBFolder::GetMsgTextFromStream
  • libmime
  • IMAP fakeserver (one for body parts, one for body structure, and another spot that happily hunts down headers)
  • NNTP fakeserver (hunts down headers)
  • nsParseMailbox
  • nsNNTPProtocol (hunts down headers)
  • nsMsgLocalStoreUtils (hunts down headers)
  • Somewhere in the compose code, probably, although it's not clear how much is funneled back to libmime and how much is not.
  • A class in necko implements RFC 2231 and RFC 2047 decoding.

Well, it's time for me to increment that count by one. And then decrement it by at least three (two of the IMAP fakeserver decoders and the NNTP one are switched over in a queued patch, and I hope to get the third IMAP decoder switched over as well). I've been working on a new JS MIME parser for Thunderbird a bit at a time over the past several months, and I have local patches that start using it in tests (and as a replacement for nsIMimeHeaders).

So, why replace libmime instead of consolidating everyone onto it? The short answer is that libmime is a festering pile of crap. It uses an internal object system that can be summarized as "reimplementing C++ in C" and has existed largely in the same form since at least Netscape 5 (the Mozilla classic version gives a date of May 15, 1996). All of that would be manageable were it not for the fact that the architecture is clearly broken. Reliably parsing multipart MIME messages is not a trivial task (indeed, not only does one of the parsers above do it incorrectly, but there is actually a test which may rely on it being wrong; guess which one it is), and the fact that so many people have written variants on it is a clear indication that the API fails to expose what it ought to expose. In short, the code needs to change, and its implementation is in a form which makes changing things extremely difficult.
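To see why this is harder than it looks, here is a naive line-based splitter of the sort those ad-hoc parsers implement; the comments note exactly the cases where such code goes subtly wrong:

```js
// Naive multipart splitter, for illustration only. Real parsers must
// handle the CRLF that belongs to the boundary line, the closing
// "--boundary--" delimiter, nested parts reusing a boundary, and input
// arriving in chunks rather than as one string.
function splitMultipart(body, boundary) {
  const delim = "--" + boundary;
  const parts = [];
  let current = null;                      // null until the first boundary
  for (const line of body.split(/\r\n/)) { // lines before it are preamble
    if (line === delim || line === delim + "--") {
      if (current !== null)
        parts.push(current.join("\r\n"));
      current = line === delim ? [] : null; // null again after the close
    } else if (current !== null) {
      current.push(line);
    }
  }
  return parts;
}
```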

The choice to do it in JS was motivated mostly by Andrew Sutherland. There has been a long-term ideal in Thunderbird, dating back to at least around 2008, to move more of the core code's implementation into JS, which would help avoid the massive churn spurred on by mozilla-central; nowadays, there is the added benefit of aiding efforts like B2G, or porting to platforms where native code is frowned upon. MIME code, being extremely self-contained (charset conversion and S/MIME encryption make up the biggest dependencies in the core parser), is a natural candidate. As of my current implementation, the only external dependency the MIME parser has is atob, although charset conversion (via the proposed string encoding API) will be added when I get there. In other words, this code is usable by anyone who wants to write a mail client in JS, not just the internal mailnews code.
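For reference, atob really is enough to turn a base64-encoded part back into octets; here's a hedged sketch (a real parser also has to buffer partial base64 quanta across stream chunks):

```js
// Decode a base64-encoded body part into bytes using only atob.
function decodeBase64Part(text) {
  const binary = atob(text.replace(/\s+/g, ""));  // strip folded whitespace
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++)
    bytes[i] = binary.charCodeAt(i);
  return bytes;
}
```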

Another advantage to writing my own library is that it gives me a lot of time to lay out specifications in clearer terms. One issue called out in the spec is how to handle seeing boundaries for the non-leaf-most part. My own brain went further and started musing on non-leaf parts getting quoted-printable or base64 content-transfer-encodings (which RFC 6532 went ahead and allowed anyways), or multiple nested parts having the same boundary (which the specification, in my view, hints at resolving in a particular fashion). Other discoveries include the fact that most MIME parsers do not permit as much CFWS as the specification indicates could be present ("Content-Type: multipart / mixed" would be unusable in every MIME parser source I read). I also produced a list of all the specifications the parser will need to refer to that I have found so far (13 RFCs and a handful of non-RFCs, plus 9 obsoleted RFCs that may still warrant viewing). As for size? The JS implementation is about 600-700 lines right now (including over 300 lines of comments), while the equivalent C++ portions take over a thousand lines to do more or less the same thing.
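To illustrate the CFWS point, here is a hedged sketch of extracting the media type while tolerating the comments and folding whitespace the grammar allows (not my parser's actual code; nested comments are punted on):

```js
// Extract the media type from a Content-Type value while tolerating
// CFWS around the slash, per the RFC 5322/2045 grammar. Comments are
// stripped naively; nested (comments (like this)) are not handled.
function mediaType(value) {
  const noComments = value.replace(/\([^)]*\)/g, " ");
  const m = noComments.match(/^\s*([^\s/;]+)\s*\/\s*([^\s;]+)/);
  return m ? `${m[1].toLowerCase()}/${m[2].toLowerCase()}` : null;
}

// Both "multipart / mixed" and "multipart (a comment) / mixed" yield
// "multipart/mixed" here, while many parsers in the wild would choke.
```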

As I said earlier, one of the problems with libmime is its oppressive architecture. It is primarily designed to drive the message pane while living as a stream converter. However, it encompasses roughly three steps: producing the raw MIME tree (or a tree that combines the raw MIME data with pseudo-MIME alternatives), converting that tree into a more traditional body-and-attachments view, and finally driving the UI. Getting it to stop any earlier is somewhere between difficult and impossible; what I have now, once it gets some more testing, can satisfy people who just need the first step. Doing the second step would require me to sit down and document how libmime makes its choices, which is not a task I look forward to.