Sunday, May 6, 2012

A new JS MIME parser

Here is one of those guessing games whose answers will depress you: how many (partial) MIME parsers do we have in Thunderbird? If you guessed zero, ha-ha, very funny; now go sit in the corner. If you guessed one, you've obviously never worked with libmime before. The actual answer depends on how liberal you want to be. On the conservative end, there are no fewer than 5 parsers which go at least as far as parsing multipart messages (including two in a single file). If you choose to go liberally, counting everybody who attempts to split header values to look for a specific value, you gain an additional six that I am aware of… and I have no doubt that more are lurking behind them. I suppose this means that we now have more implementations of message header parsing than we do base64-decoding (although nowadays, it seems like most base64-decoding got stuffed into one method with several wrappers). The complete list as I know it:

  • nsMsgBodyHandler
  • nsMsgDBFolder::GetMsgTextFromStream
  • libmime
  • IMAP fakeserver (one for body parts, one for body structure, and another spot happily hunts down headers)
  • NNTP fakeserver (hunts down headers)
  • nsParseMailbox
  • nsNNTPProtocol (hunts down headers)
  • nsMsgLocalStoreUtils (hunts down headers)
  • Somewhere in compose code. Probably, although it's not clear how much is being funneled back to libmime and how much is not.
  • A class in necko implements the RFC 2231 and RFC 2047 decoding.

Well, it's time to increment that count by one. And then decrement it by at least three (two of the IMAP fakeserver decoders and the NNTP fakeserver decoder are switched over in a queued patch, and I hope to get the third IMAP fakeserver decoder switched over). I've been working on a new JS MIME parser for Thunderbird a bit at a time over the past several months, and I have local patches that start using it in tests (and as a replacement for nsIMimeHeaders).

So, why replace libmime instead of consolidating everyone onto it? The short answer is because libmime is a festering pile of crap. It uses an internal object system that can be summarized as "reimplementing C++ in C" and has existed largely in the same form since at least Netscape 5 (the Mozilla classic version gives a date of May 15, 1996). All of that would be manageable were it not for the fact that the architecture is clearly broken. Reliably parsing multipart MIME messages is not a trivial task (indeed, not only does one of the above do it incorrectly, but there is actually a test which may rely on it being wrong; guess which one it is), and the fact that so many people have written their own variants is a clear indication that the API fails to expose what it ought to expose. This means that the code is in need of change, and the implementation is in a form which makes changing things extremely difficult.

The choice to do it in JS was motivated mostly by Andrew Sutherland. There has been a long-term ideal in Thunderbird, dating back to at least around 2008, to move more implementation of core code into JS, which would help avoid massive churn spurred on by mozilla-central; nowadays, there is the added benefit that it would aid in efforts like B2G or porting to platforms where native code is frowned upon. MIME code, being extremely self-contained (charset conversion and S/MIME encryption make up the biggest dependencies in the core parser), is a good candidate for such a move. As of my current implementation, the only external dependency that the MIME parser has is atob, although charset conversion (via the proposed string encoding API) will be added when I get there. In other words, this code is usable by anyone who wants to write a mail client in JS, not just the internal mailnews code.

Another advantage to writing my own library is that it gives me a lot of time to lay out specifications in clearer terms. One case called out in the spec is how to handle seeing boundaries for the non-leaf-most part. My own brain went further and started musing on non-leaf parts getting QP or base64 content-transfer-encodings (which RFC 6532 went ahead and allowed anyways), or multiple nested parts having the same boundary (which the specification, in my view, hints at resolving in a particular fashion). Other developments include the fact that most MIME parsers do not permit as much CFWS as the specification indicates could be present ("Content-Type: multipart / mixed" would be unusable in every MIME parser source I read). I also produced a list of all the specifications that the parser will need to refer to that I have found so far (13 RFCs and a handful of non-RFCs, plus 9 obsoleted RFCs that may still warrant viewing). As for size? The JS implementation is about 600-700 lines right now (including over 300 lines of comments), while the equivalent C++ portions take over a thousand lines to do more or less the same thing.
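To make the CFWS point concrete, here is a minimal sketch of extracting the media type from a Content-Type value while tolerating whitespace and comments around the slash. It is illustrative only and is not the code from the parser described here; `stripComments` and `parseMediaType` are hypothetical names, and the sketch assumes only the type and subtype are wanted (parameters, and parentheses inside quoted parameter values, are deliberately not handled).

```javascript
// Hypothetical helpers for illustration; not the actual parser API.

// Remove RFC 5322 style comments, which are parenthesized and may nest.
// Assumption: only the text before the first ";" matters to the caller,
// so parentheses inside quoted parameter values are not treated specially.
function stripComments(value) {
  let depth = 0;
  let out = "";
  for (let i = 0; i < value.length; i++) {
    const ch = value[i];
    if (ch === "\\" && depth > 0) {
      i++; // skip the escaped character inside a comment
    } else if (ch === "(") {
      depth++;
    } else if (ch === ")" && depth > 0) {
      depth--;
    } else if (depth === 0) {
      out += ch;
    }
  }
  return out;
}

// Parse the value of a Content-Type header (without the header name).
// Returns { type, subtype } in lowercase, or null if no media type is found.
function parseMediaType(headerValue) {
  const beforeParams = stripComments(headerValue).split(";")[0];
  const match = /^\s*([^\s\/]+)\s*\/\s*([^\s\/]+)\s*$/.exec(beforeParams);
  if (!match) {
    return null;
  }
  return { type: match[1].toLowerCase(), subtype: match[2].toLowerCase() };
}

console.log(parseMediaType("multipart / mixed"));
```

A parser that instead matches the value against a fixed string such as "multipart/" would reject the spaced form, which is the failure described above.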

As I said earlier, one of the problems with libmime is its oppressive architecture. It is primarily designed to drive the message pane while living as a streaming converter. However, it encompasses roughly three steps: the production of the raw MIME tree (or a tree that combines the raw MIME data with pseudo-MIME alternatives), conversion of that tree into a more traditional body-and-attachments view, and then the final driving of the UI. Getting it to stop any earlier is between difficult and impossible; what I have now, once it gets some more testing, can satisfy people who just need the first step. Doing the second step would require me to sit down and document how libmime makes its choices, which is not a task I look forward to doing.
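For readers wondering what "just the first step" looks like, here is a deliberately simplified sketch of producing a raw MIME tree. It is not the parser discussed in this post: the function names and the `{ headers, body, children }` shape are assumptions made for illustration, content-transfer-encodings and charsets are ignored, a repeated header keeps only its last value, and line endings in bodies are normalized to "\n". One design choice worth noting: each level scans for its own boundary before recursing, so an outer boundary ends an inner part even when the inner part's closing delimiter is missing.

```javascript
// Illustrative sketch only; names and tree shape are assumptions.

// Turn header lines into a Map of lowercased name -> value,
// joining folded continuation lines onto the previous header.
function parseHeaders(headerLines) {
  const unfolded = [];
  for (const line of headerLines) {
    if (/^[ \t]/.test(line) && unfolded.length > 0) {
      unfolded[unfolded.length - 1] += " " + line.trim();
    } else {
      unfolded.push(line);
    }
  }
  const headers = new Map();
  for (const line of unfolded) {
    const colon = line.indexOf(":");
    if (colon > 0) {
      headers.set(line.slice(0, colon).trim().toLowerCase(),
                  line.slice(colon + 1).trim());
    }
  }
  return headers;
}

// Return the boundary parameter of a multipart Content-Type, else null.
function getBoundary(contentType) {
  if (!/^\s*multipart\s*\//i.test(contentType)) {
    return null;
  }
  const m = /;\s*boundary\s*=\s*(?:"([^"]*)"|([^\s;]+))/i.exec(contentType);
  if (!m) {
    return null;
  }
  return m[1] !== undefined ? m[1] : m[2];
}

// Parse one part, given as an array of lines, into a tree node.
function parsePart(lines) {
  let split = lines.indexOf(""); // headers end at the first empty line
  if (split === -1) {
    split = lines.length;
  }
  const headers = parseHeaders(lines.slice(0, split));
  const bodyLines = lines.slice(split + 1);
  const boundary = getBoundary(headers.get("content-type") || "text/plain");
  if (boundary === null) {
    return { headers, body: bodyLines.join("\n"), children: [] };
  }

  const children = [];
  let current = null; // lines of the child being collected; null in preamble
  for (const line of bodyLines) {
    const trimmed = line.trimEnd(); // delimiters may carry trailing padding
    if (trimmed === "--" + boundary + "--") {
      if (current) {
        children.push(parsePart(current));
      }
      current = null;
      break; // everything after the close delimiter is epilogue
    } else if (trimmed === "--" + boundary) {
      if (current) {
        children.push(parsePart(current));
      }
      current = [];
    } else if (current) {
      current.push(line);
    }
  }
  if (current) {
    children.push(parsePart(current)); // tolerate a missing close delimiter
  }
  return { headers, body: null, children };
}

// Entry point: parse a whole message given as a string.
function parseMessage(text) {
  return parsePart(text.split(/\r?\n/));
}
```

Calling `parseMessage` on a message string returns the root node; walking `children` recursively gives the raw tree, with no attempt at the body-and-attachments interpretation of the second step.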

7 comments:

Anonymous said...

Is it written with JISON?

http://zaach.github.com/jison/

Anonymous said...

Will it help in any way to implement in Thunderbird native M$ Transport Neutral Encapsulation Format (messages with winmail.dat files)?

Anonymous said...

No link to the code?

Anonymous said...

Several years ago I filed this:
https://bugzilla.mozilla.org/show_bug.cgi?id=248846

tried to work on it, but didn't get enough traction (and was not experienced enough and without enough time to do it alone).

Eyal

Joshua Cranmer said...

1. MIME is not suitable for parsing with standard yacc-esque parsers.

2. Being able to decode TNEF is explicitly mentioned in one of the files as a long-term goal.

3. bug 746052 has my code, but it's a few versions old and doesn't have any of my now-use-it patches.

4. I've played with using libmime as a testcase for automated rewriting in the past. If I go any further and start trying to replicate later logic, I may very well locally do a rewrite using clang or something just so I am sane enough to continue reading.

fbender said...

Can we have it on Github, please? ;) Maybe you can join forces with andris9[1]?

[1] https://github.com/andris9/mailparser

andris said...

I'm the author of Node.JS module MailParser, mentioned in the previous comment, and I'd love to help if needed.

I think that by now MailParser is a pretty solid library for parsing e-mails. And it's mainly pure JavaScript except for the iconv library that is used for charset conversion. The module can be tested live here: http://node.ee/MailParser/Demo

One feature missing I'd like to add some day is DKIM validation but this shouldn't be very hard to do as I've made DKIM signing in another Node.JS e-mail related module MailComposer and signing is pretty much the same as verifying.