Saturday, September 14, 2013

Why email is hard, part 1: architecture

Which is harder, writing an email client or writing a web browser? Several years ago, I would have guessed the latter. Having worked on an email client for several years, I am now more inclined to guess that email is harder, although I never really worked on a web browser, so perhaps it's just bias. Nevertheless, HTML comes with a specification that tells you how to parse crap that pretends to be HTML; email messages come with no such specification, which forces people working with email to guess based on other implementations and bug reports. To vent some of my frustration with working with email, I've decided to post some of my thoughts on what email did wrong and why it is so hard to work with. Since there is so much to talk about, instead of devoting one post to it, I'll make it an ongoing series with occasional updates (i.e., updates will come out when I feel like it, so don't bother asking).

First off, what do I mean by an email client? The capabilities of, say, Outlook versus Gaia Email versus Thunderbird are all wildly different, and history has afforded many changes in support. I'll consider anything that someone might want to put in an email client as fodder for discussion in this series (so NNTP, RSS, LDAP, CalDAV, and maybe even IM stuff might find discussions later). What I won't consider are things likely to be found in a third-party library, so SSL, HTML, low-level networking, etc., are all out of scope, although I may mention them where relevant in later posts. If one is trying to build a client from scratch, the bare minimum one needs to understand first is the basic message formatting, MIME (which governs attachments), SMTP (email delivery), and either POP or IMAP (email receipt). Unfortunately, each of these requires cross-referencing a dozen RFCs individually when you start considering optional or not-really-optional features.

The current email architecture we work with today doesn't have a unique name, although "Internet email" [1] or "SMTP-based email" are probably the most appropriate appellations. Since there is only one in use in modern times, there is no real need to refer to it by anything other than "email." The reason for the use of SMTP in lieu of any other major protocol to describe the architecture is because the heart of the system is motivated by the need to support SMTP, and because SMTP is how email is delivered across organizational boundaries, even if other protocols (such as LMTP) are used internally.

Some history of email, at least that lead up to SMTP, is in order. In the days of mainframes, mail generally only meant communicating between different users on the same machine, and so a bevy of incompatible systems started to arise. These incompatible systems grew to support connections with other computers as networking computers became possible. The ARPANET project brought with it an attempt to standardize mail transfer on ARPANET, separated into two types of documents: those that standardized message formats, and those that standardized the message transfer. These would eventually culminate in RFC 822 and RFC 821, respectively. SMTP was designed in the context of ARPANET, and it was originally intended primarily to standardize the messages transferred only on this network. As a result, it was never intended to become the standard for modern email.

The main competitor to SMTP-based email that is worth discussing is X.400. X.400 was at one time expected to be the eventual global email interconnect protocol, and interoperability between SMTP and X.400 was a major focus in the 1980s and 1990s. SMTP has a glaring flaw, to those who work with it, in that it is not so much designed as evolved to meet new needs as they came up. In contrast, X.400 was designed to account for a lot of issues that SMTP hadn't dealt with yet, and included arguably better functionality than SMTP. However, it turned out to be a colossal failure, although theories differ as to why. The most convincing to me boils down to X.400 being developed at a time of great flux in computing (the shift from mainframes to networked PCs) combined with a development process that was ill-suited to reacting quickly to these changes.

I mentioned earlier that SMTP eventually culminates in RFC 821. This is a slight lie, for one of the key pieces of the Internet, and a core of the modern email architecture, didn't exist. That is DNS, which is the closest thing the Internet has to X.500 (a global, searchable directory of everything). Without DNS, figuring out how to route mail via SMTP is a bit of a challenge (hence why SMTP allowed explicit source routing, deprecated post-DNS in RFC 2821). The documents which lay out how to use DNS to route are RFC 974, RFC 1035, and RFC 1123. So it's fair to say that RFC 1123 is really the point at which modern SMTP was developed.

But enough about history, and on to the real topic of this post. The most important artifact of SMTP-based architecture is that different protocols are used to send email from the ones used to read email. This is both a good thing and a bad thing. On the one hand, it's easier to experiment with different ways of accessing mailboxes, or only supporting limited functionality where such is desired. On the other, the need to agree on a standard format still keeps all the protocols more or less intertwined, and it makes some heavily-desired features extremely difficult to implement. For example, there is still, thirty years later, no feasible way to send a mail and save it to a "Sent" folder on your IMAP mailbox without submitting it twice [2].

The greatest flaws in the modern architecture, I think, lie in particular in a bevy of historical design mistakes which remain unmitigated to this day, in particular in the base message format and MIME. Changing these specifications is not out of the question, but the rate at which the changes become adopted is agonizingly slow, to the point that changing is generally impossible unless necessary. Sending outright binary messages was proposed as experimental in 1995, proposed as a standard in 2000, and still remains relatively unsupported: the BINARYMIME SMTP keyword only exists on one of my 4 SMTP servers. Sending non-ASCII text is potentially possible, but it is still not used in major email clients to my knowledge (searching for "8BITMIME" leads to the top results generally being "how do I turn this off?"). It will be interesting to see how email address internationalization is handled, since it's the first major overhaul to email since the introduction of MIME—the first major overhaul in 16 years. Intriguingly enough, the NNTP and Usenet communities have shown themselves to be more adept to change: sending 8-bit Usenet messages generally works, and yEnc would have been a worthwhile addition to MIME if its author had ever attempted to push it through. His decision not to (with the weak excuses he claimed) is emblematic of the resistance of the architecture to change, even in cases where such change would be pretty beneficial.

My biggest complaint with the email architecture isn't actually really a flaw in the strict sense of the term but rather a disagreement. The core motto of email could perhaps be summed up with "Be liberal in what you accept and conservative in what you send." Now, I come from a compilers background, and the basic standpoint in compilers is, if a user does something wrong, to scream at them for being a bloody idiot and to reject their code. Actually, there's a tendency to do that even if they do something technically correct but possibly unintentionally wrong. I understand why people dislike this kind of strict checking, but I personally consider it to be a feature, not a bug. My experience with attempting to implement MIME is that accepting what amounts to complete crap not only means that everyone has to worry about parsing the crap, but it actually ends up encouraging it. The attitude people get in bugs starts becoming "this is supported by <insert other client>, and your client is broken for not supporting it," even when pointed out that their message is in flagrant violation of the specification. As I understand it, HTML 5 has the luxury of specifying a massively complex parser that makes /dev/urandom in theory reliably parsed across different implementations, but there is no such similar document for the modern email message. But we still have to deal with the utter crap people claim is a valid email message. Just this past week, upon sifting through my spam folder, I found a header which is best described as =?UTF-8?Q? ISO-8859-1, non-ASCII text ?= (spaces included). The only way people are going to realize that their tools are producing this kind of crap is if their tools stop working altogether.

These two issues come together most spectacularly when RFC 2047 is involved. This is worth a blog post by itself, but the very low technically-not-but-effectively-mandatory limit on the header length (to aide people who read email without clients) means that encoded words need to be split up to fit on header lines. If you're not careful, you can end up splitting multibyte characters between different encoded words. This unfortunately occurs in practice. Properly handling it in my new parser required completely reflowing the design of the innermost parsing function and greatly increasing implementation complexity. I would estimate that this single attempt to "gracefully" handle wrong-but-of-obvious-intent scenario is worth 15% or so of the total complexity of decoding RFC 2047-encoded text.

There are other issues with modern email, of course, but all of the ones that I've collected so far are not flaws in the architecture as a whole but rather flaws of individual portions of the architecture, so I'll leave them for later posts.

[1] The capital 'I' in "Internet email" is important, as it's referring to the "Internet" in "Internet Standard" or "Internet Engineering Task Force." Thus, "Internet email" means "the email standards developed for the Internet/by the IETF" and not "email used on the internet."
[2] Yes, I know about BURL. It doesn't count. Look at who supports it: almost nobody.

5 comments:

Ludovic Hirlimann said...

I wrote my views a few days ago at http://sietch-tabr.tumblr.com/post/58044873275/a-very-nice-video-explaining-why-email-is-hard-to

Mook said...

My understanding was that HTML was also underspecified - or rather, the specs only bore passing resemblance to reality. I believe that Hixie did some reverse-engineering of the browsers at the time (IE, Firefox, Safari, Opera) to figure out what to put in the HTML5 spec, on account of the same "be liberal &c" approach that browsers had.

I'd guess from XHTML uptake (and I don't mean text/html stuff) that people don't actually like sticking to specs all that much ;)

Jeffrey Stedfast said...

Hi, you might be interested in a similar blog post of mine about the complexities of rfc2047 header decoding.

Joshua Cranmer said...

Ludovic: Mark sent me that link some time ago, and I definitely agree with the author on several points. Especially the part about "if you want to ever see an email programmer get red in the face, ask him about CFWS." I am storing up a good rant on that.

Mook: I would say that it's not about liking sticking to specs as much as it is the difficulty of sticking to consistent interpretations of specs. At a high level, it's obvious how to do RFC 2047... but if you were to read the low-level details, there are a lot of small pitfalls, which is why people mess it up.

Jeffrey: Don't worry, I am currently working on a very thorough skewering on RFC 2047 and related work. Only RFC 2231 gets off light, because it's merely complex and not completely insane.

Jeffrey Stedfast said...

I'll be keeping my eye out for it ;-)

I've got 2 implementations of a mime parser at this point, one in C (GMime) and one in C# (MimeKit) that you are more than welcome to look over to see how I've solved these rfc2047 problems.

In GMime, you can find my rfc2047 decoder in gmime-utils.c, just grep for "rfc2047_token". I've got 2 tokenizers: tokenize_rfc2047_phrase() and tokenize_rfc2047_text(). The rfc2047_decode_tokens() function is what pieces them back together again, making sure to re-use the same base64/qp and iconv() state from encoded-word to encoded-word assuming they used the same charset in order to overcome the possibility of broken mailers discovering that their encoded-words were too long and randomly splitting the word down the middle to try and make it fit without any regards for either the payload encoding nor the charset encoding.

My C# implementation is in the Rfc2047.cs source file and is structured very similarly. I'm not sure it'd be any easier to read, but it might be. I suspect you won't have any trouble with reading the C implementation.

Hopefully you find my implementation at least somewhat useful, either by validating your own current approach (can't tell from your blog post if you've already implemented this or not) or by offering a slightly different approach that you hadn't considered.

(BTW, I'd love to see your solution once you've implemented it for the same reasons.)

FWIW, rfc2231 suffers from similar problems in that mailers might wrap in the middle of a hex triplet, but otherwise you're right, it's not quite as bad. At least you can rely on all of the pieces using the same charset encoding...

Anyway, it's good to see someone else that has hit all of the same frustrations with MIME that I have because they actually implementing the MIME specifications seriously instead of just blindly writing some regexes based on a few email samples they have in their Inbox and never bothering to read the specifications (like pretty much every mime parser library I can find source code for).

Good luck!