Wednesday, December 4, 2013

Why email is hard, part 4: Email addresses

This post is part 4 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. This post discusses the problems with email addresses.

You might be surprised that I find email addresses difficult enough to warrant a post discussing only this single topic. However, this is a surprisingly complex topic, and one which is made much harder by the presence of a very large number of people purporting to know the answer who then proceed to do the wrong thing [1]. To understand why email addresses are complicated, and why people do the wrong thing, I pose the following challenge: write a regular expression that matches all valid email addresses and only valid email addresses. Go ahead, stop reading, and play with it for a few minutes, and then you can compare your answer with the correct answer.




Done yet? So, if you came up with a regular expression, you got the wrong answer. But that's because it's a trick question: I never defined what I meant by a valid email address. Still, if you're hoping for partial credit, you may be able to get some by correctly matching one of the purported definitions I give below.

The most obvious definition meant by "valid email address" is text that matches the addr-spec production of RFC 822. No regular expression can match this definition, though—and I am aware of the enormous regular expression that is often purported to solve this problem. This is because comments can be nested, which means you would need to solve the "balanced parentheses" language, which is easily provable to be non-regular [2].
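To make the nesting argument concrete, here is a minimal sketch of a checker for balanced comments. The function name comments_balanced is made up for this illustration, and it deliberately ignores quoted strings (where parentheses are literal), so it is not a full RFC 822 parser. The point is the depth counter: it can grow without bound, which is exactly the kind of memory a finite automaton, and therefore a true regular expression, does not have.

```python
def comments_balanced(text: str) -> bool:
    """Return True if every '(' in text is closed and nesting is well-formed.

    Simplification (an assumption of this sketch): quoted strings are not
    recognized, so parentheses inside double quotes are counted as comments.
    """
    depth = 0  # current comment nesting level; unbounded in principle
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A quoted-pair: the backslash escapes the next character,
            # so skip both of them.
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A close with no matching open.
                return False
        i += 1
    return depth == 0
```

For instance, comments_balanced("a(b(c)d)e") is true, while comments_balanced("a(b(c)d") is false because the outer comment is never closed.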

Matching the addr-spec production, though, is the wrong thing to do: the production dictates the possible syntax forms an address may have, when you arguably want a more semantic interpretation. As a case in point, the two email addresses example@test.invalid and example @ test . invalid are both meant to refer to the same thing. When you ignore the actual full grammar of an email address and instead read the prose, particularly of RFC 5322 instead of RFC 822, you'll realize that matching comments and whitespace is entirely the wrong thing to do in the email address.

Here, though, we run into another problem. Email addresses are split into local-parts and the domain, the text before and after the @ character; the format of the local-part is basically either a quoted string (to escape otherwise illegal characters in a local-part), or an unquoted "dot-atom" production. The quoting is meant to be semantically invisible: "example"@test.invalid is the same email address as example@test.invalid. Normally, I would say that the use of quoted strings is an artifact of the encoding form, but given the strong appetite for aggressively "correct" email validators that attempt to blindly match the specification, it seems to me that it is better to keep the local-parts quoted if they need to be quoted. The dot-atom production matches a sequence of atoms (spans of text excluding several special characters like [ or .) separated by . characters, with no intervening spaces or comments allowed anywhere.
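The "keep it quoted only if it needs to be quoted" idea can be sketched as follows. The helper names are hypothetical, the atext character set is the one from RFC 5322 section 3.2.3, and the sketch assumes the local-part has already been separated from the domain and stripped of comments and folding whitespace.

```python
import re

# atext as defined in RFC 5322 section 3.2.3.
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: atoms separated by single dots, no leading or trailing dot.
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")


def unquote_local_part(local: str) -> str:
    """Remove surrounding double quotes and backslash escapes, if present."""
    if len(local) >= 2 and local[0] == '"' and local[-1] == '"':
        # Each quoted-pair (backslash + char) stands for the char itself.
        return re.sub(r"\\(.)", r"\1", local[1:-1], flags=re.DOTALL)
    return local


def canonical_local_part(local: str) -> str:
    """Return the local-part unquoted when possible, quoted otherwise."""
    value = unquote_local_part(local)
    if _DOT_ATOM.fullmatch(value):
        return value  # quoting was only an encoding artifact
    # Quoting is required: escape embedded quotes and backslashes.
    escaped = re.sub(r'(["\\])', r"\\\1", value)
    return f'"{escaped}"'
```

With this, canonical_local_part('"example"') and canonical_local_part('example') both give example, while a local-part containing a space stays quoted.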

RFC 5322 only specifies how to unfold the syntax into a semantic value, and it does not explain how to semantically interpret the values of an email address. For that, we must turn to SMTP's definition in RFC 5321, whose semantic definition clearly imparts requirements on the format of an email address not found in RFC 5322. On domains, RFC 5321 explains that the domain is either a standard domain name [3], or it is a domain literal which is either an IPv4 or an IPv6 address. Examples of the latter two forms are test@[] and test@[IPv6:::1]. But when it comes to the local-parts, RFC 5321 decides to just give up and admit no interpretation except at the final host, advising only that servers should avoid local-parts that need to be quoted. In the context of email specification, this kind of recommendation is effectively a requirement to not use such email addresses, and (by implication) most client code can avoid supporting these email addresses [4].
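As a rough illustration of the domain-literal forms, the following sketch uses Python's standard ipaddress module to interpret the bracketed part. The function name parse_domain_literal is invented here, and the ipaddress module's idea of a valid address is an assumption: it is close to, but not guaranteed to be identical to, the grammar in RFC 5321.

```python
import ipaddress


def parse_domain_literal(domain: str):
    """Return an IPv4Address or IPv6Address for a domain literal, else None.

    Anything that is not wrapped in square brackets is treated as an
    ordinary domain name and is not handled by this sketch.
    """
    if not (domain.startswith("[") and domain.endswith("]")):
        return None
    body = domain[1:-1]
    try:
        if body[:5].lower() == "ipv6:":
            # IPv6 literals carry an "IPv6:" tag before the address.
            return ipaddress.IPv6Address(body[5:])
        return ipaddress.IPv4Address(body)
    except ValueError:
        # ipaddress raises AddressValueError (a ValueError) on bad input.
        return None
```

For the IPv6 example above, parse_domain_literal("[IPv6:::1]") yields the loopback address ::1, and an empty literal such as "[]" yields None.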

The prospect of internationalized domain names and email addresses throws a massive wrench into the state of affairs, however. I've talked at length in part 2 about the problems here; the lack of a definitive decision on Unicode normalization means that the future here is extremely uncertain, although RFC 6530 does implicitly advise that servers should accept that some (but not all) clients are going to do NFC or NFKC normalization on email addresses.
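To show why the choice of normalization form matters, here is a small sketch using Python's standard unicodedata module. The helper name and the sample strings are my own; nothing here is prescribed by RFC 6530.

```python
import unicodedata


def normalized_forms(local: str) -> dict:
    """Return the local-part under each normalization a client might apply."""
    return {
        "raw": local,
        "NFC": unicodedata.normalize("NFC", local),
        "NFKC": unicodedata.normalize("NFKC", local),
    }


# "e" followed by a combining acute accent: NFC composes it to one code point.
decomposed = normalized_forms("jose\u0301")
# The "fi" ligature (U+FB01): NFC leaves it alone, NFKC expands it to "fi".
ligature = normalized_forms("\ufb01x")
```

A server that compares raw code points would treat the raw, NFC, and NFKC spellings of these local-parts as up to three different mailboxes, which is the uncertainty described above.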

At this point, it should be clear that asking for a regular expression to validate email addresses is really asking the wrong question. I did it at the beginning of this post because that is how the question tends to be phrased. The real question that people should be asking is "what characters are valid in an email address?" (and more specifically, the left-hand side of the email address, since the right-hand side is obviously a domain name). The answer is simple: among the ASCII printable characters (Unicode is more difficult), all the characters but those in the following string: " \"\\<>[]();,@". Indeed, viewing an email address like this is exactly how HTML 5 specifies it in its definition of a format for <input type="email">.
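The character-based view translates almost directly into code. This is a minimal sketch, with a made-up function name, that checks only the characters of an ASCII local-part against the excluded string given above; it does not check dot placement or anything about the domain.

```python
# The excluded characters from the string above: space, double quote,
# backslash, angle brackets, square brackets, parentheses, ';', ',' and '@'.
FORBIDDEN = set(' "\\<>[]();,@')


def local_part_chars_ok(local: str) -> bool:
    """True if local is non-empty and uses only allowed printable ASCII."""
    return bool(local) and all(
        # 0x21..0x7E is printable ASCII without the space character.
        0x21 <= ord(ch) <= 0x7E and ch not in FORBIDDEN
        for ch in local
    )
```

So local_part_chars_ok("first.last+tag") is true, while a local-part containing a space or a double quote is rejected.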

Another, much easier, more obvious, and simpler way to validate an email address relies on zero regular expressions and zero references to specifications. Just send an email to the purported address and ask the user to click on a unique link to complete registration. After all, the most common reason to request an email address is to be able to send messages to that email address, so if mail cannot be sent to it, the email address should be considered invalid, even if it is syntactically valid.
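A bare-bones sketch of that confirmation flow might look like this. The class, its in-memory storage, and the example.invalid URL are all assumptions for illustration; a real service would persist the tokens, expire them, and actually send the message.

```python
import secrets


class PendingConfirmations:
    """Tracks addresses waiting for the user to click a unique link."""

    def __init__(self):
        self._pending = {}  # token -> address, kept in memory for the sketch

    def issue(self, address: str) -> str:
        """Create an unguessable token for address and remember it."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = address
        return token

    def confirm(self, token: str):
        """Return the confirmed address, or None for unknown or reused tokens."""
        return self._pending.pop(token, None)


def confirmation_link(token: str) -> str:
    # Hypothetical URL layout; the host is a placeholder.
    return f"https://example.invalid/confirm?token={token}"
```

The address is only treated as valid once confirm() returns it, which happens at most once per token.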

Unfortunately, people persist in trying to write buggy email validators. Some are too simple and ignore valid characters (or valid top-level domain names!). Others are so focused on matching the RFC addr-spec syntax that, while they will happily accept most or all addr-spec forms, they also accept email addresses which are very likely to wreak havoc if you pass them to another system to send email; which can cause various forms of SQL injection, XSS injection, or even shell injection attacks; and which are likely to confuse tools as to what the email address actually is. This can be ameliorated with complicated normalization functions for email addresses, but none of the email validators I've looked at actually do this (which, again, goes to show that they're missing the point).

Which brings me to a second quiz question: are email addresses case-insensitive? If you answered no, well, you're wrong. If you answered yes, you're also wrong. The local-part, as RFC 5321 emphasizes, is not to be interpreted by anyone but the final destination MTA server. A consequence is that it does not specify if they are case-sensitive or case-insensitive, which means that general code should not assume that it is case-insensitive. Domains, of course, are case-insensitive, unless you're talking about internationalized domain names [5]. In practice, though, RFC 5321 admits that servers should make the names case-insensitive. For everyone else who uses email addresses, the effective result of this admission is that email addresses should be stored in their original case but matched case-insensitively (effectively, code should be case-preserving).
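Case-preserving storage with case-insensitive matching can be sketched like this. The AddressBook class is hypothetical, and lowercasing with str.lower() assumes ASCII addresses; as noted above, internationalized names need more care than this.

```python
class AddressBook:
    """Stores addresses in their original case, looks them up ignoring case."""

    def __init__(self):
        self._by_key = {}  # lowercase (local, domain) -> address as first seen

    @staticmethod
    def _key(address: str):
        # Split on the last '@' so quoted local-parts containing '@' survive.
        local, _, domain = address.rpartition("@")
        return (local.lower(), domain.lower())

    def add(self, address: str) -> str:
        """Store address unless an equivalent one exists; return the stored one."""
        return self._by_key.setdefault(self._key(address), address)

    def find(self, address: str):
        """Return the stored spelling of address, or None if unknown."""
        return self._by_key.get(self._key(address))
```

After book.add("John.Doe@Test.Invalid"), a lookup of john.doe@test.invalid returns the original mixed-case spelling, which is what should be used when actually sending mail.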

Hopefully this gives you a sense of why email addresses are frustrating and much more complicated than they first appear. There are historical artifacts of email addresses I've decided not to address (the roles of ! and % in addresses), but since they only matter to some SMTP implementations, I'll discuss them when I pick up SMTP in a later part (if I ever discuss them). I've avoided discussing some major issues with the specification here, because they are much better handled as part of the issues with email headers in general.

Oh, and if you were expecting regular expression answers to the challenge I gave at the beginning of the post, here are the answers I threw together for my various definitions of "valid email address." I didn't test or even try to compile any of these regular expressions (as you should have gathered, regular expressions are not what you should be using), so caveat emptor.

RFC 822 addr-spec
Impossible. Don't even try.
RFC 5322 non-obsolete addr-spec production
RFC 5322, unquoted email address
HTML 5's interpretation
Effective EAI-aware version
[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+@[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+, with the caveats that a dot does not begin or end the local-part, nor do two dots appear consecutively, the local-part is in NFC or NFKC form, and the domain is a valid domain name.
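For the curious, here is an equally untrustworthy sketch of the EAI-aware pattern with most of its prose caveats turned into checks. The character class used on both sides is the one from the domain half of the pattern, which is an assumption about what was intended for the local-part too. The function name is made up, only NFC is checked, and the "valid domain name" caveat is not checked at all.

```python
import re
import unicodedata

# Same excluded set on both sides of the '@' (an assumption of this sketch).
_PART = r"[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+"
_EAI_ADDRESS = re.compile(rf"({_PART})@({_PART})")


def looks_like_eai_address(address: str) -> bool:
    """Apply the pattern plus the dot and normalization caveats."""
    match = _EAI_ADDRESS.fullmatch(address)
    if match is None:
        return False
    local = match.group(1)
    # A dot may not begin or end the local-part, nor appear twice in a row.
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    # The local-part must already be in NFC form.
    return unicodedata.normalize("NFC", local) == local
```

Even when this returns true, the only real test remains sending mail to the address.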

[1] If you're trying to find guides on valid email addresses, a useful way to eliminate incorrect answers is to apply the following litmus tests. First, if the guide mentions an RFC, but does not mention RFC 5321 (or RFC 2821, in a pinch), you can generally ignore it. If the email address test (not) @ would be valid, then the author has clearly not carefully read and understood the specifications. If the guide mentions RFC 5321, RFC 5322, RFC 6530, and IDN, then the author clearly has taken the time to actually understand the subject matter and their opinion can be trusted.
[2] I'm using "regular" here in the sense of theoretical regular languages. Perl-compatible regular expressions can match non-regular languages (because of backreferences), but even backreferences can't solve the problem here. It appears that newer versions support a construct which can match balanced parentheses, but I'm going to discount that because by the time you're going to start using that feature, you have at least two problems.
[3] Specifically, if you want to get really technical, the domain name is going to be routed via MX records in DNS.
[4] RFC 5321 is the specification for SMTP, and, therefore, it is only truly binding for things that talk SMTP; likewise, RFC 5322 is only binding on people who speak email headers. When I say that systems can pretend that email addresses with domain literals or quoted local-parts don't exist, I'm excluding mail clients and mail servers. If you're writing a website and you need an email address, there is no need to support email addresses which don't exist on the open, public Internet.
[5] My usual approach to seeing internationalization at this point (if you haven't gathered from the lengthy second post of this series) is to assume that the specifications assume magic where case insensitivity is desired.


Bill said...

How do I run these regex in PHP?

Jeff Stedfast said...

Bill: I get the impression you didn't read the blog post and simply skipped to the end :-)

I've found the only way to parse email addresses correctly is to use a proper tokenizer, and even that proves difficult especially when trying to parse From/To/Cc/etc headers where you realistically have to include backtracking logic if you get to a point where the tokens you've read make absolutely no sense and then try and interpret portions of it as a badly formatted phrase by working backwards, etc, etc.

You can tell a lot about the quality of a MIME parser implementation based on how the parser deals with parsing address headers.

It's pretty brutal.

Haridas Gowra said...

Gr8 blog..........

Andreas said...


From the extensive coverage on the parsing of e-mail addresses as specified in the various RFCs, I gather that you are well versed on the subject.

Do you have a clear opinion on the use of quoted pairs in e-mail addresses? I restrict the scope of the question to that of message field headers (rfc 5322).

From reading the spec in rfc 5322, one is led to the conclusion that a quoted pair can only appear in the following two cases:

a. in a local-part inside a DQUOTE block and
b. in the domain.

The rule has not changed since the previous IMF rfc (2822) and most likely since the original one (822).

There are also no errata that revise the relevant text in either document.

Reading rfc 3696 (an informational document), a different picture surfaces. Here, the author (who is also the author of rfcs 5321/2) states, and I quote: "The exact rule is that any ASCII character, including control characters, may appear quoted, or in a quoted string."

The examples following the above quoted text (in the original document) do not place the local-part inside a DQUOTE block. Taking a look at the 3696 errata, the issue becomes even more complicated.
Here (Errata ID 246), the author himself replaces the relevant examples with ones where the local-part is always enclosed in a DQUOTE block. One could interpret this change as a sign that the author is trying to conform to the spec as defined in rfc 5322. The relevant text that precedes the examples remains unchanged.
The confusion reaches new heights when one reads further down the 3696 errata list. Errata ID 3563 revokes the change to the examples that was made by Errata ID 246 and aligns itself with the definition set forth in the text (may appear quoted, or in a quoted string). As this submitted erratum has been verified, the logical conclusion is that rfc 3696 has always meant to allow for the use of a quoted-pair outside a DQUOTE block.
As rfc 3696 is meant to deal with issues close to the client side (MUA), it should not have any consequences server-side (e.g. MSA). It does unfortunately create an issue, and in doing so contradicts even itself, when it states that "It only identifies the correct tests to be made if tests are to be applied."

In answering the original question myself, the most cautious approach would be to expect to find quoted-pairs in local-parts outside a DQUOTE block, but allow generating an e-mail address with quoted-pairs in local-parts only inside a DQUOTE block. This is a clear application of the robustness principle.

Your thoughts?

Joshua Cranmer said...

The RFCs for email are rather bad at dealing with how to understand the sobering reality of email, in stark contrast to the level of detail that, say, the HTML specification goes into with respect to errors.

Here are my recommendations:
1. If at all possible, reject email addresses that require quoted localparts (i.e., they contain some character from the SPECIALS set). They're unlikely to actually be valid email addresses in the sense that a user-visible mailbox is actually associated with them.
2. If you can't reject technically-valid-but-unlikely email addresses, emit them as cleanly as possible: quote the localpart only if necessary.
3. Quoted pairs are technically illegal outside of comment text and quoted strings. I've not done studies on how clients interpret such occurrences, but my gut instinct and inspection of the libraries I've looked at seem to indicate that this tends to be treated the same as inside a quoted-string.