Wednesday, December 4, 2013

Why email is hard, part 4: Email addresses

This post is part 4 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. This post discusses the problems with email addresses.

You might be surprised that I find email addresses difficult enough to warrant a post discussing only this single topic. However, this is a surprisingly complex topic, and one which is made much harder by the presence of a very large number of people purporting to know the answer who then proceed to do the wrong thing [1]. To understand why email addresses are complicated, and why people do the wrong thing, I pose the following challenge: write a regular expression that matches all valid email addresses and only valid email addresses. Go ahead, stop reading, and play with it for a few minutes, and then you can compare your answer with the correct answer.

 

 

 

Done yet? So, if you came up with a regular expression, you got the wrong answer. But that's because it's a trick question: I never defined what I meant by a valid email address. Still, if you're hoping for partial credit, you may be able to get some by correctly matching one of the purported definitions I give below.

The most obvious definition meant by "valid email address" is text that matches the addr-spec production of RFC 822. No regular expression can match this definition, though—and I am aware of the enormous regular expression that is often purported to solve this problem. This is because comments can be nested, which means you would need to solve the "balanced parentheses" language, which is easily provable to be non-regular [2].
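To make the nesting problem concrete, here is a minimal Python sketch (not taken from any RFC or library; the name `strip_comments` is made up for illustration). It removes comments by tracking the nesting depth in a counter, which is exactly the kind of unbounded state a true regular expression does not have. Quoted strings and quoted-pairs are handled only in a simplified way.

```python
def strip_comments(text: str) -> str:
    """Remove (possibly nested) parenthesized comments outside quoted strings."""
    out = []
    depth = 0          # current comment nesting depth
    in_quotes = False  # inside a "quoted string"?
    i = 0
    while i < len(text):
        ch = text[i]
        # A quoted-pair (backslash + char) is only special in comments and quoted strings.
        if ch == "\\" and i + 1 < len(text) and (depth > 0 or in_quotes):
            if depth == 0:
                out.append(text[i:i + 2])  # keep it verbatim inside a quoted string
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced comment")
    return "".join(out)
```

The counter can grow without bound, so no fixed finite-state pattern can stand in for it.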

Matching the addr-spec production, though, is the wrong thing to do: the production dictates the possible syntax forms an address may have, when you arguably want a more semantic interpretation. As a case in point, the two email addresses example@test.invalid and example @ test . invalid are both meant to refer to the same thing. When you ignore the actual full grammar of an email address and instead read the prose, particularly of RFC 5322 instead of RFC 822, you'll realize that matching comments and whitespace is entirely the wrong thing to do in the email address.

Here, though, we run into another problem. Email addresses are split into local-parts and the domain, the text before and after the @ character; the format of the local-part is basically either a quoted string (to escape otherwise illegal characters in a local-part), or an unquoted "dot-atom" production. The quoting is meant to be semantically invisible: "example"@test.invalid is the same email address as example@test.invalid. Normally, I would say that the use of quoted strings is an artifact of the encoding form, but given the strong appetite for aggressively "correct" email validators that attempt to blindly match the specification, it seems to me that it is better to keep the local-parts quoted if they need to be quoted. The dot-atom production matches a sequence of atoms (spans of text excluding several special characters like [ or .) separated by . characters, with no intervening spaces or comments allowed anywhere.
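As a rough illustration of "quote only if needed", the following Python sketch strips the quoting from a local-part and then re-emits it, adding quotes only when the value cannot be written as a dot-atom. The helper names are hypothetical, and the `ATEXT` character class is my transcription of the atext set from RFC 5322, so treat it as an assumption rather than a reference implementation.

```python
import re

# atext from RFC 5322: ASCII letters, digits, and these symbols (assumed transcription).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

def unquote_local_part(local: str) -> str:
    """Return the semantic value: drop surrounding quotes and undo quoted-pairs."""
    if len(local) >= 2 and local[0] == '"' and local[-1] == '"':
        return re.sub(r"\\(.)", r"\1", local[1:-1])
    return local

def format_local_part(value: str) -> str:
    """Emit a local-part, quoting only when the dot-atom form cannot express it."""
    if DOT_ATOM.fullmatch(value):
        return value
    escaped = re.sub(r'(["\\])', r"\\\1", value)  # escape quotes and backslashes
    return f'"{escaped}"'
```

With this, `"example"` and `example` both come out as `example`, while a value containing a space stays quoted.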

RFC 5322 only specifies how to unfold the syntax into a semantic value, and it does not explain how to semantically interpret the values of an email address. For that, we must turn to SMTP's definition in RFC 5321, whose semantic definition clearly imparts requirements on the format of an email address not found in RFC 5322. On domains, RFC 5321 explains that the domain is either a standard domain name [3], or it is a domain literal which is either an IPv4 or an IPv6 address. Examples of the latter two forms are test@[127.0.0.1] and test@[IPv6:::1]. But when it comes to the local-parts, RFC 5321 decides to just give up and admit no interpretation except at the final host, advising only that servers should avoid local-parts that need to be quoted. In the context of email specification, this kind of recommendation is effectively a requirement to not use such email addresses, and (by implication) most client code can avoid supporting these email addresses [4].
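For the domain-literal forms, a small Python sketch using the standard `ipaddress` module might look like the following. The function name `parse_domain_literal` is an assumption for illustration, and the sketch only covers the IPv4 and `IPv6:`-tagged forms mentioned above.

```python
import ipaddress

def parse_domain_literal(domain: str):
    """Parse an address literal such as [127.0.0.1] or [IPv6:::1]."""
    if not (domain.startswith("[") and domain.endswith("]")):
        raise ValueError("not a domain literal")
    inner = domain[1:-1]
    if inner[:5].lower() == "ipv6:":
        # Everything after the "IPv6:" tag is the address itself.
        return ipaddress.IPv6Address(inner[5:])
    return ipaddress.IPv4Address(inner)
```

Both constructors raise a `ValueError` subclass on malformed input, so a caller can treat any `ValueError` as "not a usable literal".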

The prospect of internationalized domain names and email addresses throws a massive wrench into the state of affairs, however. I've talked at length in part 2 about the problems here; the lack of a definitive decision on Unicode normalization means that the future here is extremely uncertain, although RFC 6530 does implicitly advise that servers should accept that some (but not all) clients are going to do NFC or NFKC normalization on email addresses.
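To see why the choice of normalization form matters, this short Python sketch (using the standard `unicodedata` module; the function name is hypothetical) computes both forms of a string. NFC only composes characters, while NFKC also folds compatibility characters such as ligatures, so two clients can disagree about what the "same" address is.

```python
import unicodedata

def normalized_forms(text: str) -> dict:
    """Return the NFC and NFKC normalizations of text, keyed by form name."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFKC")}

# "e" followed by a combining acute accent: both forms compose it to U+00E9.
accented = normalized_forms("e\u0301")

# The "fi" ligature U+FB01: NFC leaves it alone, NFKC expands it to "fi".
ligature = normalized_forms("\ufb01")
```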

At this point, it should be clear that asking for a regular expression to validate email addresses is really asking the wrong question. I did it at the beginning of this post because that is how the question tends to be phrased. The real question that people should be asking is "what characters are valid in an email address?" (and more specifically, the left-hand side of the email address, since the right-hand side is obviously a domain name). The answer is simple: among the ASCII printable characters (Unicode is more difficult), all the characters but those in the following string: " \"\\<>[]();,@". Indeed, viewing an email address like this is exactly how HTML 5 specifies it in its definition of a format for <input type="email">.
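The character-based view translates almost directly into code. The Python sketch below uses exactly the excluded-character string quoted above; the names `FORBIDDEN` and `local_part_chars_ok` are invented for illustration, and the check deliberately ignores Unicode and the placement of dots.

```python
# The excluded characters, copied from the string given in the text above.
FORBIDDEN = set(" \"\\<>[]();,@")

def local_part_chars_ok(local: str) -> bool:
    """True if local is non-empty, printable ASCII, and avoids FORBIDDEN."""
    return bool(local) and all(
        0x21 <= ord(ch) <= 0x7E and ch not in FORBIDDEN for ch in local
    )
```

A check like this is a filter for obviously unusable input, not proof that a mailbox exists.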

Another, much easier, more obvious, and simpler way to validate an email address relies on zero regular expressions and zero references to specifications. Just send an email to the purported address and ask the user to click on a unique link to complete registration. After all, the most common reason to request an email address is to be able to send messages to that email address, so if mail cannot be sent to it, the email address should be considered invalid, even if it is syntactically valid.
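A bare-bones sketch of that confirmation flow in Python could look like this. Everything here is an assumption for illustration: the in-memory `pending` dictionary, the function names, and the placeholder URL. Actually sending the message and expiring old tokens are left out.

```python
import secrets

# token -> address awaiting confirmation (a real site would persist this with an expiry)
pending = {}

def start_verification(address: str,
                       base_url: str = "https://example.invalid/confirm") -> str:
    """Create an unguessable token for address and return the link to email out."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return f"{base_url}?token={token}"

def confirm(token: str):
    """Return the confirmed address for a token, or None if unknown or already used."""
    return pending.pop(token, None)
```

If the link is never clicked, the address is treated as invalid regardless of how well-formed it looks.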

Unfortunately, people persist in trying to write buggy email validators. Some are too simple and ignore valid characters (or valid top-level domain names!). Others are so focused on trying to match the RFC addr-spec syntax that, while they will happily accept most or all addr-spec forms, they also result in email addresses which are very likely to wreak havoc if you pass them to another system to send email; which can cause various forms of SQL injection, XSS injection, or even shell injection attacks; and which are likely to confuse tools as to what the email address actually is. This can be ameliorated with complicated normalization functions for email addresses, but none of the email validators I've looked at actually do this (which, again, goes to show that they're missing the point).

Which brings me to a second quiz question: are email addresses case-insensitive? If you answered no, well, you're wrong. If you answered yes, you're also wrong. The local-part, as RFC 5321 emphasizes, is not to be interpreted by anyone but the final destination MTA server. A consequence is that it does not specify if they are case-sensitive or case-insensitive, which means that general code should not assume that it is case-insensitive. Domains, of course, are case-insensitive, unless you're talking about internationalized domain names [5]. In practice, though, RFC 5321 admits that servers should make the names case-insensitive. For everyone else who uses email addresses, the effective result of this admission is that email addresses should be stored in their original case but matched case-insensitively (effectively, code should be case-preserving).
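One way to picture "case-preserving" is a lookup table keyed on a folded form while storing the address as originally written. The Python sketch below is only an illustration under that assumption: `match_key` and `AddressBook` are hypothetical names, and the folding is only meaningful for ASCII addresses, not internationalized ones.

```python
def match_key(address: str) -> str:
    """Build a case-insensitive comparison key; the stored address is untouched."""
    local, _, domain = address.rpartition("@")
    return f"{local.casefold()}@{domain.lower()}"

class AddressBook:
    """Stores addresses in their original case, matches them case-insensitively."""

    def __init__(self):
        self._by_key = {}

    def add(self, address: str) -> None:
        # Keep the first spelling seen for a given key.
        self._by_key.setdefault(match_key(address), address)

    def find(self, address: str):
        return self._by_key.get(match_key(address))
```

Outgoing mail would use the stored spelling, so a server that really is case-sensitive still receives the local-part the user typed.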

Hopefully this gives you a sense of why email addresses are frustrating and much more complicated than they first appear. There are historical artifacts of email addresses I've decided not to address (the roles of ! and % in addresses), but since they only matter to some SMTP implementations, I'll discuss them when I pick up SMTP in a later part (if I ever discuss them). I've avoided discussing some major issues with the specification here, because they are much better handled as part of the issues with email headers in general.

Oh, and if you were expecting regular expression answers to the challenge I gave at the beginning of the post, here are the answers I threw together for my various definitions of "valid email address." I didn't test or even try to compile any of these regular expressions (as you should have gathered, regular expressions are not what you should be using), so caveat emptor.

RFC 822 addr-spec
Impossible. Don't even try.
RFC 5322 non-obsolete addr-spec production
([^\x00-\x20()<>\[\]:;@\\,.]+(\.[^\x00-\x20()<>\[\]:;@\\,.]+)*|"(\\.|[^\\"])*")@([^\x00-\x20()<>\[\]:;@\\,.]+(\.[^\x00-\x20()<>\[\]:;@\\,.]+)*|\[(\\.|[^\\\]])*\])
RFC 5322, unquoted email address
.*@([^\x00-\x20()<>\[\]:;@\\,.]+(\.[^\x00-\x20()<>\[\]:;@\\,.]+)*|\[(\\.|[^\\\]])*\])
HTML 5's interpretation
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*
Effective EAI-aware version
[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+@[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+, with the caveats that a dot does not begin or end the local-part, nor do two dots appear consecutively, the local-part is in NFC or NFKC form, and the domain is a valid domain name.
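For anyone who does want to experiment with one of these, here is a small Python sketch that compiles the HTML 5 pattern listed above, copied verbatim, and applies it with `fullmatch` so the whole string must match. The name `html5_valid` is made up, and passing this check says nothing about whether mail can actually be delivered.

```python
import re

# The "HTML 5's interpretation" pattern from the list above, copied verbatim.
HTML5_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)

def html5_valid(address: str) -> bool:
    """True if the entire string matches the HTML 5 email pattern."""
    return HTML5_EMAIL.fullmatch(address) is not None
```

Note that this pattern rejects quoted local-parts, comments, and domain literals outright, which matches the recommendation to ignore those forms on the public Internet.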

[1] If you're trying to find guides on valid email addresses, the following litmus tests are a useful way to eliminate incorrect answers. First, if the guide mentions an RFC, but does not mention RFC 5321 (or RFC 2821, in a pinch), you can generally ignore it. If the email address test (not) @ example.com would be valid, then the author has clearly not carefully read and understood the specifications. If the guide mentions RFC 5321, RFC 5322, RFC 6530, and IDN, then the author clearly has taken the time to actually understand the subject matter and their opinion can be trusted.
[2] I'm using "regular" here in the sense of theoretical regular languages. Perl-compatible regular expressions can match non-regular languages (because of backreferences), but even backreferences can't solve the problem here. It appears that newer versions support a construct which can match balanced parentheses, but I'm going to discount that because by the time you're going to start using that feature, you have at least two problems.
[3] Specifically, if you want to get really technical, the domain name is going to be routed via MX records in DNS.
[4] RFC 5321 is the specification for SMTP, and, therefore, it is only truly binding for things that talk SMTP; likewise, RFC 5322 is only binding on people who speak email headers. When I say that systems can pretend that email addresses with domain literals or quoted local-parts don't exist, I'm excluding mail clients and mail servers. If you're writing a website and you need an email address, there is no need to support email addresses which don't exist on the open, public Internet.
[5] My usual approach to seeing internationalization at this point (if you haven't gathered from the lengthy second post of this series) is to assume that the specifications assume magic where case insensitivity is desired.

48 comments:

Unknown said...

Bill: I get the impression you didn't read the blog post and simply skipped to the end :-)

I've found the only way to parse email addresses correctly is to use a proper tokenizer, and even that proves difficult especially when trying to parse From/To/Cc/etc headers where you realistically have to include backtracking logic if you get to a point where the tokens you've read make absolutely no sense and then try and interpret portions of it as a badly formatted phrase by working backwards, etc, etc.

You can tell a lot about the quality of a MIME parser implementation based on how the parser deals with parsing address headers.

It's pretty brutal.

Andreas said...

Hi.

From the extensive coverage on the parsing of e-mail addresses as specified in the various RFCs, I gather that you are well versed on the subject.

Do you have a clear opinion on the use of quoted pairs in e-mail addresses? I restrict the scope of the question to that of message field headers (rfc 5322).

From reading the spec in rfc 5322, one is led to the conclusion that a quoted pair can only appear in the following two cases:

a. in a local-part inside a DQUOTE block and
b. in the domain.

The rule has not changed since the previous IMF rfc (2822) and most likely since the original one (822).

There are also no errata that revise the relevant text in both documents.

Reading rfc 3696 (an informational document) a different picture surfaces. Here, the author (who is also the author of rfcs 5321/2) states, and I quote: "The exact rule is that any ASCII character, including control characters, may appear quoted, or in a quoted string."

The examples following the above quoted text (in the original document) do not place the local-part inside a DQUOTE block. Taking a look at the 3696 errata, the issue becomes even more complicated.
Here (Errata ID 246), the author himself replaces the relevant examples with ones where the local-part is always enclosed in a DQUOTE block. One could interpret this change as a sign that the author is trying to conform to the spec as defined in rfc 5322. The relevant text that precedes the examples remains unchanged.
The confusion reaches new heights when one reads further down the 3696 errata list. Errata ID 3563 revokes the change to the examples that was made by errata entry 246 and aligns itself with the definition set forth in the text (may appear quoted, or in a quoted string). As this submitted errata has been verified, the logical conclusion is that rfc 3696 has always meant to allow for the use of a quoted-pair outside a DQUOTE block.
As rfc 3696 is meant to deal with issues close to the client-side (MUA), it should not have any consequences server-side (e.g. MSA). It does unfortunately create an issue and in doing so contradicts even itself, when it states that "It only identifies the correct tests to be made if tests are to be applied."

In answering the original question myself, the most cautious approach would be to expect to find quoted-pairs in local-parts outside a DQUOTE block, but allow generating an e-mail address with quoted-pairs in local-parts only inside a DQUOTE block. This is a clear application of the robustness principle.

Your thoughts?

Joshua Cranmer said...

Andreas:
The RFCs for email are rather bad at dealing with how to understand the sobering reality of email, in stark contrast to the level of detail that, say, the HTML specification goes into with respect to errors.

Here are my recommendations:
1. If at all possible, reject email addresses that require quoted localparts (i.e., they contain some character from the SPECIALS set). They're unlikely to actually be valid email addresses in the sense that a user-visible mailbox is actually associated with them.
2. If you can't reject technically-valid-but-unlikely email addresses, emit them as cleanly as possible: quote the localpart only if necessary.
3. Quoted pairs are technically illegal outside of comment text and quoted strings. I've not done studies on how clients interpret such occurrences, but my gut instinct and inspection of the libraries I've looked at seem to indicate that this tends to be treated the same as inside a quoted-string.
