Wednesday, December 4, 2013

Why email is hard, part 4: Email addresses

This post is part 4 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. This post discusses the problems with email addresses.

You might be surprised that I find email addresses difficult enough to warrant a post discussing only this single topic. However, this is a surprisingly complex topic, and one which is made much harder by the presence of a very large number of people purporting to know the answer who then proceed to do the wrong thing [1]. To understand why email addresses are complicated, and why people do the wrong thing, I pose the following challenge: write a regular expression that matches all valid email addresses and only valid email addresses. Go ahead, stop reading, and play with it for a few minutes, and then you can compare your answer with the correct answer.




Done yet? So, if you came up with a regular expression, you got the wrong answer. But that's because it's a trick question: I never defined what I meant by a valid email address. Still, if you're hoping for partial credit, you may be able to get some by correctly matching one of the purported definitions I give below.

The most obvious definition meant by "valid email address" is text that matches the addr-spec production of RFC 822. No regular expression can match this definition, though—and I am aware of the enormous regular expression that is often purported to solve this problem. This is because comments can be nested, which means you would need to solve the "balanced parentheses" language, which is easily provable to be non-regular [2].
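
To make the nesting problem concrete, here is a minimal Python sketch that strips nested comments by tracking a depth counter, which is exactly the unbounded state a true regular expression lacks. The function name strip_comments is hypothetical, and the sketch assumes the input contains no quoted strings or domain literals (where parentheses would not start a comment).

```python
def strip_comments(text: str) -> str:
    """Remove (possibly nested) parenthesized comments from text.

    Simplifying assumption: no quoted strings or domain literals are
    present, so every unescaped parenthesis opens or closes a comment.
    """
    out = []
    depth = 0  # current comment nesting depth
    i = 0
    while i < len(text):
        ch = text[i]
        if depth > 0 and ch == "\\":
            # Inside a comment, a backslash quotes the next character,
            # so skip both without touching the depth counter.
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
        elif depth == 0:
            out.append(ch)  # keep only text outside all comments
        i += 1
    if depth:
        raise ValueError("unterminated comment")
    return "".join(out)
```

For example, strip_comments("example(a (nested) comment)@test.invalid") returns "example@test.invalid".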

Matching the addr-spec production, though, is the wrong thing to do: the production dictates the possible syntax forms an address may have, when you arguably want a more semantic interpretation. As a case in point, the two email addresses example@test.invalid and example @ test . invalid are both meant to refer to the same thing. When you ignore the actual full grammar of an email address and instead read the prose, particularly of RFC 5322 instead of RFC 822, you'll realize that matching comments and whitespace is entirely the wrong thing to do in the email address.

Here, though, we run into another problem. Email addresses are split into local-parts and the domain, the text before and after the @ character; the format of the local-part is basically either a quoted string (to escape otherwise illegal characters in a local-part), or an unquoted "dot-atom" production. The quoting is meant to be semantically invisible: "example"@test.invalid is the same email address as example@test.invalid. Normally, I would say that the use of quoted strings is an artifact of the encoding form, but given the strong appetite for aggressively "correct" email validators that attempt to blindly match the specification, it seems to me that it is better to keep the local-parts quoted if they need to be quoted. The dot-atom production matches a sequence of atoms (spans of text excluding several special characters like [ or .) separated by . characters, with no intervening spaces or comments allowed anywhere.
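
As a rough illustration of "quote only if needed," the following Python sketch keeps a local-part unquoted when it matches the dot-atom form and quotes it otherwise. The names format_local_part and parse_local_part are hypothetical, the atext character set is taken from RFC 5322, and the sketch only handles ASCII.

```python
import re

# atext from RFC 5322: letters, digits, and these printable characters.
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: one or more atoms separated by single dots.
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")


def format_local_part(value: str) -> str:
    """Emit a local-part, quoting it only when dot-atom cannot express it."""
    if _DOT_ATOM.fullmatch(value):
        return value
    # Backslash and double quote must be escaped inside a quoted string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def parse_local_part(text: str) -> str:
    """Return the semantic value: drop surrounding quotes and unescape."""
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return re.sub(r"\\(.)", r"\1", text[1:-1])
    return text
```

With this, parse_local_part('"example"') and parse_local_part("example") both yield example, while format_local_part("john smith") comes back quoted because a space is not allowed in a dot-atom.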

RFC 5322 only specifies how to unfold the syntax into a semantic value, and it does not explain how to semantically interpret the values of an email address. For that, we must turn to SMTP's definition in RFC 5321, whose semantic definition clearly imparts requirements on the format of an email address not found in RFC 5322. On domains, RFC 5321 explains that the domain is either a standard domain name [3], or it is a domain literal which is either an IPv4 or an IPv6 address. Examples of the latter two forms are test@[] and test@[IPv6:::1]. But when it comes to the local-parts, RFC 5321 decides to just give up and admit no interpretation except at the final host, advising only that servers should avoid local-parts that need to be quoted. In the context of email specification, this kind of recommendation is effectively a requirement to not use such email addresses, and (by implication) most client code can avoid supporting these email addresses [4].
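
A small Python sketch of the domain side, using the standard ipaddress module: it distinguishes an ordinary domain name from the two address-literal forms. The function name parse_domain_literal is hypothetical, and the sketch ignores the general address literal form that RFC 5321 also reserves.

```python
import ipaddress


def parse_domain_literal(domain: str):
    """Return an IP address object for a '[...]' literal, else None.

    Raises ValueError (via ipaddress) if the literal is malformed.
    """
    if not (domain.startswith("[") and domain.endswith("]")):
        return None  # ordinary domain name, not a literal
    body = domain[1:-1]
    if body.lower().startswith("ipv6:"):
        # IPv6 literals carry an "IPv6:" tag before the address itself.
        return ipaddress.IPv6Address(body[5:])
    return ipaddress.IPv4Address(body)
```

Given the domain of test@[IPv6:::1], this returns the IPv6 loopback address; given test.invalid it returns None.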

The prospect of internationalized domain names and email addresses throws a massive wrench into the state of affairs, however. I've talked at length in part 2 about the problems here; the lack of a definitive decision on Unicode normalization means that the future here is extremely uncertain, although RFC 6530 does implicitly advise that servers should accept that some (but not all) clients are going to do NFC or NFKC normalization on email addresses.

At this point, it should be clear that asking for a regular expression to validate email addresses is really asking the wrong question. I did it at the beginning of this post because that is how the question tends to be phrased. The real question that people should be asking is "what characters are valid in an email address?" (and more specifically, the left-hand side of the email address, since the right-hand side is obviously a domain name). The answer is simple: among the ASCII printable characters (Unicode is more difficult), all the characters but those in the following string: " \"\\<>[]();,@". Indeed, viewing an email address like this is exactly how HTML 5 specifies it in its definition of a format for <input type="email">.

Another, much easier, more obvious, and simpler way to validate an email address relies on zero regular expressions and zero references to specifications. Just send an email to the purported address and ask the user to click on a unique link to complete registration. After all, the most common reason to request an email address is to be able to send messages to that email address, so if mail cannot be sent to it, the email address should be considered invalid, even if it is syntactically valid.
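
The bookkeeping behind such a confirmation link can be sketched in a few lines of Python. Everything here is an assumption for illustration: the names start_verification and confirm, the in-memory pending dictionary standing in for a real datastore, and the example URL. Actually sending the message is left out.

```python
import secrets
from urllib.parse import urlencode

# token -> address awaiting confirmation (stand-in for a real datastore)
pending = {}


def start_verification(address: str, base_url: str) -> str:
    """Record the address and return the unique link to email to it."""
    token = secrets.token_urlsafe(32)  # unguessable, URL-safe token
    pending[token] = address
    return f"{base_url}?{urlencode({'token': token})}"


def confirm(token: str):
    """Return the confirmed address, or None if the token is unknown.

    Each token works once: it is removed as soon as it is used.
    """
    return pending.pop(token, None)
```

A registration handler would mail the link returned by start_verification("example@test.invalid", "https://site.invalid/confirm") and treat the address as valid only once confirm() returns it.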

Unfortunately, people persist in trying to write buggy email validators. Some are too simple and ignore valid characters (or valid top-level domain names!). Others are so focused on trying to match the RFC addr-spec syntax that, while they will happily accept most or all addr-spec forms, they also result in email addresses which are very likely to wreak havoc if you pass them to another system to send email; which can cause various forms of SQL injection, XSS injection, or even shell injection attacks; and which are likely to confuse tools as to what the email address actually is. This can be ameliorated with complicated normalization functions for email addresses, but none of the email validators I've looked at actually do this (which, again, goes to show that they're missing the point).

Which brings me to a second quiz question: are email addresses case-insensitive? If you answered no, well, you're wrong. If you answered yes, you're also wrong. The local-part, as RFC 5321 emphasizes, is not to be interpreted by anyone but the final destination MTA server. A consequence is that it does not specify if they are case-sensitive or case-insensitive, which means that general code should not assume that it is case-insensitive. Domains, of course, are case-insensitive, unless you're talking about internationalized domain names [5]. In practice, though, RFC 5321 admits that servers should make the names case-insensitive. For everyone else who uses email addresses, the effective result of this admission is that email addresses should be stored in their original case but matched case-insensitively (effectively, code should be case-preserving).
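
A minimal Python sketch of "store in original case, match case-insensitively" follows. The names match_key and AddressBook are hypothetical, and the sketch assumes plain ASCII addresses; internationalized domain names would need more care than simple case folding.

```python
def match_key(address: str) -> str:
    """Build a comparison key; the original string is never altered."""
    # Split on the last "@" so that a quoted "@" in the local-part survives.
    local, _, domain = address.rpartition("@")
    return f"{local.casefold()}@{domain.lower()}"


class AddressBook:
    """Case-preserving store with case-insensitive lookup."""

    def __init__(self):
        self._by_key = {}

    def add(self, address: str) -> None:
        # Keep the first spelling seen for each key.
        self._by_key.setdefault(match_key(address), address)

    def find(self, address: str):
        return self._by_key.get(match_key(address))
```

After book.add("John.Smith@Test.Invalid"), a lookup with book.find("john.smith@test.invalid") returns the address in its original case.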

Hopefully this gives you a sense of why email addresses are frustrating and much more complicated than they first appear. There are historical artifacts of email addresses I've decided not to address (the roles of ! and % in addresses), but since they only matter to some SMTP implementations, I'll discuss them when I pick up SMTP in a later part (if I ever discuss them). I've avoided discussing some major issues with the specification here, because they are much better handled as part of the issues with email headers in general.

Oh, and if you were expecting regular expression answers to the challenge I gave at the beginning of the post, here are the answers I threw together for my various definitions of "valid email address." I didn't test or even try to compile any of these regular expressions (as you should have gathered, regular expressions are not what you should be using), so caveat emptor.

RFC 822 addr-spec
Impossible. Don't even try.
RFC 5322 non-obsolete addr-spec production
RFC 5322, unquoted email address
HTML 5's interpretation
Effective EAI-aware version
[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+@[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+, with the caveats that a dot does not begin or end the local-part, nor do two dots appear consecutively, the local part is in NFC or NFKC form, and the domain is a valid domain name.
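
For the EAI-aware definition, the pattern and most of its caveats can be combined in a short Python sketch. The function name looks_like_eai_address is hypothetical, both character classes are written in the same form, and the "valid domain name" caveat is deliberately not checked. Only NFC is tested, on the understanding that text already in NFKC form is also in NFC form.

```python
import re
import unicodedata

# Same excluded characters on both sides of the single "@".
_PART = r"[^\x00-\x20\x80-\x9f()<>\[\]:;@\\,]+"
_ADDRESS = re.compile(_PART + "@" + _PART)


def looks_like_eai_address(address: str) -> bool:
    """Apply the pattern plus the dot and normalization caveats."""
    if not _ADDRESS.fullmatch(address):
        return False
    local, _, _domain = address.partition("@")
    # A dot may not begin or end the local-part, nor appear twice in a row.
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    # The local-part must already be normalized.
    if unicodedata.normalize("NFC", local) != local:
        return False
    return True  # domain validity is NOT checked in this sketch
```

Like the expressions above, this is a sketch rather than a validator to rely on; sending a confirmation message remains the real test.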

[1] If you're trying to find guides on valid email addresses, the following litmus tests are a useful way to eliminate incorrect answers. First, if the guide mentions an RFC, but does not mention RFC 5321 (or RFC 2821, in a pinch), you can generally ignore it. If the email address test (not) @ would be valid, then the author has clearly not carefully read and understood the specifications. If the guide mentions RFC 5321, RFC 5322, RFC 6530, and IDN, then the author clearly has taken the time to actually understand the subject matter and their opinion can be trusted.
[2] I'm using "regular" here in the sense of theoretical regular languages. Perl-compatible regular expressions can match non-regular languages (because of backreferences), but even backreferences can't solve the problem here. It appears that newer versions support a construct which can match balanced parentheses, but I'm going to discount that because by the time you're going to start using that feature, you have at least two problems.
[3] Specifically, if you want to get really technical, the domain name is going to be routed via MX records in DNS.
[4] RFC 5321 is the specification for SMTP, and, therefore, it is only truly binding for things that talk SMTP; likewise, RFC 5322 is only binding on people who speak email headers. When I say that systems can pretend that email addresses with domain literals or quoted local-parts don't exist, I'm excluding mail clients and mail servers. If you're writing a website and you need an email address, there is no need to support email addresses which don't exist on the open, public Internet.
[5] My usual approach to seeing internationalization at this point (if you haven't gathered from the lengthy second post of this series) is to assume that the specifications assume magic where case insensitivity is desired.


Bill said...

How do I run these regex in PHP?

Unknown said...

Bill: I get the impression you didn't read the blog post and simply skipped to the end :-)

I've found the only way to parse email addresses correctly is to use a proper tokenizer, and even that proves difficult especially when trying to parse From/To/Cc/etc headers where you realistically have to include backtracking logic if you get to a point where the tokens you've read make absolutely no sense and then try and interpret portions of it as a badly formatted phrase by working backwards, etc, etc.

You can tell a lot about the quality of a MIME parser implementation based on how the parser deals with parsing address headers.

It's pretty brutal.

Andreas said...


From the extensive coverage on the parsing of e-mail addresses as specified in the various RFCs, I gather that you are well versed on the subject.

Do you have a clear opinion on the use of quoted pairs in e-mail addresses? I restrict the scope of the question to that of message field headers (rfc 5322).

From reading the spec in rfc 5322, one is led to the conclusion that a quoted pair can only appear in the following two cases:

a. in a local-part inside a DQUOTE block and
b. in the domain.

The rule has not changed since the previous IMF rfc (2822) and most likely since the original one (822).

There are also no errata that revise the relevant text in either document.

Reading rfc 3696 (an informational document), a different picture surfaces. Here, the author (who is also the author of rfcs 5321/2) states, and I quote: "The exact rule is that any ASCII character, including control characters, may appear quoted, or in a quoted string."

The examples following the above quoted text (in the original document) do not place the local-part inside a DQUOTE block. Taking a look at the 3696 errata, the issue becomes even more complicated.
Here (Errata ID 246), the author himself replaces the relevant examples with ones where the local-part is always enclosed in a DQUOTE block. One could interpret this change as a sign that the author is trying to conform to the spec as defined in rfc 5322. The relevant text that precedes the examples remains unchanged.
The confusion reaches new heights when one reads further down the 3696 errata list. Errata ID 3563 revokes the change to the examples that was made by the errata entry with an id of 246 and aligns itself with the definition set forth in the text (may appear quoted, or in a quoted string). As this submitted errata has been verified, the logical conclusion is that rfc 3696 was always meant to allow for the use of a quoted-pair outside a DQUOTE block.
As rfc 3696 is meant to deal with issues close to the client-side (MUA), it should not have any consequences server-side (e.g. MSA). It does unfortunately create an issue and in doing so contradicts even itself, when it states that "It only identifies the correct tests to be made if tests are to be applied."

In answering the original question myself, the most cautious approach would be to expect to find quoted-pairs in local-parts outside a DQUOTE block, but allow generating an e-mail address with quoted-pairs in local-parts only inside a DQUOTE block. This is a clear application of the robustness principle.

Your thoughts?

Joshua Cranmer said...

The RFCs for email are rather bad at dealing with how to understand the sobering reality of email, in stark contrast to the level of detail that, say, the HTML specification goes into with respect to errors.

Here are my recommendations:
1. If at all possible, reject email addresses that require quoted localparts (i.e., they contain some character from the SPECIALS set). They're unlikely to actually be valid email addresses in the sense that a user-visible mailbox is actually associated with them.
2. If you can't reject technically-valid-but-unlikely email addresses, emit them as cleanly as possible: quote the localpart only if necessary.
3. Quoted pairs are technically illegal outside of comment text and quoted strings. I've not done studies on how clients interpret such occurrences, but my gut instinct and inspection of the libraries I've looked at seem to indicate that this tends to be treated the same as inside a quoted-string.
