Thursday, March 27, 2008

Mail storage

There are few things in the world that are universally agreed upon. Mail storage is not one of those: many people say that mbox is a poor format and would rather have some other form of mail storage. The suggestions I've seen include maildir (qmail-style or other), database storage, creating a false filesystem, or using IMAP and shunting the storage problem off to somebody else. Most of these have their own problems.

So why is mbox such a bad format? Supposedly, it doesn't scale. An mbox measuring a few GB causes problems simply because it is one large file. There's also the tricky problem of deleting: a one-byte change in place is cheap, and so is appending, but mid-file deletion or insertion is expensive, because everything after the change has to be rewritten.
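To make the deletion cost concrete, here is a minimal Python sketch (not how any particular mail client does it). The function names are made up for illustration, and it assumes a well-formed mbox in which every line beginning with "From " is a message separator, i.e. body lines have already been escaped as ">From ".

```python
def message_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every 'From ' separator line.

    Assumption: body lines starting with 'From ' are already escaped,
    so any line that begins with 'From ' starts a new message.
    """
    offsets = []
    pos = 0
    while pos < len(data):
        if data.startswith(b"From ", pos):
            offsets.append(pos)
        newline = data.find(b"\n", pos)
        if newline == -1:
            break
        pos = newline + 1
    return offsets


def delete_message(path: str, index: int) -> int:
    """Delete message number `index` in place.

    Returns the number of bytes that had to be moved, which is
    everything stored after the deleted message.
    """
    with open(path, "r+b") as f:
        data = f.read()
        starts = message_offsets(data)
        start = starts[index]
        end = starts[index + 1] if index + 1 < len(starts) else len(data)
        tail = data[end:]
        f.seek(start)
        f.write(tail)   # shift the rest of the file down over the gap
        f.truncate()    # cut the file off at the current position
    return len(tail)
```

Deleting the last message moves nothing, while deleting the first one moves nearly the whole file. That asymmetry is the reason mbox implementations tend to mark messages as deleted and compact the file later.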

In contrast, people sing the praises of maildir: by using one file per message, deletion is cheap. But there are hidden costs. Statting a directory to find new or deleted messages is relatively expensive. Also, modern filesystems attach metadata to each file. 1 KB of metadata is not noticeable on a 1 GB file, but 1 KB of metadata for each of 50,000 files is 50 MB, which can be noticeable.
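Both costs can be sketched in a few lines of Python. The 1 KB per-file figure is the rough number used above, not a measurement of any real filesystem, and the helper names are hypothetical. The scan assumes the usual maildir layout, in which undelivered-to-the-client messages sit in a new/ subdirectory.

```python
import os


def metadata_overhead_mb(n_files: int, per_file_kb: float = 1.0) -> float:
    """Total per-file metadata in MB, using decimal units (1 MB = 1000 KB)."""
    return n_files * per_file_kb / 1000


def scan_new(maildir: str) -> list[str]:
    """List the message files waiting in the maildir's new/ subdirectory.

    Every directory entry is visited, so the cost of each check grows
    with the number of messages sitting in the folder.
    """
    with os.scandir(os.path.join(maildir, "new")) as entries:
        return sorted(entry.name for entry in entries if entry.is_file())
```

With these assumptions, metadata_overhead_mb(50_000) gives the 50 MB quoted above.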

Using databases for mail storage? Yes, people have suggested it (bug 361087), and one person even had the gall to request it as blocking-thunderbird3. (Point of order: I would probably reject maildir as blocking, and even for pluggable storage APIs I would go only so far as to say wanted.) The basic reason cited for doing so is that "databases... are very stable and robust." Note, however, that mboxes are older, more stable, and more robust in theory and probably in practice too. And scalability? Exactly the same problems as mbox, only slightly exacerbated (there are probably going to be more indexes).

The second-to-last option (false filesystem) has problems of its own. From the comments I read, it would appear to force Mozilla to carry along another lib*** implementation that I suspect is ill-tested. I also doubt that anyone has tried (at least very hard) to port it to Windows, and I expect it has the same scalability flaws (to be fair, the argument for it is that "individual mail storage is [not] the job of the MUA anymore").

So where are we? The primary argument against mbox is that it scales poorly. Yet all of the other suggested replacements suffer the same problems, manifested in different ways. Echoing Churchill's comment on democracy, mbox is the worst mail storage format except for all the others. It actually has a lot going for it: it's simple and universal, more than the others can claim.

If you really want to fix scalability, there are two options. First, don't keep gigabytes of mail. I may accumulate 100 MB of mail in a year (half of it spam, actually), but I clean my mail out at least yearly to prune conversations that are outdated. Option 2: keep your folders small. Mailing list archives start a new archive each month by default, which tends to keep any single archive from getting large.
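As a rough illustration of option 2, the sketch below splits one large mbox into per-month files using Python's standard mailbox module. The function name, the output file naming scheme, and the "undated" bucket for messages without a parseable Date header are all assumptions made for the example.

```python
import mailbox
from email.utils import parsedate_to_datetime


def split_by_month(src_path: str, dest_prefix: str) -> dict[str, int]:
    """Copy each message into '<dest_prefix>-YYYY-MM.mbox' by its Date header.

    Returns the number of messages written per month. The source mbox
    is only read, never modified.
    """
    counts: dict[str, int] = {}
    boxes: dict[str, mailbox.mbox] = {}
    src = mailbox.mbox(src_path, create=False)
    try:
        for msg in src:
            try:
                key = parsedate_to_datetime(msg["Date"]).strftime("%Y-%m")
            except (TypeError, ValueError):
                key = "undated"  # missing or unparseable Date header
            if key not in boxes:
                boxes[key] = mailbox.mbox(f"{dest_prefix}-{key}.mbox")
            boxes[key].add(msg)
            counts[key] = counts.get(key, 0) + 1
    finally:
        for box in boxes.values():
            box.close()  # close() flushes pending writes first
        src.close()
    return counts
```

Something like split_by_month("Inbox", "Inbox-archive") would leave the original folder untouched, so the old file can be removed only after the monthly files have been checked.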

9 comments:

Anonymous said...

There is also the problem of incremental backups. Mbox is notoriously greedy in this field. And of course there is Spotlight on Mac OS X, for which a workaround was implemented by creating an individual content file for each message. All in all, it is my opinion that maildir is the way to go.

Taras said...

Joshua,
Maildir makes sense in that a filesystem is usually designed to do hierarchical things. If statting files is expensive, then make a cache. On a real filesystem, i.e. on Unix, statting 50,000 files isn't that bad ;)
XFS and other filesystems do directory readahead, and sys calls are dirt cheap.

Of course filesystems are dirt slow on Windows, so that sucks.

I'd actually vote for some sort of a storage abstraction layer. Is IMAP already done this way?

jwalden said...

"a filesystem is usually designed to do hierarchical things"

...yet mail is increasingly non-hierarchical, with tagging systems gradually overtaking folders and explicit hierarchies. Maildir locks you into an increasingly anachronistic view of your email corpus as a set of nested folders containing messages, and I'm not yet convinced bolting tags onto that system is the best way to do it.

Justin said...

A way to alleviate some of the DB-Is-Good thoughts for me, would be a more widely-understood query syntax (for modern developers) for our code that touches mbox at least.

An SQL/SQLite syntax would be great!

Even having the mbox generate a cached/temp .sqlite file would be enough for me (though probably suboptimal).

Anonymous said...

I don't think there is a problem with mbox.
Regarding the flexibility with large GB+ directories, Thunderbird should be able to do easy backups with complete directory hierarchy, including moving the old mail to the backup directory, so that the current directories don't grow.
I do this manually every year and I don't have any problems with performance in a 4GB Inbox.
The downside is that the global search must be performed over more directories (the current tree and the yearly backups).

HoĆ  said...

-----------
The suggestions I've seen include maildir (qmail-style or other), database storage, creating a false filesystem, or use IMAP and shunt the storage problem off to somebody else. Most of these have their own problems.
-------------
Using IMAP is not a solution, since the local cache still needs to be stored somewhere.

-------------
If you really want to fix scalability, there are two options. First, don't keep GB of mail. I may accumulate 100 MB of mail in a year (half of it spam, actually), but I clean my mail out at least yearly to prune conversations that are outdated.
Option 2: keep your folders small. Mailing list archives starts a new archive each month by default, which tends to keep the mailing list from getting large.
----------------

I think the Thunderbird application should not tell users what to do but should fit the users' needs.

As for the database solution, at least it fixes the problem of message deletion in mbox and brings no additional problems (as far as I know).

David Fraser said...


If you really want to fix scalability, there are two options. First, don't keep GB of mail. I may accumulate 100 MB of mail in a year (half of it spam, actually), but I clean my mail out at least yearly to prune conversations that are outdated. Option 2: keep your folders small. Mailing list archives starts a new archive each month by default, which tends to keep the mailing list from getting large.


That's not solving scalability, it's just avoiding it. I don't know which conversations may be useful to me in a few years' time - I like having all that data available all the time rather than being removed.

A lot of this just sounds like supporting mbox because it's what's currently done...

DigDug said...

I'm not that concerned about exactly how my mail is stored on my computer (although I want an easy way to back up/import important things, which TB really seems to suck at). What always confuses me about these discussions is that I really want something that's easy to search. iTunes-like searching. So I can run a search against all my accounts for anything in a folder called Inbox, or anything from my boss with the word "laser" in it. I'm sure that's doable with all the storage formats out there, but it certainly doesn't seem easy for them, as I haven't seen TB pick it up in the last 5 or so years. Is there a format (or combination of formats) that's more amenable to that? That's what I'd push for.

James Napolitano said...

David Ascher has mentioned the CouchDB project, which seems to involve a novel approach to mail storage:

(quote)
Unlike SQL databases which are designed to store and report on highly structured, interrelated data, CouchDB is designed to store and report on large amounts of semi-structured, document oriented data. CouchDB greatly simplifies the development of document oriented applications, which make up the bulk of collaborative web applications.

In an SQL database, as needs evolve the schema and storage of the existing data must be updated. This often causes problems as new needs arise that simply weren’t anticipated in the initial database designs, and makes distributed “upgrades” a problem for every host that needs to go through a schema update.

With CouchDB, no schema is enforced, so new document types with new meaning can be safely added alongside the old. The view engine, using Javascript, is designed to easily handle new document types and disparate but similar documents.
(end quote)

On a separate note, it would be nice if TB used a mail storage format that didn't require separate .msf files, i.e. it stored all the data related to a given email in the same place.