Wednesday, February 18, 2009

A database proposal

A sore point in the mailnews code is the message database code. Most of the backend ends up being a single amorphous blob with fuzzy boundaries, amassing huge couplings between a server's implementation and the database. Add to that the fact that the database documentation (like most mailnews documentation, but worse) is often poor and sometimes just plain wrong, and you get a recipe for disaster. There's also the issue, probably the most important one, that the database has grown past its original intent.

Originally, the message database was merely a cache of the information for the display. Since it was only a cache, it didn't matter much if it was blown away and reparsed from the original source. Well, there's the little matter of the ability to set an arbitrary property that isn't reflected in the mbox source. This capability, among other features, has made the message database a ticking time bomb. And, in essence, the bomb recently exploded when I attempted to make it usable from JavaScript.

So, in the mid-to-long term, the database needs serious fixing, not the incremental band-aids applied all over it. It needs a real design to fit its modern and future purposes. Naturally, the first question is what a database needs to do. The salient points follow:

The database is really multiple, distinct components.
One part of the database is a relational database: metadata for a message that is not reflected in the message itself. If an extension wants to keep information on certain message properties (like how junk-y it is), it would stick the information in this relational database. The second part of the database is a combination of the message store and cache. This part is what the database used to be: a store of information easily recoverable from the message store. Note that this part of the database needs to be at least partially coupled with the message store, more on this later.
The relational database is separate from the cache database.
The cache database exposes a unique, persistent identifier for messages.
While the cache database can, and probably will, be regenerated often, the relational database is permanent. Indeed, the cache database blowing itself away should not cause the relational database to have to do anything. At present, the cache uses ephemeral IDs as unique identifiers: IMAP UIDs (destroyed if UIDVALIDITY is sent), mbox file offsets (destroyed if the mbox changes), or NNTP article keys (can of worms there [1]). In my proposal, the cache would map these IDs to more persistent ones. Yes, it makes reconstructing the database more difficult, but it makes everyone else's lives easier.
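To make the ID mapping concrete, here is a minimal sketch of a table that hands out persistent IDs and remembers which ephemeral key currently points at each one. Everything in it is hypothetical: the class and method names are invented for illustration, and re-associating messages by their Message-ID header after the ephemeral keys are lost is only an assumption (Message-IDs are not guaranteed to be unique, so a real design would need something sturdier).

```python
import itertools


class PersistentIdMap:
    """Hypothetical map from ephemeral store keys to stable message IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_ephemeral = {}   # ephemeral key (UID, offset, ...) -> persistent ID
        self._by_message_id = {}  # Message-ID header -> persistent ID (assumption)

    def persistent_id(self, ephemeral_key, message_id):
        """Return the stable ID for a message, allocating one on first sight."""
        pid = self._by_message_id.get(message_id)
        if pid is None:
            pid = next(self._next_id)
            self._by_message_id[message_id] = pid
        # Record (or refresh) which ephemeral key currently names this message.
        self._by_ephemeral[ephemeral_key] = pid
        return pid

    def lookup(self, ephemeral_key):
        """Return the persistent ID for an ephemeral key, or None if unknown."""
        return self._by_ephemeral.get(ephemeral_key)

    def invalidate_ephemeral(self):
        """Forget ephemeral keys (say, on a UIDVALIDITY change); keep stable IDs."""
        self._by_ephemeral.clear()
```

The point of the split is that anything keyed on the persistent ID, such as the relational database, never notices when `invalidate_ephemeral` runs; only the cache has to relearn the ephemeral keys.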
The cache database may be rebuilt if the store is newer.
The cache database rebuild should be incremental.
The relational database should never be automatically rebuilt.
One of the main problems as it stands is the rebuild of the cache database. It has been, in general, assumed that rebuilding the database would never lose information, but the database has become the only store of some information. I am not certain of technical feasibility, but there is in general no need to reparse a 2GB mbox file if you compact out a few messages. Even in an IMAP UIDVALIDITY event, I expect that not all of the UIDs would be changed. Incrementalism would make the database more usable during painful rebuilds, but, naturally, it would require more careful coding.
The cache database's rebuild policy is caller-configurable.
What I mean by this is that the cache database will be accessible via one of three calls: an immediate call that will get the database, even if invalid; a call which will get the database but spawn an asynchronous rebuild event [2]; and a call that will block until the database finishes rebuilding, if necessary. An asynchronous rebuild would require the database to be thread-safe, but I expect that the future of the database already includes asynchronous calls. At the very least, it might help in some cases where we've run into thread safety issues in the past (such as import).
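The three calls could look roughly like the sketch below. It is only an illustration under assumptions: `CacheDatabase`, `get_immediate`, `get_async`, and `get_blocking` are made-up names, not existing mailnews interfaces, and the rebuild itself is passed in as an opaque callable running on a plain thread.

```python
import threading


class CacheDatabase:
    """Hypothetical cache database with a caller-chosen rebuild policy."""

    def __init__(self, rebuild):
        self._rebuild = rebuild        # callable that reparses the store
        self._valid = False
        self._lock = threading.Lock()  # guards starting the worker
        self._worker = None

    @property
    def valid(self):
        return self._valid

    def _run_rebuild(self):
        self._rebuild()
        self._valid = True

    def _start_rebuild(self):
        """Start a rebuild if one is needed; return the worker thread or None."""
        with self._lock:
            if self._valid:
                return None
            if self._worker is None or not self._worker.is_alive():
                self._worker = threading.Thread(target=self._run_rebuild)
                self._worker.start()
            return self._worker

    def get_immediate(self):
        """Return the database right away, even if it is invalid."""
        return self

    def get_async(self):
        """Return the database now and rebuild it in the background."""
        self._start_rebuild()
        return self

    def get_blocking(self):
        """Return the database only once any needed rebuild has finished."""
        worker = self._start_rebuild()
        if worker is not None:
            worker.join()
        return self
```

A display-only caller might be happy with `get_immediate()`, while a caller about to write properties would presumably want `get_blocking()`.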
The cache database has access to the message store.
There are three types of store: local-only, local caching remote, and remote-only.
The folder can only access the store through the database.
These points are probably the ones I'm least comfortable with, but I think they're necessary. In the long term, pluggable message stores and the store-specific mechanisms of the database mean that the cache database needs to have intimate access to the store. Having explicit interfaces for the message store should allow us to avoid having to subclass nsIMsgDatabase for the different storage types. Limiting access via the folder should help cut down the bloat on nsIMsgFolder. On the other hand, it would probably make the code do a lot more round-tripping, which could lead to more leaks.
The cache database is per-account, not per-folder.
A cleverly-designed per-account store could alleviate some problems. It would make working with cross-posted messages easier, and could, in principle, use less disk if you move messages between folders on the same local stores or caches. Copied messages could point to the same messages (in the spirit of filesystem hard links), so long as we don't permit local message modification.
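The hard-link idea can be sketched as a per-account store that keeps one copy of each message plus a count of the folders pointing at it. This is a toy model under assumptions: `AccountStore` and its methods are hypothetical names, messages are keyed by a persistent ID, and bodies are treated as immutable, which is exactly the "no local message modification" condition above.

```python
class AccountStore:
    """Hypothetical per-account store sharing one body across folders."""

    def __init__(self):
        self._bodies = {}    # persistent ID -> raw message bytes
        self._refcount = {}  # persistent ID -> number of folders linking to it
        self._folders = {}   # folder name -> set of persistent IDs

    def add(self, folder, pid, raw):
        """Store a new message body (if unseen) and link it into a folder."""
        if pid not in self._bodies:
            self._bodies[pid] = raw
            self._refcount[pid] = 0
        self.link(folder, pid)

    def link(self, folder, pid):
        """Make an already-stored message appear in another folder."""
        members = self._folders.setdefault(folder, set())
        if pid not in members:
            members.add(pid)
            self._refcount[pid] += 1

    def unlink(self, folder, pid):
        """Remove a message from a folder; drop the body with the last link."""
        self._folders[folder].remove(pid)
        self._refcount[pid] -= 1
        if self._refcount[pid] == 0:
            del self._bodies[pid]
            del self._refcount[pid]

    def stored_messages(self):
        """Number of distinct message bodies actually kept on disk."""
        return len(self._bodies)
```

In this model, copying a message to another folder on the same account is just a `link` call, and a cross-posted article is one body with several links.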

If I haven't missed anything, that is how I see a future database system. Obviously, implementation would not be easy; I expect it would take at least a year or even two years of concentrated work to produce something close to these ideals. There are incremental steps to this future, but they seem to me to be towering steps in many cases (for example, introducing the database to the store, or making it usable from different threads). In any case, I'm interested in hearing feedback on this proposal.

[1] In recent months, some of Giganews' binary newsgroups began to press distressingly close to the signed 32-bit limit, which raised the question of what to do. One proposal was to reset the IDs or maybe wrap around. A news client should be able to handle this case if practical to do so, IMO.

[2] I expect that this method would use the invalid database, although it could be implemented by having the various method calls block until validity. Since it's possible that a caller could use the blocking-get-database call as well, the latter approach makes significantly less sense to me.


jmdesp said...

Very interesting.

One thing that you didn't seem to cover is off-line storage. As the database is currently also used for off-line storage of IMAP and NNTP, when it's invalidated, it's *painful* for those who make use of that functionality. Maybe fewer and fewer people use it, but I recently activated NNTP off-line mode for some newsgroups, because it allows full-text searches on all the messages that are available locally.

Also, last time I checked, a good part of the mailnews code would similarly be better off thrown in the garbage can and rewritten. By this I mean mostly anything related to the handling of MIME and the MIME handlers: multiple layers of genericity that serve no purpose except making the code complex and redundant, and a specific method to access and handle every MIME header instead of a generic method (with optimisations for the most common headers, which don't need to be visible externally).

Joshua Cranmer said...

I mentioned the offline cache briefly, when I referred to a local store caching the remote. I consider it more of a store issue, but presumably it should use the persistent UIDs rather than the ephemeral ones as well.

Archaeopteryx said...

Just two hints:
1. Please create an easily extensible database format (glazou wanted to extend Places with webslices, which seems to be pretty difficult). Feeds, tasks, and notes could be such data types (Postbox already has tasks in the mailbox lists).
2. Symlinks could cause privacy problems, i.e. you get something from your coworker/boss with which you don't agree and share the message in a public folder, but also copy it to a private folder (of the same account) and then create an annotation...

Kent James said...

It would be pretty trivial to add a persistent identifier in the existing message database prior to TB3. I think we should do it. I've chatted with asuth about it; he's also on board.

Work on the existing database got less interesting once gloda entered the scene, as nobody (or at least not I) knew how much of the existing nsIMsgDatabase would be affected by it. I've now developed some pretty strong opinions about how they should interact, and it goes something like this. An underlying message store with very simple capabilities provides attribute/property pairs. Those could be from the existing mork/nsIMsgDatabase, or something like my generic nsIGdb interface. It is important that cloud-based stores be supported there as well. Then gloda is used to tie them all together, and generate the actual views that are used by the UI. So the local client code could be relatively free to use complex queries through gloda and SQLite, but the system could still support a wide variety of cloud datastores for innovative uses.

I don't share your views of how badly the existing store works. One thing that makes upgrades difficult is that the existing store actually works pretty well, so any "improvements" can easily cause performance regressions. The issues that I fight are generally related to the level above the basic datastore anyway.

So I'd rather bet my effort on attempts to expand the reach of the database beyond the message store, rather than try to redo the existing setup without adding any significant new capability.
