Wednesday, December 9, 2009

Google Wave in Thunderbird 3

While it's not in the format I would ideally want, I recently got Google Wave inside Thunderbird 3. How, you might ask. Simple: the new content tabs feature.

So, to do it, go into the Error Console, and type this line in: Components.classes['@mozilla.org/appshell/window-mediator;1'].getService(Components.interfaces.nsIWindowMediator).getMostRecentWindow("mail:3pane").document.getElementById("tabmail").openTab("contentTab", {contentPage: "https://wave.google.com/wave/?nouacheck"});. Note that Google Wave for some idiotic reason decides that Thunderbird isn't a valid UA to be using, so you have to convince it to disable the UA with the ?nouacheck. I thought browser sniffing died out years ago...

For bonus points, if you restart Thunderbird, the tab will stay open, so all you need to do is login again!

Sunday, November 29, 2009

Thunderbird extensibility

With Thunderbird 3.0rc1 now released, it is time, in my opinion, to start looking towards the goals that Thunderbird 3.next and future versions should have. One category of goals would be the extensibility of Thunderbird.

When I say "extensibility," I specifically refer to replacing backend components or adding new components to certain type categories. I do recognize that things like the power of the address book API is important for extensions, but I don't consider that "extensibility" (perhaps "API usability" is a better name?). I am also going to focus exclusively on mailnews components here.

This blog post is a set of my opinions on the state of extensibility in Thunderbird and SeaMonkey. With one exception, I am not recommending what should or shouldn't be worked on: I am just giving my biased beliefs on these topics. I also want to mention that I focus primarily on the backend, so I have a very distorted view of what can be done in extensions and how easily it can be done.

The first metric I consider is possibility, the ability of one to adequately and easily make an extension in the given field in Thunderbird 3. Since it is theoretically possible to replace nearly any component given enough effort, a key component is the ease with which one can do it. High numbers correspond to requiring a little bit of JS and a bit of UI to go along with it; low numbers generally indicate that thousands of lines of C++ reimplementing large swathes of platform code is needed.

The second metric I consider is difficulty, the amount of effort it would take to make this facet of extensibility rather close to perfect. For this metric, having a lower number is better: a very low number indicates that it would only take a few hours to produce a patch (if necessary) to fix it, while a very high number indicates that several people would have to work for a few months to produce a patch, not including the review cycles.

The final metric I consider is desirability, how useful it would be to improve this facet. Again, higher numbers are better.

A final note about numbers: they're just qualitative indicators of relative value. Don't read too much into actual values, as I more or less came up with these numbers and then tweaked them up or down as I wrote more of this blog. A low value for possibility in particular doesn't mean it's very difficult to do right now, it just means it's not as easy as some other things.

Facets of extensibility:

UI:

Possibility: 10
Difficulty: 1
Desirability: 10

Here, when I talk about UI, I specifically mean the ability to add, remove, replace, or rearrange elements of the UI, including toolbars, menus, and keyboard shortcuts, among others. To my knowledge, there aren't many problems with being able to modify the UI modulo the usability of XPCOM, Mozilla's toolkit, or other components; any problems that come up could be worked out mostly by adding an id attribute or perhaps reshuffling a portion of the dialog. Then again, as I mostly stick to the backend, I could be dead wrong about this.

Folder pane views:

Possibility: 9
Difficulty: 2
Desirability: 5

Right now my 18 accounts (19 if you include the "Smart Folders" account) are arrayed in a specific order with trees of various folders hanging off of them. Perhaps I don't want to see them like that. After all, there is rather little correlation between an account and the contents of folders (in my case at least). None of the other layouts are particularly useful for my needs as well. Why not make a layout of folders that suits my needs better? For that matter, why limit it to looking at folders at all?

In a supreme effort, Joey Minta replaced the folder pane implementation with an extensible JS version (with regressions tracked down and fixed by myriads of others). This is perhaps the only major part where SeaMonkey and Thunderbird diverge greatly: SeaMonkey has yet to port the patch, for good reason (the original produced some 33 regressions, so it's a rather risky and complex bug to port). Asides from that, there's probably not much that would have to be done to get easy-to-implement extensions working here.

Alternative database views:

Possibility: 4
Difficulty: 7
Desirability: 3

This concept loosely ties into the creation of new account types, but I think many of the more exotic account types could use other ways of selecting messages than a tree of threads. Even email lists could sometimes use better visualization; here is a graphical thread view discussed by Andrew Sutherland that struck my eyes a year ago.

If you don't want to set your sights so high, making essentially a new filtered view is a matter of mere C++ extension combined with a touch of UI; replacing the <tree> XUL tag would probably be a lot more involved. Making the latter part easier would involve removing or shifting around a fair amount of logic around.

Message display:

Possibility: 5
Difficulty: 8
Desirability: 4

Not all messages are best viewed as static rich text. For example, feed summaries in many cases don't include the full article, so it's sometimes better to think of it as the URI that the summary is associated with. I also imagine that there are some wacky message visualizations for some domain-specific messages that would be nice to play with. Though I say that, I'm at a loss to think of any of them as being particularly useful right now.

It's not terribly hard to change message display right now, as one can replace the preview pane or full window with another URI on message load—this is what RSS does right now. At the same time, though, a lot of information in libmime is rather locked away, and more primitive forms of extension (replacing consideration of parts in specific circumstances) would require changing mostly-hardcoded libmime code. Which would probably require a C++ infection first.

Protocol-level extensions:

Possibility: 1
Difficulty: 5
Desirability: 2

There are 38 official IMAP extensions; the IMAP client code implements far fewer. In addition, there is also the possibility that servers may implement their own custom XAWESOMEIMAPCOMMAND which might be ideal to implement in an extension. Of all of the facets I describe here, this is the most difficult to do in an extension right now: you'd have to rewrite the protocol objects themselves. Making this possible to do is also no cakewalk, as this is inherently connection-level, but the server/service interfaces tend to abstract away connection issues. As for desirability, I mostly consider this a "neat concept," but probably not worth spending time to sit down and actually do. There is one example which recently came up, though: the IMAP ANNOTATE command is something that seems perfect to implement in an extension.

A subcategory of this which is probably more useful is the ability to merely add in new authentication schemes. All of IMAP, NNTP, POP, and SMTP have a mechanism for generic SASL authentication schemes; HTTP authentication (for RSS and possibly others) also has a similar generic authentication measure (I don't actually know how well HTTP-authed RSS feeds work, but I'd be surprised if there were a major problem).

Import:

Possibility: 8
Difficulty: 4
Desirability: 4

Coding an importer into an extension isn't too hard: I've done it before. The hardest issue I had was the fact that import is actually multithreaded (unlike most actions, which merely run on event queues). My understanding is that this implies that you cannot write it in JS, only C++.

That said, though, I'm not sure that it's something to be concerned with. Import is not something that I think a lot of people will be doing time and time again (if they were, you've pretty much entered synchronization land). Instead, import is something that matters more for people redistributing mailnews code, such as Linux distributions. The interesting cases, then, would mostly be other system applications, which generally implies that you're writing stuff in native code anyways (although JSCtypes can access those native APIs from JS, as long as they are written in C and not C++ (so no qt) and you don't need to expand dozens of macros to get a simple method (so no gtk)).

Synchronization:

Possibility: 7
Difficulty: 3
Desirability: 9

By synchronization, I mean specifically the ability to synchronize data—be it preferences, mail, contacts, calendar, or anything else people come up with—between Thunderbird and any other accepting client, which includes other instances of Thunderbird or SeaMonkey, online web services, PDAs, and smartphones, among others. You can already do it now, as evidenced by two of the more popular extensions doing this. Indeed, until recently, comm-central had Palm synchronization code in its tree. The problem is that some APIs could stand to cater more to this use case; I would even go so far as to say that a centralized framework might be useful.

Possibility: 6
Difficulty: 6
Desirability: 8

I've never precisely tabulated how much mail I get in a day, but I reliably receive around 100 messages a day from someone called "bugzilla-daemon@mozilla.org", another 50 or so from various other mailing lists, on top of innumerable RSS feed messages (I subscribe to 60 or so feeds) and NNTP messages. Compared to some other people, that list seems small. In short, many users may be suffering from acute information overload, and how does one remedy that? With comprehensive searching and filtering.

It seems to me, therefore, that search extensibility is vital. For example, for triage purposes, I might want to search for all bugzilla threads whose bugs are currently fixed (irrespective of the bug's status when I received the email). This area has really improved recently in Thunderbird 3, but I don't think it's quite there yet. For example, searching on the current status of a bug in bugzilla is not really feasible under the gloda search engine, but it may be possible under the conventional search. The high value for the difficulty is due to the fact that I think to put this where it truly ought to be, one needs to do some massive database overhauls.

Filter backends:

Possibility: 5
Difficulty: 7
Desirability: 7

My rationale for the desirability here is similar to searching, but it's slightly lower since I think that filtering is slightly less useful than searching (then again, my modus operandi is typically a trigger-happy finger on the Delete key). Again, filters have had massive improvement in the course of Thunderbird 3 development, but I don't think they're quite where they should be.

What makes this more annoying than search in terms of ease of use and hacking is the fact that filters live in a neverland where the headers have not yet been added to the database, so some potentially useful operations (notably, thread lookup) won't work. Changing this would require one to audit carefully the entire new message process and handle the database manipulation in precise ways. Nevertheless, I think this stands out as one of the more likely things to be improved, since the APIs seem to be slowly moving this way as a result of the other new message bugs.

Creating new address book types:

Possibility: 6
Difficulty: 2
Desirability: 5

This was an idea that I was once a very big proponent of; in recent months, though, I've tempered my enthusiasm for this. Unlike accounts, I don't think there's a lot of reason to prefer a new address book type over synchronization. There is also the possibility of faking a new address book type by selecting an address book to be automatically and zealously synchronized.

At this point, it was rather possible to create a new type—provided you write it in C++, you do some internal sleights of hand (it's a MAPI address book! I swear! Or maybe "none," which (oddly enough) also seems to work…), and you remember to make sure it's an RDF resource. Actually, getting a very rudimentary, read-only SQL address book implementation to work only took a few hours. The remaining internal work, to my knowledge, is merely to finish eradicating RDF, overhaul nsDirPrefs, and stop assuming the presence of nsIAddrDatabase in so many places. I'm not including the related address book API overhauls in the difficulty metric, although it seems the hardest part there is getting people to review the changes.

Message storage:

Possibility: 2
Difficulty: 9
Desirability: 5

Message storage is the ability to specify other ways to store messages. Predominantly this is is the storage for local folders and offline storage, although theoretically one can consider the remote storage of new account types to fall under this depending on how the API is done. Simply put, right now it is really difficult to do this in an extension: for all of mailnews's heavy usage of XPCOM, this is one area where the using classes create instances via new.

This is probably the most difficult of the facets to make extensible, even more so than the ability to make new account types. Why? First you have to create a new API, and then make everyone who does offline storage or local folders use it. The next thing is to consider the optimizations made under the hood to speed up local copy/moves and work out how to optimize them for more general APIs. Finally, you have to set up the UI. Flame wars of what to make the default may be involved as well.

With all that said, I don't think it's particularly desirable. For all of its foibles, I'm not particularly convinced that the proposed replacements for mbox are appreciably better. With that said, a well-designed API for this feature would also make creating new account types easier, as it would likely cause simplification of the nsIMsgFolder interface.

Metadata storage:

Possibility: 2
Difficulty: 8
Desirability: 1

This facet of extensibility is very similar to what I call message storage, except this is concerned with the ability to replace the database storage format itself. Creating an extension to do this now is at least as hard as creating one that replaces the message storage, I think. On the other hand, the core work needed to make it easier to do seems to mostly involve creating a few new interfaces and a large automated rewrite (depending on how much you want to depart from the current mork interface). Switching between databases would probably be largely lossy, though.

Of all the ways to extend Thunderbird, I think this is one of the least interesting. Seperate from message storage (which tends to imply a need to rework metadata as well), there is little value to being able to create new storage backends for metadata. That is not to say that I think this area couldn't use an entirely new format, I just think that the work needed to make it more pluggable is less desirable. I put it in here mostly to bring up for discussion of an approach mentioned in bug 11050.

Creating new account types:

Possibility: 3
Difficulty: 9
Desirability: 10

Last, but not least, is what I would consider to be the holy grail of extensibility, to be able to create new account types. While I have several reasons for considering this so important, the best reason in my mind is that this, more than any other facet of extensibility, would be able to best showcase the ingenuity of the world at large.

Assigning possibility and difficulty numbers here is difficult, as something like this has not really been fully attempted in recent years. Making a server in the ilk of POP, movemail, or RSS (one whose mail is downloaded and stored locally) is probably rather possible and not too difficult; making one in the format of IMAP or NNTP appears to only have been attempted once, and in that case, the author just ended up writing an IMAP server to bridge the two ends. I myself have tried more than once to create a working prototype; my furthest effort got to the downloading mail stage before I lost the will to continue.

As far as I know, there is nothing absolutely stopping somebody from just implementing this, at least in C++. That said, truly exotic account types (instant messaging or social networking are the common examples here) would probably require large contortions to work right, and the frontend is littered with manual type checks for account types that would probably have to be overriden in a more generic format. One large sector of possible account types is hampered by a difficulty in turning HTML text into a DOM tree for easier manipulation (i.e., screenscraping). I am therefore assigning rather pessimistic values for possibility and difficulty, since implementation currently requires brute force to attempt and the difficulty of making this easier to use requires a prototype implementation to fully ascertain.

On top of that, though, there is the issue that many account types would also like to be able to send stuff via the compose window. That part is probably the single hardest part, as compose cannot fathom the existence of anything other than SMTP or NNTP. One would therefore have to create new APIs for extensions, ramrod support for these into compose, and probably fix up nsIMsgIdentity to not be so mail-centric. Right now, the easiest way to do it is probably to make an SMTP server that acts as a middleman. I doubt a good way to do this can be finished before 3.next, though.

With that said, of all the facets I mention, the ability to create new account types is certainly the one I will put the most effort into personally. I have a goal of producing a usable extension in this regard, implemented entirely in JS, by the first beta, and this is my highest-priority 3.next goal.

Saturday, August 22, 2009

A guide to pork, part 5

In the previous two sections of my guide, I discussed basics of functions and then went into the specifics of types and names, as represented by the Elsa AST nodes. In this section, I will cover classes and some more errata on declarations. As I have mentioned earlier, the subsections of this third step will be visited out of the order of their numbering.

What has been covered so far:
Step 1: Building and running your tool
Step 1.1: Running the patcher
Step 2: Using the patcher
Step 3: The structure of the Elsa AST
Step 3.1: Declarations and other top-level-esque fun
Step 3.1.3: Function
Step 3.1.5: Declaration
Step 3.1.6: TypeSpecifier
Step 3.1.8: Enumerator
Step 3.1.11: Declarator
Step 3.1.12: IDeclarator
Step 3.4: The AST objects that aren't classes
Step 3.4.4: CVFlags
Step 3.4.5: DeclFlags
Step 3.4.8: PQName
Step 3.4.9: SimpleTypeId
Step 3.4.11: TypeIntr

Step 3.1.9: MemberList (inside a class)

MemberList has only a single variable: ASTList<Member> list.

Step 3.1.10: Member (part of a class)

Member nodes represent a member of a class type. Like other nodes, this is mostly represented by its subclasses, MR_decl, MR_func, MR_access, MR_usingDecl, and MR_template. The node itself has two members, SourceLoc loc and SourceLoc endloc. End locations represent the location just after the semicolon or closing brace, such that the range of text matches [loc, endloc); if you recall, this is the same syntax that the patcher uses when working with ranges of location.

Each of the subclasses of Member only adds one variable, which is the type that member represents. MR_decl adds Declaration d, hence it is used for all declarations within the class, be it a variable declaration like int x, a type definition, or a function without a body. MR_func adds Function f, so it represents all functions that have their body in the class declaration (A(); is a declaration, but A() {} is a function). MR_usingDecl has as its member ND_usingDecl decl (which is covered under the section NamespaceDecl), and MR_template uses TemplateDeclaration d.

The final subclass of Member is MR_access, which has its member AccessKeyword k. This node represents all of the declarations like private:. Since the information keeping track of the access is stuffed in separate AST nodes, you may be wondering how to retrieve this information given only a specific member. The answer, naturally, lies in the auxiliary classes to the AST, something which I have avoided mentioning. Some nodes provide access to a Variable member, one of whose methods retrieves the access to the member. More information about this will be discussed when I talk about that class in detail.

A final thing to note is that Elsa will add some nodes into the AST by the time you use the visitor. These are the implicit methods dictated by the C++ standard. You can check if one of these members is implicit if the DeclFlags variable contains DF_IMPLICIT. Another flag that will also be set is the DF_MEMBER flag.

Step 3.1.7: BaseClassSpec (extending classes)

BaseClassSpec nodes represent a superclass for a class. It has three variables, bool isVirtual, AccessKeyword access, and PQName name, all of which are self-explanatory.

Step 3.4.1: AccessKeyword (controlling access)

AccessKeyword is an enumerated type with three important members. These are AK_PUBLIC, AK_PROTECTED, and AK_PRIVATE, all of which represent what you think they represent. There is also a AK_UNSPECIFIED member, but that should not be present by the time you get to the AST nodes. Naturally, there is also a const char *toString(AccessKeyword) method for converting these types into a string.

Step 3.4.7: OperatorName (operator overloading)

These nodes are informational nodes about operators. You should only find them within the PQ_operator type. The class OperatorName only has a single method const char *getOperatorName(). This is the basis for the PQ_operator name strings, so its results are as mentioned there.

The first subclass, ON_newDel, represents the memory operator overloads. It has two members, bool isNew and bool isArray. The first differentiates between operator new and operator delete, the latter differentiation between operator new and operator new[].

The second subclass, ON_operator, represents the standard operator overloads. It has only one member, OverloadableOp op, which represents the operator being overloaded. The names in the OverloadableOp enum all begin with OP_ and can be idiosyncratic. Example operators are OP_NOT, OP_BITNOT, OP_PLUSPLUS, OP_STAR, OP_AMPERSAND, OP_DIV, OP_LSHIFT, OP_ASSIGN, OP_MULTEQ, OP_GREATEREQ, OP_AND, OP_ARROW_STAR, OP_BRACKETS, and OP_PARENS. There are naturally more, but the other names should be derivable from this sample (pretty much all the idiosyncracies were added to the list); the full list is in cc_flags.h if you need to see it. There is also the standard toString(OverloadableOp) method if you are confused about a particular operator.

Note that some of the operators can be used in different ways. For example, OP_STAR both represents the multiplication operator and the pointer dereference operator. The way to differentiate between the two is via the number of arguments, although one must keep in mind that operators that are class members have one less argument. The postfix increment and decrement operators are differentiated from the prefix forms in that the postfix forms add a second int argument, which is incidentally always 0 if you don't explicitly call the function.

The final subclass is the type conversion operator, ON_conversion. This contains a single member, ASTTypeId type. The member type will have a terminal D_name in its declaration with a null name; the main purpose of the declaration under the ASTTypeId is to capture the pointers.

Step 3.4.2: ASTTypeId (a less powerful version of Declaration)

ASTTypeId is modelled after the Declaration node, but it's used in places where multiple declarations are not usable. Indeed, its most common usage is to represent a type (nominally TypeSpecifier) that can have pointers or references. It has two members, TypeSpecifier spec and Declarator decl, both of which act as their analogues in Declaration.

Step 3.1.4: MemberInit (simple constructors)

When working with constructors, the initialization of members is treated separately from the rest of the constructor. In Elsa, the nodes where this happens are the MemberInit nodes. These nodes contain a few members:

SourceLoc loc
SourceLoc endloc
PQName *name
FakeList<ArgExpression> *args

The source location and end locations have the standard meanings. The name attribute refers to the name of member being initialized. The arguments refer to the arguments of the function-like calls.

Step 3.1.14: Initializer (the last part of declarations)

The Initializer nodes represent various ways to initialize an object. The class itself has a single member, SourceLoc loc, but it has three subclasses, each representing some form of initialization.

IN_expr represents the standard forms of initialization people are used to seeing, something along the lines of int x = 3;. These nodes have a single member in addition, the Expression e member which represents the expression initializing the declaration.

IN_ctor represents the constructor-like initialization forms, such as int x(3);. These have a single member, FakeList<ArgExpression> *args, which represents the arguments within the parentheses.

IN_compound is the final form, which represents the array-like initialization for structs or arrays. For example, int x[1] = { 0 };. This has a single member as well, ASTList<Initializer> inits, which is a list of the initializers within the aggregate syntax. Some words of caution, though, is that aggregate initialization can have unexpected results: multidimensional arrays need not have nested braces, and, in C++0x (and gcc since a long time, though it gives you a warning), you can also omit the braces for nested structures. Bit-fields and statics are omitted from initializers, and, if you have less elements in the initializer than you need, the rest are "value-initialized" (i.e., the equivalent of 0). Elsa, unfortunately, does not aide you any further in deducing which element is actually initialized by any given initializer.

Step 3.1.13: ExceptionSpec (saying what you may throw)

ExceptionSpec nodes correspond to the throw declarations on function declarations. These nodes only contains one member, FakeList<ASTTypeId> *types, which represents the types that method is declared to be able to throw.

That is all I have for this part of the pork guide. Part 6 should finish up sections 3.1 and 3.4, so I look track to have part 7 start discussion statements and expressions. I will probably defer the auxilliary API until around part 9 or so, as I really need to play with it some more first.

Monday, August 17, 2009

A guide to pork, part 4

The last portion of the guide started covering declarations. This week, I will be covering a lot more about declarations. In particular, types and names are covered in a lot more detail. I had intended to talk about classes in more detail as well, but the post was getting long enough as it was, so I'll save discussion for a fifth part.

What has been covered so far:
Step 1: Building and running your tool
Step 1.1: Running the patcher
Step 2: Using the patcher
Step 3: The structure of the Elsa AST
Step 3.1: Declarations and other top-level-esque fun
Step 3.1.3: Function
Step 3.1.6: TypeSpecifier
Step 3.1.11: Declarator
Step 3.1.12: IDeclarator
Step 3.4: The AST objects that aren't classes
Step 3.4.5: DeclFlags

Aside 1: An introduction to porky (continued)

It seems that Chris Jones finally blogged about porky. If you're interested, go read about it.

Aside 2: Pork Web

In the course of writing this guide, I got the idea of writing a tool to display the Elsa AST nodes without having to constantly fidget around with dumpAST. The result is Pork Web, which is also a good expository of how much a little CSS will get you.

Step 3.1.6: TypeSpecifier (continued)

In the last article, I mentioned TypeSpecifier but elided details of its subclasses, who hold the interesting information, because I held a misunderstanding of key pieces of information.

For projects that are sufficiently large to be considered good candidates for automated rewriting, chances are that basic types like int are going to be rather rare, in favor of typedefs that give more precise storage sizes (such as mozilla's PRInt32). The parsing of the AST in Elsa and pork happens at a different stage from the type verification, which means that typedefs have an impact on the structure of nodes. That is not to say that you can't get type information; it just means that you want to use Elsa's type information (embodied in Variable) for more accuracy here. Naturally, #define has no impact on type information, because we are dealing with preprocessed files.

Which of the subclasses of TypeSpecifier is used depends on the format. If you are using a standard type keyword like int, you get the TS_simple flavor, which I covered last week. Structures parsed as classes in C++ (i.e., class, struct, and union) are all TS_classSpec nodes; enums form TS_enumSpec nodes. Class nodes, if you do not provide an actual definition, are classified as TS_elaborated nodes. If all you have is a simple name, then the node is a TS_name node, regardless if that type is a class, enum, or other such type. Names will never be null; for anonymous constructs like enum {a} x;, a unique string beginning with __ will be used instead.

TS_name has two variables: a PQName *name, and a bool typenameUsed. Both of these parameters are self-explanatory. For the curious, the latter comes about via an elabarator of typename, such as in the below:
template<class T> class Y { T::A a; };

TS_elaborated again has two variables, the same PQName *name variable, as well as a TypeIntr keyword variable. The keyword variable is an explanation of which keyword was used as the elaboration.

TS_enumSpec has again two variables, this time a StringRef /*(const char *)*/ name, as well as a FakeList<Enumerator> elts, which contains the elements in the enumeration.

TS_classSpec is the most complex of the subclasses, as it represents the definition of a class. It contains the same name and keyword variables as TS_elaborated, but it also has the base classes in the form of a FakeList<BaseClassSpec> *bases and its members in a MemberList *members.

Step 3.4.9: SimpleTypeId (Primitives, if you come from Java)

The SimpleTypeId enum represents the primitive types defined by C++, namely char, bool, int, long, long long, short, wchar_t, float, double, and void, as well as their unsigned and signed counterparts (if they exist). The name of each of these follows the general scheme ST_UNSIGNED_INT, although short and long are ST_LONG_INT and ST_SHORT_INT, respectively (but not long long!).

That's not all, though. For simplicity, some places have fake type codes. The most common of these will be ST_ELLIPSIS, the varargs portion of functions; there is also ST_CDTOR, the return type for constructors and destructors. The source code also mentions GNU or C99 support for complex numbers, but I have not found the magic needed to get those to work.

Step 3.4.4: CVFlags (CV-qualified IDs)

Whenever something can be const or volatile, there is a CVFlags enum. It can either be CV_NONE (no qualifiers), CV_CONST, CV_VOLATILE, or both of the latter. There also exists a method sm::string toString(CVFlags cv) that will print a string representation of such a variable. Need I say more?

Step 3.1.5: Declaration (The outer part of declarations)

Any time a variable is declared, one of the wrappers is Declaration (which may itself be found in various places). This has just three members, a DeclFlags dflags variable that represents the flags on the declaration, a TypeSpecifier *spec that is the type of the declaration, and the FakeList<Declarator> *decllist that contains the rest of the declaration. All of these have been covered in more detail earlier.

Step 3.4.11: TypeIntr (Differentiation between classes and structures)

TypeIntr is a little enum that has four members: TI_STRUCT, TI_CLASS, TI_UNION, and TI_ENUM. The descriptions of them are, I think, straightfoward. There is also a top-level method to convert to a string representation, char const *toString(TypeIntr tr), which will do what you think it does.

Step 3.1.8: Enumerator (The members of enumerations)

Within the definition of an enum is a FakeList of Enumerator nodes. These have a standard location and a StringRef name. The values can be represented in the potentially null Expression *expr variable, or the actual value in int enumValue.

Step 3.4.8: PQName (Everybody's name)

In declarations and other places, in lieu of a string representing name, you have the AST node PQName. The name stands for "possibly qualified." It comes about because there is the necessity of finding the different components of a name.

This class has four subclasses, PQ_qualifier, PQ_name, PQ_operator, and PQ_template. In addition to these, it has a plethora of functions intended to help you with the task of printing these names, as well as overloaded operators to aid in output (to std::ostream and stringBuilder). They are:

SourceLoc loc
bool hasQualifiers()
sm::string qualifierString()
sm::string toString()
sm::string toString_noTemplArgs()
StringRef /* const char * */ getName()
sm::string toComponentString()
PQName *getUnqualifiedName() /* (And a const version) */
bool templateUsed()

PQ_qualifier represents a namespace or similar component to a qualified name. This is handled in a right-associative manner, such that std::tr1::shared_ptr would be the qualifier std, which qualifies tr1, which qualifies shared_ptr. This class has three variables: StringRef qualifier (the name to the left of the double-colon), TemplateArgument *templArgs (which represents the template arguments for templated class qualifiers), and PQName *rest (the right of the double-colon).

PQ_operator represents a PQName that is actually an operator overload. It has just two variables: OperatorName *o, the operator in question, and StringRef fakeName, a string representation of the operator. The latter is essentially a space-less name of the function (except that operater new and friends are represented as such, as well as conversion operators having the poor name of conversion-operator).

PQ_template represents a templated argument name. It again has two variables: the StringRef name of the base type and the TemplateArgument *templArgs that contains the arguments to the templatization. Note that if you are getting a member of a templated class, the name tree will have the PQ_qualifier node instead.

PQ_name is the other of PQName (note the minor spelling difference). This has a single variable StringRef name which is the name. This is by far the most common name node, since everything that is not an instantiated template or an operator name will have this in the name somewhere.

For standard names, all of the various string output methods save qualifierString (which returns the empty string) will return the same thing, the name variable from PQ_name or the fakeName from PQ_operator. The differences arise when you have templates or qualified names.

If you have a qualified name, the methods change rather predictably. qualifierString returns the entire qualification string before the tail node (e.g., std::auto_ptr<T> becomes std::). The toString method and toString_noTemplArgs return the fully qualified names (optionally without template instantation). toComponentString becomes idempotent to the qualifier variable. getName will be essentially identical to getUnqualifedName()->getName(): it returns the name of the right most declarator.

Templated names also modify stuff predictably. toString_noTemplArgs and getName return the base name, without template arguments; toString and toComponentString return the name with template arguments.

The interesting stuff happens when you have a templated qualifier (e.g., std::set<int>::iterator). In that case, the toString_noTemplArgs will not strip the template args from the qualifier.

hasQualifiers() is identical to the isPQ_qualifier() method. templateUsed() is true if the qualifier or template used the template keyword. This is a feature that would be used for disambiguation purposes, such as this example (taken from the ISO C++ spec):


struct X {
  template<std::size_t> X* alloc();
  template<std::size_t> static X* adjust();
};
template<class T> void f(T* p) {
  // T* p1 = p->alloc<200>(); (p->alloc)<200>() -- syntax error
  T* p2 = p->template alloc<200>();
}

The last method to talk about is getUnqualifiedName. This method simply returns the PQName at the end of the name.

With all of the methods discussed, the most important question you're probably wondering about is the easiest way to get the name of a PQName. If you're trying to find a method or class name, getName is your safest option. If you need to know the type arguments as well (say you're looking for particular instantations), getUnqualifiedName()->toString() is a better option.

If you're looking at class members, you can probably use toString_templNoArgs successfully (when you're looking for a particular function), unless you're interested in the qualifierString() (when you're looking for any function of the class). For cases where the namespace information is necessary, you probably want to investigate the name with the parallel type APIs of elsa, unless you want to maintain your own state for using declarations and other shenanigans that make the original type non-obvious.

Unfortunately, while I earlier said that I intended to reference classes in this part of pork, it looks like I have not the time this week to cover them as well. At this point, it seems likely that part 5 will cover classes and part 6 will cover templates and errata. I cannot say what part 7 will touch: it will either be more errata or a start on the expression and statement code.

Tuesday, August 4, 2009

More jshydra news

Taking a break from my ongoing series on pork, it's now time for an overdue update on jshydra.

First off, I have pushed Andrew Sutherland's changes (most of them, anyways) to my repo. These consist of the ability to push arguments, proper JS constructor handling (which I had locally but didn't commit for some reason I long forgot), and some minor comment handling fixes. I also never mentioned my JS inheritance divination work from way back in May. I also fixed a bug and finally got around to making TOK_* variables visible in JS, like the JSOP_* variables.

David Dahl has decided to use jshydra to find dead code in JS. That bug also has some discussion on how to run jshydra, as well as some work-in-progress files to grab JS information from the mozilla build system in a jshydra-friendly format.

Andrew Sutherland also used jshydra to generate documentation for TB. His blog posting does it more justice than I ever could.

Last night, I finally got around to creating an analogue of David Humphrey's Dehydra Web for JSHydra, predictably called JSHydra Web. As a note of caution, please don't upload your massive, 100+KB JS file for perusing: you'll find the output hard to read, and I don't want to overload the server.

As always, for more information on jshydra, you can either contact me by email (found on bugzilla), by IRC (most topical place of discussion is #static), or via the mozilla.dev.static-analysis newsgroup.

Saturday, August 1, 2009

A guide to pork, part 3

In the previous installment, I covered details on the patcher API and the basics of the Elsa AST API. If I were to write this as a reference manual, this is the point at which I would be spewing out a lot of verbose information on the C++ AST which would be hard to use for an introductory patch. One goal of this series of articles is to produce something close to a reference manual, so the parts will be numbered in that order. Instead, I will present the information in an order more likely to facilitate understanding.

In summary:
Step 1: Building and running your tool
Step 1.1: Running the patcher
Step 2: Using the patcher
Step 3: The structure of the Elsa AST

Aside 1: An introduction to porky

Having had more time to evaluate the new porky wrappers since my announcement of its commit last week, let me introduce you to it briefly. The tool takes in a list of rewrite specifications such as type PRCondVar** => mozilla::CondVar**, which will automatically change everything of the first type to the second type. It can also transform method calls into other method calls or even new/delete expressions. I would go into more detail, but that's for Chris Jones to discuss. If he ever blogs about it.

Step 3 (continued):

The following is an image representation of the C++ AST, excluding specific Expression and Statement subclasses:

To examine AST examples in detail, let us consider the example of a typical hello world program in C++:


#include <iostream>
 
int main() {
  std::cout << "Hello World!" << std::endl;
  return 0;
}

After being preprocessed, this small program becomes 19,153 lines of C++ goodness, thanks to the many header files being recursively included. Although we only define a single function, g++ defines for us namespaces (with GCC-specific attributes), templates, template specialization, typedefs, structs, classes, unions, function declarations, and half of the other C++ language features.

Let us look at each of these in turn.

Step 3.1.3: Function (Function definitions)

The Function AST node represents the definition of a function. It contains these variables:

DeclFlags dflags
TypeSpecifier retspec
Declarator nameAndParams
S_compound /* (Statement) */ body
FakeList<MemberInit> inits
FakeList<Handler> handlers

Unlike most AST nodes, the Function node does not have a SourceLoc loc member; instead, it has a getLoc() method which returns the location of the nameAndParams member, which would represent the location of the name. In our example, this would be line 3, column 5 (the beginning of `main').

Most of the members are self-explanatory: they represent, in order, the declaration flags (such as static or inline), the type of the return, the name and parameters aglomeration, and the statements of the body. The last two will be less common and NULL in the example I have here. The inits member represents the intializers of a constructor, while the handlers represents try-catch blocks that are scoped for the entire method. An example of such a block:


class Base {
  Base(const char *name)
    try : data(name) { } catch (...) { std::cerr << "Oops!" << std::endl; }
  const std::string data;
};

Strictly speaking, the enclosing try-catch is not limited to constructors, although I doubt it would be used outside of them except for stress-testing compilers.

There is little that you will want to do with these objects that does not fall under the use of other objects. Probably the most burning question you have is how to get the name of the function. These are the shortest ways: func->nameAndParams->var->fullyQualifiedName0().c_str() (going to a const char *) and func->nameAndParams->decl->getDeclaratorId()->toString() (going to an sm::string). The latter form will probably be more helpful if you are looking for specific methods, since it overrides the == operator.

Step 3.4.5: DeclFlags (declaration modifiers)

DeclFlags is an enum that specifiers certain flags about declarations. Each of the values is in the form DF_, always uppercase. The standard declarations--auto, register, static, extern, mutable, inline, virtual, explicit, friend, and typedef--have values in this manner. Other flags in the enum have uses for the Variable constructs, which I will not go into detail now.

Step 3.1.6: TypeSpecifier (first half of declarations)

A TypeSpecifier node represents a declaration of type, such as char. If you recall your C/C++ syntax, the declaration char *x, y; declares only one variable as a pointer--the * is matched with the variable name and not the type itself; in the AST therefore, TypeSpecifier does not receive these pointers (there come about in Declarator nodes).

By themselves, these nodes only have a CVFlags cv variable (representing the const- and volatile-ness of the type), as well as a SourceLoc loc location member. Instead, it has five subclasses with more specific attributes.

TS_name, TS_elaborated, TS_classSpec, and TS_enumSpec will be discussed in a future part.

TS_simple nodes represent the built-in simple types, like char. It has a single additional member, a SimpleTypeId id member.

Step 3.1.11: Declarator (the other half of declarations)

A Declarator node represents the non-type part of a declaration. It contains these variables and methods:

IDeclarator *decl
Initializer *init
PQName *getDeclaratorId() /*(And a const version)*/

Step 3.1.12: IDeclarator (the real declaration part)

The IDeclarator node does the heavy work in terms of annotating declarations. In some corner cases, the AST for a declarator can get very deep; the parallel type structure makes working with declarations much easier.

With the exception of the D_name and D_bitfield subclasses, all subclasses contain at least an IDeclarator *base member. In addition, IDeclarator holds these variables and methods:

SourceLoc loc
PQName *getDeclaratorId() /*(And a const version)*/
IDeclarator *getBase() /*(And a const version)*/
IDeclarator *skipGroups()
bool bottomIsDfunc()
D_func getD_func()

The skipGroups method skips through any excess groupings (i.e., parentheses layers). getBase returns either the base of the declaration or NULL if it is a leaf. getD_func returns the bottom-most D_func node, while bottomIsDfunc will tell you if the declaration is a function (but not a function pointer). getDeclaratorId obviously returns the name of the object at the very base of the declaration tree.

The first subclass is D_name, which is the typical leaf of the declaration. It has a single PQName *name attribute, which returns the name of the variable or function being declared. The other leaf class is D_bitfield, which has both the name attribute as well as an Expression *bits representing the number of bits in the declaration.

D_pointer and D_reference represent a pointer or reference indirection, respectively; D_pointer additionally has a CVFlags cv variable representing the qualifications of its pointer type.

D_ptrToMember is similar to D_pointer, but it adds another PQName *nestedName attribute to represent the construct whose member the pointer is pointing to. For those who don't recognize this feature, it can be demonstrated in this C++ snippet:


struct Foo { int Bar() { return 0; } };
typedef int (Foo::*ptrToMember)();
// The declarator tree (following base):
// D_func->D_grouping->D_ptrToMember->D_name
ptrToMember p = &Foo::Bar;

D_array represents an array declaration. In addition to its base, it also has the possibly null Expression *size member to represent the size of the array. Multidimensional arrays have these members arranged from the outside in.

D_grouping is a dummy node used mostly for the purposes of AST disambiguation during the parsing phase (it represents the use of parentheses in declarations). The skipGroups function can be used to pass these nodes.

D_func is necessarily the most complex declarator. Its base is typically the name of the function, although function pointers can have nested declarators. It has a FakeList<ASTTypeId> *params member for the parameters, CVFlags cv member for the const member functions, and ExceptionSpec *exnSpec for the exception specifiers.

Predicting declarator trees is not difficult. In general, you can apply standard rules to find the declarator: each * and & at the beginning creates the respective declarator indirection node; a [] or () at the end creates the array and function nodes, respectively; surrounding with paranthesis yields a grouping node. Pointers to members are created when the syntax is used (look for the ::*), and the choice between bitfield and name as the leaf comes from obvious decisions. One also needs to remember that the structures on the left are parsed before the ones on the right, unless overriden by parentheses, so the obscene int (*(*asdf)())[0] is yielded as array, grouping, pointer, function, grouping, pointer, name. If you're wondering, it's a zero-sized array named "asdf" that contains pointers to functions that return poiinters to integers.

There is much more to talk about in the world of ASTs. This will get you started on function bodies and declarations; the next part of the guide will cover more in-depth knowledge on declarations and begin to introduce classes.

Wednesday, July 22, 2009

A guide to pork, part 2

Last time, I covered the very basics of using pork. In this portion of the guide, I will cover enough to get you to be able to write a small patch.

Since the time I wrote the first part of the guide, Chris Jones committed some tool wrappers known collectively as porky, which may necessitate updates to first steps.

In summary:
Step 1: Building and running your tool
Step 1.1: Running the patcher

Step 2: Using the patcher

The patcher works internally (more or less) by keeping a list of ranges and their replacement text, which it eventually uses to build up hunks that it then spits out to an output stream. The public API it provides (as of the current tip, in any case) comes in two sections: some file utility functions and text replacement functions.

Locations can be represented by one of three different types. The first is SourceLoc, which is a bit-packed integer that the elsa AST nodes give you. Then there is CPPSourceLoc, which is an only slightly less manageable location format. The final form is UnboxedLoc, which is the easiest one to work with.

As I mentioned earlier, the patcher actually works with pairs of these objects. PairLoc and UnboxedPairLoc are pairs of CPPSourceLoc and UnboxedLoc, respectively. The two are constructed in a pretty intuitive manner (although note that as UnboxedLocs do not store the file, you need to pass that into its pair type). Note that ranges include the left but not the right endpoint.

The class Patcher itself contains two methods for patching stuff: printPatch, which replaces text, and insertBefore, which inserts the text before a location. If you want to delete text, the answer is to replace a range with the empty string.

While this is nice, the patcher does suffer from a few flaws. The biggest of these that I've found is really a flaw in elsa: not all nodes have source and end locations (only statements and expressions), requiring me to roll my own search functions. Fortunately, the file API of patcher helps here.

The other big flaw is the difficulty of coping with visually important but semantically meaningless clues, namely comments and whitespace. If you naïvely delete text, you may end up with comments whose referents no longer exist or blocks of whitespace where code once was. Inserted text may violate local code conventions. I have not yet expended the effort yet to get this to work; you will either have to do this yourself, bug taras to do it, or possibly both.

Now, if you want to see some code in action:


// Here, func is a pointer to an elsa AST expression node
// And type a string representing its replacement
// patcher is of course a Patcher object.
patcher.printPatch(type, PairLoc(func->loc, func->endloc));

// Elsewhere
UnboxedPairLoc findAndMakePair(Patcher p, const SourceLoc &loc,
    char toFind) {
  int lLine, lCol;
  StringRef file;
  sourceLocManager->decodeLineCol(loc, file, lLine, lCol);
  int lineNo = lLine, col;
  do {
    std::string line = patcher.getLine(lineNo++, file);
    col = line.find(toFind);
  } while (col == -1);

  return UnboxedPairLoc(file, UnboxedLoc(lLine, lCol),
    UnboxedLoc(lineNo - 1, col + 2));
}

Step 3: The structure of the Elsa AST

The core of pork is the ability to parse AST nodes. In general, these fall under three categories: top-level declarations (possibly within classes or namespaces), statement and expression nodes, and utility nodes.

The basic structure of an AST node class is like this:


// A typical node type
class TypeSpecifier {
public:
  // Almost all nodes have these
  // Those that don't wouldn't make sense
  SourceLoc loc;

  // These methods are for nodes with subtypes
  // if returns null if it isn't the correct type; as throws
  char const *kindName() const;
  TS_name const *ifTS_nameC() const;
  TS_name *ifTS_name();
  TS_name const *asTS_nameC() const;
  TS_name *asTS_name();
  bool isTS_name() const;

  // There's another parameter that you'll never use
  void debugPrint(std::ostream &, int indent);
  void traverse(ASTVistor &vis);
};
class TS_name: public TypeSpecifier {
public:
  // Typically has some more data nodes
  PQName *name;
  bool typenameUsed;
};

To use these nodes, pork follows a typical visitor pattern. The class ASTVisitor will visit all of the node types; ExpressionVisitor subtypes have individual methods for visiting subtypes of statements or expressions. You can choose to look at nodes in either a pre or postorder traversal. A previsit traversal function is in the form:
virtual bool visitTypeSpecifier(TypeSpecifier *);
(where the return is whether or not to dig down deeper), and a postvisit in the form:
virtual void postvisitTypeSpecifier(TypeSpecifier *);.

Hopefully, this is enough to get you started on being able to use pork. In my next part, I will cover the AST nodes in more detail.

Monday, July 20, 2009

A guide to pork, part 1

As one of the first people to have actually used pork (apparently the third, after taras and cjones), I feel obliged to give a guide as to how to write an automatic patch generator, so as best to prevent people from asking the same question a fourth time. This also contains some ranting about some of my annoyances with pork (sm::string, I'm looking at you). So, without further ado, I present Part 1: It works!

A brief introduction

Pork essentially consists of three main areas of API (enumerated in order of my discovery): the patcher API, the C++ AST structure direct from the parser, and the annotated APIs that make finding information more than a bit easier. There is something which constitutes a sort of fourth API, the utilities that partially replicate functionality in the STL.

My original interest in pork came from an idea to rewrite libmime, which is roughly a basic C++ implementation in C, into the equivalent C++ code. Such a patch would be on the upper end of difficulty for a normal shell, python, or awk script to rewrite: I need to combine classes, rewrite function prototypes, rename variables, and refactor globs of code like
return ((MimeObjectClass*)&MIME_SUPERCLASS)->initialize(object);
into
return MimeLeaf::initialize();

Step 1: Building and running your tool

The first step is to build pork. Taras's guide will likely be more up-to-date than any instructions I give. Now you have an installation of pork. After that, you can plug in your own tool into the structure. I've personally handled this by making a tools/ subdirectory and making a very neat Makefile that automatically adds files to be compiled into the tools themselves.

Your tool will eventually be invoked tool <args> filename if you are using the pork-barrel script. All that pork-barrel does is to run the programs one at a time and to merge the outputted patch in the end; you don't need to use it (and I recommend you don't) as you start your tool. The files it runs on are preprocessed files, generally with the extension i or ii. Invoking gcc with -save-temps is a nice way of generating these files. You don't need to use mcpp if you're not overly concerned about stuff lurking in macros.

Step 1.1: Running the patcher

Once your tool processes its arguments, it will eventually be reading the C++ files and patching them. Here is some sample code to do that, which I provide without comment (it's just boilerplate):

#include "piglet.h"
#include "expr_visitor.h"
#include "patcher.h"

class MainVisitor: public ExpressionVisitor {
public:
  MainVisitor(Patcher &p): patcher(p) {}
private:
  Patcher &patcher;
};

int main(int argc, char **argv) {
  PigletParser parser;
  Patcher p;
  MainVisitor visitor(p);
  for (int i = 1; i < argc; i++) {
    TranslationUnit *unit = parser.getASTNoExc(argv[i]);
    unit->traverse(visitor);
  }
  return 0;
}

The necessary APIs for the utilities will eventually be covered in more detail. Unfortunately, it's late, and you now have a working, if idempotent, pork utility. Next time, I'll discuss the basics of Patcher and ExpressionVisitor.

Wednesday, May 6, 2009

jshydra news

Since I happen to have some downtime thanks to the short space between college and a summer job, and some spare time with four wisdom teeth newly missing from my mouth, I have returned to work on jshydra. I am in the midst of simplifying the build structure for new or non-Mozilla hackers, and am also patching up one of the main scripts I have, the documentation-association script.

On Monday, May 11, 2008, at 1:00 PM EDT (10:00 AM PDT, or 1700 UTC), I will be holding in the #static IRC room for the benefit of users a "learn jshydra" day. You may of course ask me questions any time I'm actually on IRC (nick: jcranmer or derivatives thereof).

I hope to finish up the build structure by then and get a wiki article on the topic started as well. A feature request list is already starting, including the ability to take JS out from an HTML file. If you have others, communicate it to me somehow, and I'll stick it somewhere where I can remember it.

Saturday, May 2, 2009

Is this faster?

Last Wednesday morning (in the EDT timezone that I go to school in), I took my final in Probability & Statistics. It was a moment that I didn't think too much about until this morning, when, in one of the newsgroups I frequent, there was a discussion as to how much faster one option was compared to another. This post come up:

A good idea would be to measure each iteration separately and then discard outliers by e.g. discarding those that exceed the abs diff between the mean and the stddev.

I leave it as an exercise to the reader to figure out why that's not a good idea. In any case, the last unit of the class dealt with the fun part of statistics, which is to actually evaluate whether observed data is statistically significant. The actual math involved isn't too bad, assuming you have someone spit out a cumulative distribution function for the t-distribution for you (here is Boost's code), but it's a bit convoluted to write here, so read Wikipedia's page. The correct tests are actually the two-sampled t-tests.

But I got to thinking. One thing I've wanted to see for a while is a profiling extension, one that would run some JS code snippet multiple times and produce various reports on profiling runs. Wouldn't it be nice if such an extension could compare two runs and determine if they was a statistically significant difference between them?

Wednesday, April 1, 2009

Bugday: help us unconfirm new bugs!

Yesterday, Reed helpfully enabled the NEW -> UNCO transition in Mozilla's bugzilla installation, which will enable us to reverse some mistaken decisions in the past. Looking at the current list of bugs, we have way too many NEW bugs to be healthy (Five thousand three hundred eighty-five, to be precise, at this time of writing).

So please help us triage our components by finding NEW bugs that do not have clear steps to reproduce and unconfirming them. While you're at it, you might want to try to do other triage on the bugs.

Wednesday, February 18, 2009

A database proposal

A sore point in the mailnews code is the message database code. Most of the backend ends up being a single amorphous blob with fuzzy boundaries amassing huge couplings between a server's implementation and the database. Add into account the fact that the database documentation (like most of mailnews, but worse) is often either poorly documented or sometimes just plain wrong, and you get a recipe for disaster. There's also the issue, probably the most important one, that the database has grown past its original intent.

Originally, the message was merely a cache of the information for the display. Since it was only a cache, it doesn't matter that much if it is blown away and reparsed from the original source. Well, there's a little matter of the ability to set an arbitrary property that isn't reflected in the mbox source. This capability, among other features, has made the message database a ticking time bomb. And, in essence, the bomb recently exploded when I attempted to make it usable from JavaScript.

So, in the mid-to-long-term, the database needs serious fixing, not the incremental band-aids applied all over it. It needs a real design to fit its modern and future purposes. Naturally, the first question is what does a database need to do. Following are salient points:

The database is really multiple, distinct components.
One part of the database is a relational database: metadata for a message that is not reflected in the message itself. If an extension wants to keep information on certain message properties (like how junk-y it is), it would stick the information in this relational database. The second part of the database is a combination of the message store and cache. This part is what the database used to be: a store of information easily recoverable from the message store. Note that this part of the database needs to be at least partially coupled with the message store, more on this later.
The relational database is separate from the cache database.
The cache database exposes a unique, persistent identifier for messages.
While the cache database can, and probably will, be regenerated often, the relational database is permanent. Indeed, the cache database blowing itself away should not cause the relational database to have to do anything. At present, the cache uses ephemeral IDs as unique identifers: IMAP UIDs (destroyed if UIDVALIDITY is sent), mbox file offsets (destroyed if the mbox changes), or NNTP article keys (can of worms there [1]). In my proposal, the cache would map these IDs to more persistent ones. Yes, it makes reconstructing the database more difficult, but it makes everyone else's lives easier.
The cache database may be rebuilt if the store is newer.
The cache database rebuild should be incremental.
The relational database should not be ever automatically rebuilt.
One of the main problems as it stands is the rebuild of the cache database. It has been, in general, assumed that rebuilding the database would never lose information, but the database has become the only store of some information. I am not certain of technical feasibility, but there is in general no need to reparse a 2GB mbox file if you compact out a few messages. Even in an IMAP UIDVALIDITY event, I expect that not all of the UIDs would be changed. Incrementalism would make the database more usable during painful rebuilds, but, naturally, it would require more careful coding.
The cache database's rebuild policy is caller-configurable.
What I mean about this is that the cache database will be accessible via one of three calls: an immediate call that will get the database, even if invalid; a call which will get the database but spawn an asynchronous rebuild event [2]; and a call that will block until the database finishes rebuilding, if necessary. The implications of having asynchronous rebuild would require the database to be thread-safe, but I expect that the future of the database already includes asynchronous calls. At the very least, it might help in some cases where we've run into thread safety issues in the past (such as import).
The cache database has access to the message store.
There are three types of store: local-only, local caching remote, and remote-only.
The folder can only access the store through the database.
These points are probably the ones I'm least comfortable with, but I think it's necessary. In the long-term, pluggable message stores and the store-specific mechanisms of database means that the cache database needs to have intimate access with the store. Having explicit interfaces for the message store should allow us to avoid having to subclass nsIMsgDatabase for the different storage types. Limiting access via the folder should help cut down the bloat on nsIMsgFolder. On the other hand, it would probably make the code do a lot more round-tripping, which could lead to more leaks.
The cache database is per-account, not per-folder.
A cleverly-designed per-account store could alleviate some problems. It would make working with cross-posted messages easier, and could, in principle, use less disk if you move messages between folders on the same local stores or caches. Copied messages could point to the same messages (in the spirit of filesystem hard links), so long as we don't permit local message modification.

If I haven't missed anything, that is how I see a future database system. Obviously, implementation would not be easy; I expect it would take at least a year or even two years of concentrated work to produce something close to these ideals. There are incremental steps to this future, but they seem to me to be towering steps at many cases (for example, introducing the database to the store, or making it usable from different threads). In any case, I'm interested in hearing feedback on this proposal.

[1]In recent months, some of Giganews' binary newsgroups had begun to press distressingly close to the signed-32 bit limit, which raised the question of what to do. One proposal would have been to reset the ids or maybe wrap around. A news client should be able to handle this case if practical to do so, IMO.

[2]I expect that this method would use the invalid database, although it could be implemented by having the various method calls block until validity. Since it's possible that a caller could use the blocking-get-database call as well, this approach makes significantly less sense to me.

Sunday, January 25, 2009

JSHydra

Over the past week or so, in between my myriads of projects, I managed to find time to revive a partially-completed tool called jshydra (properly capitalized as "JSHydra"). JSHydra, in both name and code, is derived from Dehydra and Treehydra: it is a static analyzer for JavaScript. I first decided that such a tool was needed when I was in the middle of a large refactoring of address book code.

The source code for jshydra can be found at my Mozilla user hg repo. Presently, it requires some hackery and file modifications to merely build the system. I use SpiderMonkey internal APIs (the parsing APIs, to be exact), so that's where most of the hackery comes into play.

For the time being, I'm probably going to put aside doing more work on jshydra, as I have more pressing work on my plate at this time. If you have any questions, feel free to contact me via IRC (handle: jcranmer (I should be in #mmgc)) or via my email address as listed in bugzilla.

Tuesday, January 20, 2009

My other life

I have just opened up a new blog covering those topics that are not oriented towards my work with Mozilla or other similar technically-oriented projects.

Tuesday, January 13, 2009

Eight things you didn't know about me

Apparently it's my turn for these meme, according to taras. Oh yes, and KaiRo too. The rules:

  1. Link back to your original tagger and list the rules in your post.
  2. Share seven facts about yourself.
  3. Tag some (seven?) people by leaving names and links to their blogs.
  4. Let them know they’ve been tagged.

So let's get this show on the road! Eight things you didn't know about me (why eight? Because I like to go one step further):

  1. When I was tested around first grade, my auditory processing skills were rated as "retarded." Possibly related to this fact, I was apparently deaf or near-deaf at the same time, and I'm still rather poor with auditory processing to this day.
  2. I nearly flunked 7th grade biology but got the second-highest grade in my 9th grade biology class, I believe averaging over 100%. Indeed, I probably still recall large hunks of biology although biology is my least favorite core science and I wish I could forget it.
  3. Despite having a high school GPA of 3.99 (where the best grade is a 4.0, and AP credit adds 0.5 to the final grade), the first time in my entire life that I achieved a perfect 4.0 was my first semester of college. My grades have often been littered with scores that just make the next grade level, such as a 93.52 in one year of French, or a 89.50 in Psychology.
  4. Nearly all of my programming knowledge I taught myself. The only languages I learned as part of a class were (in order) Python, FORTRAN, and Smalltalk. All other languages (roughly speaking, Java, C/C++, x86 assembly, PHP, bash scripting, awk, sed, etc.). I taught myself before any courses that used them taught it to us. One day, for Senior Switch Day (a high school tradition), I taught a Computer Architecture lecture covering, in order, how vtables are implemented in gcc, how to use setjmp and longjmp for exception handling, and how to write a stack walker in C. And yes, I do know that I'm going to hell for using setjmp and longjmp.
  5. In 7th grade, I tied for 10th place in the state for the Mathematics League. I personally know 11 of the other people who did at least as well. One of those people I actually referred to earlier in this post. And it's not Haitao.
  6. I was once in a fight in 7th grade. The fight happened like this. Step 1: I was knocked into a bench in a locker room. Step 2: I was knocked over this bench into the lockers in same locker room. Fight over. I don't think I actually bled, but it hurt, IIRC.
  7. For my senior prom, I went with four girls, for a definition of "went" more exclusive than "went in the same dinner/limo group" (there were actually around 12 or so girls in our group). Incidentally, all four girls were Asian, which is fairly representative of the ethnic makeup of my high school (or at least among those I knew).
  8. Because of my speech impediment (or so I've heard), I've been fairly often asked if I'm British. Technically speaking, it's not a lisp—my problems are with r's, not s's. I wonder if people here think I sound British (I certainly don't)?

Time for poking people:

Gary Kwong
Fellow intern and bug triager, I wonder what he's done.
Siddharth Agarwal
His blog has been pretty empty.
David Bienvenu
He needs the excuse to blog.
Jesse Ruderman
For finding so many typos in my blog postings.
Shawn Wilsher
mozStorage writer, and one with whom I've spent some time trying to debug a TB news problem…
timeless
Because…he's awesome?
Saul (whose blog is inaccessible)
Because it doesn't take a computer scientist to know the problem with pyramid schemes.

Hey look, I got through the entire post without mentioning demorkification! Oh cra—

A problem with feature requests

Among many of my secret lives, I personally follow several mailing lists, newsgroups, and technology updates for reasons beyond the scope of this posting. I follow the WHATWG mailings list somewhat closely. One issue that has been raging quite ferociously recently was whether or not to include RDFa, eRDF, RDF, or whatever into HTML 5.

Many of my readers, especially those for whom RDF is immediately associated with cumbersome, inflexible APIs that should be rid from programs, are probably wondering why one would want to include such a specification into HTML 5. Don't worry, you're not alone. Even Ian Hickson, the editor of the HTML 5 spec, is having problems trying to figure out why. To whit:

One of the outstanding issues for HTML5 is the question of whether HTML5 should solve the problem that RDFa solves, e.g. by embedding RDFa straight into HTML5, or by some other method. Before I can determine whether we should solve this problem, and before I can evaluate proposals for solving this problem, I need to learn what the problem is.

A bit of context: earlier in 2008, there was a previous thread about RDF measuring well in excess of 100 messages. Being a good editor, Hixie asked in this message about what problems it was trying to solve. The response? Seventy-three email messages, most of which promptly ignored the issue. Many sub-discussions centered around things such as quality of search engines, how it should be implemented, etc. The idea of trying to figure out why people should use it got lost in the wind.

And to me, this signifies a problem. One of the most important questions when deciding whether or not to include a feature is why. And it seems to me that this question is the one that is least pondered by proponents of new features. The answer is often some variant of "it's obvious" or a description of what the feature does. The last bit is like trying to answer a question of "Why do you want to put a door in the wall here?" with an answer "Because we can have quicker access to the other side of the wall." At first glance, it's acceptable, but in reality, it doesn't justify the feature (why then, do you want quicker access?).

There are more instances where I've seen this. One of them that aggravates me the most is the proposal to include closures in Java. There are several conflating issues in the entire controversy, so here's some background. There are three proposals for closures: CICE (usually with ARM included), which really isn't a closures proposal, more of a "let's decrease anonymous inner class verbosity;" FCM, a "lite" version of closures which basically just makes methods first-class objects, and BGGA, which is full-blown closures support. By now, when people refer to a closure proposals, it's the full BGGA closure; FCM has all but disappeared, and CICE+ARM is generally only mentioned as a compromise opportunity.

The BGGA proposal can be viewed as roughly comprising three parts. The first is the idea of function pointers, the second is the ability to convert a function pointer to an interface so long as the interface has a single method and the function signatures match, and the third is (in a nutshell) the ability to create control structures. I have seen a lot of valid arguments on both sides for the first two portions, but the third portion still mystifies. Yet it is this third part which truly differentiated BGGA from FCM, and it is there where almost all of BGGA's complexity comes into play.

Just one catch: Why do you want or need the ability to create control structures? One control structure almost invariably pops out: the with construct, or (in other words) syntactic sugar for a try { ... } finally { ... } block. This is where the ARM portion of the CICE proposal comes in—it would add the only construct desired by any sizable amount of people. Okay then, that's one, how about another? And therein lies the problem. It's hard to come up with other examples. Proponents always mention the with and assume that everything else is evident. For something that is definitely going to increase complexity and difficulty in programming greatly (e.g., return 5 would behave differently, in some cases, than return 5;), you better have more than one, easily manually-addable use case.

Another problem is underestimation of what it takes to include a feature. A 16 MB extension for a 6 MB program is completely untenable, so let's include it into the 6MB program! How about introducing multiple tokens as a CSS unit token (think about it for a moment...)?

The final thing that irks me with feature requests is the importance with which people attach. The news that Java 7 will definitely not be containing closures (adding a complex, controversial feature to a specification already behind schedule isn't exactly tenable) seems to have been treated with proclamations that Java is dead or that it will die as a result. I somehow can't imagine that millions of Java programmers will suddenly switch because Java won't have closures—indeed, no programming languages ever in the Top 5 have ever had closures.

Similarly, the announcement that JavaScript had been disabled in trunk Thunderbird builds for email was met with a few vocal opponents complaining (I don't know of any non-Mozilla product that ever had support for JavaScript in email to begin with). And most of my readers are no doubt aware of the furor that removing MNG support caused.

Sunday, January 4, 2009

News submodule roadmap

It's the new year! As the owner of the NNTP submodule of Mozilla Mailnews, let me lay out a rough roadmap on what I seek to get done on the submodule for some time to come.

For Thunderbird 3 (and SeaMonkey 2.0, of course), the ability to filter on any header is probably in and of itself sufficient to make people happy. The only other change I think reasonable to make it into 3.0 is making get messages actually get messages. More substantive changes will have to wait for later.

For 3.next, I think the most important thing to tackle is news URIs, as they are incredibly broken for most practical purposes. High importance is ensuring that news://<server>/message-id and news://<server>/group are usable via external protocol handlers and command line. Making server-less URIs work would also be delightful.

Also important to me is more protocol support: AUTHINFO SASL, STARTTLS, and CAPABILITIES in roughly that order. While I'm at it, I'll also clean up some of the cruft surrounding nsNNTPProtocol. I would also like to have a lot more comprehensive tests for 3.next.

What's not on my plate is more binaries support. Specifically, I'll not be working on combine-and-decode (as useful as it is in general). Offline love I think will have to wait for 3.next.next before I can find some time to work on it.

Finally, I'd like to get all of the news bugs neatly triaged into categories. As of me writing this message, I count some 150 bugs in the Networking: News component, and I'm sure there are a few more bugs floating around in other components that I haven't found. I'd like to remove dupes and categorize all of these into specific categories (at least locally). And, more importantly, remove the cruft of these bugs. Many of the NEW bugs should be UNCO because no one ever really confirmed them.