Tuesday, May 7, 2013

Understanding the comm-central build system

Among the build systems peer, I am very much a newcomer. Despite working with Thunderbird for over 5 years, I've only grown to understand the comm-central build system in gory detail in the past year. Most of my work before then was in trying to get random projects working; understanding it more thoroughly is a result of attempting to eliminate the need for comm-central to maintain its own build system. The goal of moving our build system description to a series of moz.build files has made this task much more urgent.

At a high level, the core of the comm-central build system is not a single build system but rather three copies of the same build system. In simple terms, there's a matrix on two accesses: which repository does the configuration of code (whose config.status invokes it), and which repository does the build (whose rules.mk is used). Most code is configured and built by mozilla-central. That comm-central code which is linked into libxul is configured by mozilla-central but built by comm-central. tier_app is configured and built by comm-central. This matrix of possibilities causes interesting bugs—like the bustage caused by the XPIDLSRCS migration, or issues I'm facing working with xpcshell manifests—but it is probably possible to make all code get configured by mozilla-central and eliminate several issues for once and all.

With that in mind, here is a step-by-step look at how the amorphous monster that is the comm-central build system works:

python client.py checkout

And comm-central starts with a step that is unknown in mozilla-central. Back when everyone was in CVS, the process of building started with "check out client.mk from the server, set up your mozconfig, and then run make -f client.mk checkout." The checkout would download exactly the directories needed to build the program you were trying to build. When mozilla-central moved to Mercurial, the only external projects in the tree that Firefox used were NSPR and NSS, both of which were set up to pull from a specific revision. The decision was made to import NSPR and NSS as snapshots on a regular basis, so there was no need for the everyday user to use this feature. Thunderbird, on the other hand, pulled in the LDAP code externally, as well as mozilla-central, while SeaMonkey also pulls in the DOM inspector, Venkman, and Chatzilla as extensions. Importing a snapshot was not a tenable option for mozilla-central, as it updates at an aggressive rate, so the functionality of checkout was ported to comm-central in a replacement python fashion.

./configure [comm-central]

The general purpose of configure is to discover the build system and enable or disable components based on options specified by the user. This results in a long list of variables which is read in by the build system. Recently, I changed the script to eliminate the need to port changes from mozilla-central. Instead, this script reads in a few variables and tweaks them slightly to produce a modified command line to call mozilla-central's configure...

./configure [mozilla-central]

... which does all the hard work. There are hooks in the configure script here to run a few extra commands for comm-central's need (primarily adding a few variables and configuring LDAP). This is done by running a bit of m4 over another file and invoking that as a shell script; the m4 is largely to make it look and feel "like" autoconf. At the end of the line, this dumps out all of the variables to a file called config.status; how these get configured in the script is not interesting.

./config.status [mozilla/comm-central]

But config.status is. At this point, we enter the mozbuild world and things become very insane; failure to appreciate what goes on here is a very good way to cause extended bustage for comm-central. The mozbuild code essentially starts at a directory and exhaustively walks it to build a map of all the code. One of the tasks of comm-central's configure is to alert mozbuild to the fact that some of our files use a different build system. We, however, also carefully hide some of our files from mozbuild, so we run another copy of config.status again to add in some more files (tier_app, in short). This results in our code having two different notions of how our build system is split, and was never designed that way. Originally, mozilla-central had no knowledge of the existence of comm-central, but some changes made in the Firefox 4 timeframe suddenly required Thunderbird and SeaMonkey to link all of the mailnews code into libxul, which forced this contorted process to come about.

make

Now that all of the Makefiles have bee generated, building can begin. The first directory entered is the top of comm-central, which proceeds to immediately make all of mozilla-central. How mozilla-central builds itself is perhaps an interesting discussion, but not for this article. The important part is that partway through building, mozilla-central will be told to make ../mailnews (or one of the other few directories). Under recursive make, the only way to "tell" which build system is being used is by the directory that the $(DEPTH) variable is pointing to, since $(DEPTH)/config/config.mk and $(DEPTH)/config/rules/mk are the files included to get the necessary rules. Since mozbuild was informed very early on that comm-central is special, the variables it points to in comm-central are different from those in mozilla-central—and thus comm-central's files are all invariably built with the comm-central build system despite being built from mozilla-central.

However, this is not true of all files. Some of the files, like the chat directory are never mentioned to mozilla-central. Thus, after the comm-central top-level build completes building mozilla-central, it proceeds to do a full build under what it thinks is the complete build system. It is here that later hacks to get things like xpcshell tests working correctly are done. Glossed over in this discussion is the fun process of tiers and other dependency voodoo tricks for a recursive make.

The future

With all of the changes going on, this guide is going to become obsolete quickly. I'm experimenting with eliminating one of our three build system clones by making all comm-central code get configured by mozilla-central, so that mozbuild gets a truly global view of what's going on—which would help not break comm-central for things like eliminating the master xpcshell manifest, or compiling IDLs in parallel. The long-term goal, of course, is to eliminate the ersatz comm-central build system altogether, although the final setup of how that build system works out is still not fully clear, as I'm still in the phase of "get it working when I symlink everything everywhere."

Wednesday, April 10, 2013

TBPL

When running final tests for my latest patch queue, I discovered that someone has apparently added a new color to the repertoire: a hot pink. So I now present to you a snapshot of TBPL that uses all the colors except gray (running), gray (pending), and black (I've never seen this one):

Sunday, April 7, 2013

JSMime status update

This post is admittedly long overdue, but I kept wanting to delay this post until I actually had patches up for review. But I have the tendency to want to post nothing until I verify that the entire pipeline consisting of over 5,000 lines of changes is fully correct and documented. However, I'm now confident in all but roughly three of my major changes, so patch documentation and redistribution is (hopefully) all that remains before I start saturating all of the reviewers in Thunderbird with this code. An ironic thing to note is that these changes are actually largely a sidetrack from my original goal: I wanted to land my charset-conversion patch, but I first thought it would be helpful to test with nsIMimeConverter using the new method, which required me to implement header parsing, which is best tested with nsIMsgHeaderParser, which turns out to have needed very major changes.

As you might have gathered, I am getting ready to land a major set of changes. This set of changes is being tracked in bugs 790855, 842632, and 858337. These patches are implementing structured header parsing and emission, as well as RFC 2047 decoding and encoding. My goal still remains to land all of these changes by Thunderbird 24, reviewers permitting.

The first part of JSMime landed back in Thunderbird 21, so anyone using the betas is already using part of it. One of the small auxiliary interfaces (nsIMimeHeaders) was switched over to the JS implementation instead of libmime's implementation, as well as the ad-hoc ones used in our test suites. The currently pending changes would use JSMime for the other auxiliary interfaces, nsIMimeConverter (which does RFC 2047 processing) and nsIMsgHeaderParser (which does structured processing of the addressing headers). The changes to the latter are very much API-breaking, requiring me to audit and fix every single callsite in all of comm-central. On the plus side, thanks to my changes, I know I will incidentally be fixing several bugs such as quoting issues in the compose windows, a valgrind error in nsSmtpProtocol.cpp, or the space-in-2047-encoded-words issue.

It's not all the changes, although being able to outright remove 2000 lines of libmime is certainly a welcome change. The brunt of libmime remains the code that is related to the processing of email body parts into the final email display method, which is the next target of my patches and which I originally intended to fix before I got sidetracked. Getting sidetracked isn't altogether a bad thing, since, for the first time, it lets me identify things that can be done in parallel with this work.

A useful change I've identified that is even more invasive than everything else to date would be to alter our view of how message headers work. Right now, we tend to retrieve headers (from, say, nsIMsgDBHdr) as strings, where the consumer will use a standard API to reparse them before acting on their contents. A saner solution is to move the structured parsing into the retrieval APIs, by making an msgIStructuredHeaders interface, retrievable from nsIMsgDBHdr and nsIMsgCompFields from which you can manipulate headers in their structured form instead of their string from. It's even more useful on nsIMsgCompFields, where keeping things in structured form as long as possible is desirable (I particularly want to kill nsIMsgCompFields.splitRecipients as an API).

Another useful change is that our underlying parsing code can properly handle groups, which means we can start using groups to handle mailing lists in our code instead of…the mess we have now. The current implementation sticks mailing lists as individual addresses to be expanded by code in the middle of the compose sequence, which is fragile and suboptimal.

The last useful independent change I can think of is rewriting the addressing widget in the compose frontend to store things internally in a structured form instead of the MIME header kludge it currently uses; this kind of work could also be shared with the similar annoying mailing list editing UI.

As for myself, I will be working on the body part conversion process. I haven't yet finalized the API that extensions will get to use here, as I need to do a lot of playing around with the current implementation to see how flexible it is. The last two entry points into libmime, the stream converter and Gloda, will first be controlled by preference, so that I can land partially-working versions before I land everything that is necessary. My goal is to land a functionality-complete implementation by Thunderbird 31 (i.e., the next ESR branch after 24), so that I can remove the old implementation in Thunderbird 32, but that timescale may still be too aggressive.

Friday, February 15, 2013

Why software monocultures are bad

If you do anything with web development, you are probably well aware that Opera recently announced that it was ditching its Presto layout engine and switching to Webkit. The reception of the blogosphere to this announcement has been decidedly mixed, but I am disheartened by the news for a very simple reason. The loss of one of the largest competitors in the mobile market risks entrenching a monoculture in web browsing.

Now, many people have attempted to argue against this risk by one of three arguments: that Webkit already is a monoculture on mobile browsing; that Webkit is open-source, so it can't be a "bad" monoculture; or that Webkit won't stagnant, so it can't be a "bad" monoculture. The first argument is rather specious, since it presumes that once a monoculture exists it is pointless to try to end it—walk through history, it's easier to cite examples that were broken than ones that weren't. The other two arguments are more dangerous, though, because they presume that a monoculture is bad only because of who is in charge of it, not because it is a monoculture.

The real reason why monocultures are bad are not because the people in control do bad things with it. It's because their implementations—particularly including their bugs—becomes the standards instead of the specifications themselves. And to be able to try to crack into that market, you have to be able to reverse engineer bugs. Reverse engineering bugs, even in open-source code, is far from trivial. Perhaps it's clearer to look at the problems of monoculture by case study.

In the web world, the most well-known monoculture is that of IE 6, which persisted as the sole major browser for about 4 years. One long-lasting ramification of IE is the necessity of all layout engines to support a document.all construct while pretending that they do not actually support it. This is a clear negative feature of monocultures: new things that get implemented become mandatory specifications, independent of the actual quality of their implementation. Now, some fanboys might proclaim that everything Microsoft does is inherently evil and that this is a bad example, but I will point out later known bad-behaviors of Webkit later.

What about open source monocultures? Perhaps the best example here is GCC, which was effectively the only important C compiler for Linux until about two years ago, when clang become self-hosting. This is probably the closest example I have to a mooted Webkit monoculture: a program that no one wants to write from scratch and that is maintained by necessity by a diverse group of contributors. So surely there are no aftereffects from compatibility problems for Clang, right? Well, to be able to compile code on Linux, Clang has to pretty much mimic GCC, down to command-line compatibility and implementing (almost) all of GCC's extensions to C. This also implies that you have to match various compiler intrinsics (such as those for atomic operations) exactly as GCC does: when GCC first implemented proper atomic operations for C++11, Clang was forced to change its syntax for intrinsic atomic operations to match as well.

The problem of implementations becoming the de facto standard becomes brutally clear when a radically different implementation is necessary and backwards compatibility cannot be sacrificed. IE 6's hasLayout bug is a good example here: Microsoft thought it easier to bundle an old version of the layout engine in their newest web browser to support old-compatibility webpages than to try to adaptively support it. It is much easier to justify sacking backwards compatibility in a vibrant polyculture: if a website works in only one layout engine when there are four major ones, then it is a good sign that the website is broken and needs to be fixed.

All of these may seem academic, theoretical objections, but I will point out that Webkit has already shown troubling signs that do not portend to it being a "good" monoculture. The decision to never retire old Webkit prefixes is nothing short of an arrogant practice, and clearly shows that backwards compatibility (even for nominally non-production features) will be difficult to sacrifice. The feature that became CSS gradients underwent cosmetic changes that made things easier for authors that wouldn't have happened in a monoculture layout engine world. Chrome explicitly went against the CSS specification (although they are trying to change it) in one feature with the rationale that is necessary for better scrolling performance in Webkit's architecture—which neither Gecko nor Presto seem to require. So a Webkit monoculture is not a good thing.

Updated DXR on dxr.mozilla.org

If you are a regular user of DXR, you may have noticed that the website today looks rather different from what you are used to. This is because it has finally been updated to a version dating back to mid-January, which means it also includes many of the changes developed over the course of the last summer, previously visible only on the development snapshot (which didn't update mozilla-central versions). In the coming months, I also plan to expand the repertoire of indexed repositories from one to two by including comm-central. Other changes that are currently being developed include a documentation output feature for DXR as well as an indexer that will grok our JS half of the codebase.

Monday, January 21, 2013

DXR's future potential

The original purpose of DXR, back when the "D" in its name wasn't a historical artifact, was to provide a replacement for MXR that grokked Mozilla's source code a lot better. In the intervening years, DXR has become a lot better at being an MXR replacement, so I think that perhaps it is worth thinking about ways that DXR can start going above and beyond MXR—beyond just letting you searched for things like "derived" and "calls" relationships.

The #ifdef problem

In discussing DXR at the 2011 LLVM Developers' Conference, perhaps the most common question I had was asking what it did about the #ifdef problem: how does it handle code present in the source files but excluded via conditional compilation constructs? The answer then, as it is now, was "nothing:" at present, it pretends that code not compiled doesn't really exist beyond some weak attempts to lex it for the purposes of syntax highlighting. One item that has been very low priority for several years was an idea to fix this issue by essentially building the code in all of its variations and merging the resulting database to produce a more complete picture of the code. I don't think it's a hard problem at all, but rather just an engineering concern that needs a lot of little details to be worked out, which makes it impractical to implement while the codebase is undergoing flux.

Documentation

Documentation is an intractable unsolved problem that makes me wonder why I bring it up here…oh wait, it's not. Still, from the poor quality of most documentation tools out there when it comes to grokking very large codebases (Doxygen, I'm looking at you), it's a wonder that no one has built a better one. Clang added a feature that lets it associate comments to AST elements, which means that DXR has all the information it needs to be able to build documentation from our in-tree documentation. With complete knowledge of the codebase and a C++ parser that won't get confused by macros, we have all the information we need to be able to make good documentation, and we also have a very good place to list all of this documentation.

Indexing dynamic languages

Here is where things get really hard. A language like Java or C# is very easy to index: every variable is statically typed and named, and fully-qualified names are generally sufficient for global uniqueness. C-based languages lose the last bit, since nothing enforces global uniqueness of type names. C++ templates are effectively another programming language that relies on duck-typing. However, that typing is still static and can probably be solved with some clever naming and UI; dynamic languages like JavaScript or Python make accurately finding the types of variables difficult to impossible.

Assigning static types to dynamic typing is a task I've given some thought to. The advantage in a tool like DXR is that we can afford to be marginally less accurate in typing in trade for precision. An example of such an inaccuracy would be ignoring what happens with JavaScript's eval function. Inaccuracies here could be thought of as inaccuracies resulting from a type-unsafe language (much like any C-based callgraph information is almost necessarily inaccurate due to problems inherent to pointer alias analysis). The actual underlying algorithms for recovering types appear known and documented in academic literature, so I don't think that actually doing this is theoretically hard. On the other hand, those are very famous last words…

Wednesday, November 7, 2012

Autotools, how I hate thee

When writing custom passes for a compiler, it's often a good idea to try running them on real-world programs to assess things like scalability and correctness. Taking a large project and seeing your pass work (and provide useful results!) is an exhilarating feeling. On the other hand, trying to feed in your compiler options into build systems is a good route to an insane asylum.

I complained some time ago about autoconf 2.13 failing because it assumes implicit int. People rightly pointed out that newer versions of autoconf don't assume that anymore. But new versions still come with their own cornucopias of pain. Libtool, for example, believes that the best route to linking a program is to delete every compiler flag from the command line except those it knows about. Even if you explicitly specify them in LDFLAGS. Then there's this conftest program that I found while compiling gawk:

/* Define memcpy to an innocuous variant, in case <limits.h> declares memcpy.
   For example, HP-UX 11i <limits.h> declares gettimeofday.  */
#define memcpy innocuous_memcpy

/* System header to define __stub macros and hopefully few prototypes,
    which can conflict with char memcpy (); below.
    Prefer <limits.h> to <assert.h> if __STDC__ is defined, since
    <limits.h> exists even on freestanding compilers.  */

#ifdef __STDC__
# include <limits.h>
#else
# include <assert.h>
#endif
#undef memcpy

/* Override any GCC internal prototype to avoid an error.
   Use char because int might match the return type of a GCC
   builtin and then its argument prototype would still apply.  */
#ifdef __cplusplus
extern "C"
#endif
char memcpy ();
/* The GNU C library defines this for functions which it implements
    to always fail with ENOSYS.  Some functions are actually named
    something starting with __ and the normal name is an alias.  */
#if defined __stub_memcpy || defined __stub___memcpy
choke me
#endif

int
main ()
{
return memcpy ();
  ;
  return 0;
}

…I think this code speaks for itself in how broken it is as a test. One of the parts of the compiler pass involved asserted due to memcpy not being used in the right way, crashing the compiler. Naturally, this being autoconf, it proceeded to assume that I didn't have a memcpy and thus decided to provide me one, which causes later code to break in spectacular bad function when you realize that memcpy is effectively a #define in modern glibc. And heaven forbid if I should try to compile with -Werror in configure scripts (nearly every test program causes a compiler warning along the lines of "builtin function is horribly misused").

The saddest part of all is that, as bad as autoconf is, it appears to be the least broken configuration system out there…