I first noticed a problem while compiling: I found myself at 4 GiB of disk space remaining and dwindling fast. I'm used to fairly tight space restrictions (a result of mis-portioning the partitions on my last laptop), so that's not normally a source of concern. Except that I had been at 10 GiB of free space an hour earlier, before I started the run to index mozilla-central.
The original indexing algorithm is simple: for each translation unit (i.e., each C++ file), I produce a corresponding CSV file of the "interesting" things I find in that translation unit, which includes everything in all of the header files it pulls in. As a result, I emit the data for every header file once per compiled file that includes it. The total size of all the data files produced by the mozilla-central run turned out to be around 12 GiB. When I collected all of the data and removed duplicate lines, the total came to about 573 MiB.
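Conceptually, that duplicate removal is just a set union over lines. A minimal sketch (the function name and glob-based file layout are my own illustration, not DXR's actual code):

```python
import glob

def dedup_csv_lines(pattern):
    """Collect the unique lines across all generated CSV files.

    Every header's data appears once per includer, so the set throws
    away the ~20x redundancy described above.
    """
    unique = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            unique.update(f)  # one entry per distinct line
    return unique
```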
Step back and think about what this means for a moment. Since the "interesting" things to DXR basically boil down to all warnings, declarations, definitions, and references (macros and templates are underrepresented), this implies that every declaration, definition, and reference is parsed by the compiler, on average, about 20-25 times. Or, taking this as a proxy for the "interesting" lines of code, the compiler must read every line of code in mozilla-central about 20-25 times.
The solution for indexing is obviously not to output that much data in the first place. Since everyone who includes, say, nscore.h does so in the same way, there's no reason to re-output its data in each of the thousands of files that end up including it. However, a file like nsTString.h (for the uninitiated, this is part of the standard internal string code interfaces, implemented using C's version of templates, more commonly known as "the preprocessor") can be included multiple times and produce different results: one for nsString and one for nsCString in the same place. There is also the question of making this work when people compile with -jN, since I have 4 cores that are begging for work.
It was Taras who thought up the solution: separate out all of the CSV data by the file it comes from, then store each file's data in a separate output file whose name is a function of both the source file and its contents (specifically, <file>.<sha1(contents)>.csv, for the curious). This also solves the problem of multiple compiler processes trying to write the same file at the same time: we open with O_CREAT | O_EXCL, and if the open fails because someone else created the file first, we don't need to do anything, because whoever created it will write exactly the data we wanted to write. Applying this technique brings the total generated CSV data down to around 1 GiB (declaration/definition mappings account for the remaining duplicates), or about 2 times the real data size instead of 20. Hence why the commit message for fixing this is titled Generate much less data. MUCH LESS.
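The scheme can be sketched in a few lines. The <file>.<sha1(contents)>.csv naming and the O_CREAT | O_EXCL race resolution are from the post; the helper itself is a hypothetical reconstruction, not DXR's code:

```python
import hashlib
import os

def write_csv_once(source_path, contents):
    """Write per-header CSV data under a content-addressed name.

    Because the name encodes a hash of the contents, two indexer
    processes racing on the same name are, by construction, trying to
    write byte-identical data, so the loser can simply walk away.
    """
    digest = hashlib.sha1(contents.encode()).hexdigest()
    out_path = "%s.%s.csv" % (source_path, digest)
    try:
        # O_EXCL makes creation atomic: exactly one process succeeds.
        fd = os.open(out_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Someone else created it; their data matches ours. Done.
        return out_path
    with os.fdopen(fd, "w") as f:
        f.write(contents)
    return out_path
```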
Of course, that doesn't solve all of my problems. I still have several hundred MiB of real data that needs to be processed and turned into the SQL database and HTML files. Consuming the data in Python requires a steady state of about 3-3.5 GiB of resident memory, which is a problem for me since I stupidly gave my VM only 4 GiB of memory and no swap. Switching the blob serialization from cPickle (which tries to keep track of duplicate references) to marshal (which doesn't) allowed me to postpone crashing until I generate the SQL database, where SQL allocates just enough memory to push me over the edge, despite my generating all of the SQL statements with iterators (knowing that duplicate processing is handled more implicitly). I also switched from multiple processes to multiple threads, to keep Linux from failing to fork Python for lack of memory (shouldn't fork only need to copy-on-write?).
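The serialization swap itself is a one-word change at each call site. A sketch under modern Python 3 names (pickle instead of Python 2's cPickle); marshal only handles built-in scalar and container types, which is exactly why it can skip pickle's reference-tracking machinery:

```python
import marshal

def dump_blob(blob, path):
    """Serialize a blob of plain dicts/lists/strings/ints with marshal.

    marshal does not deduplicate shared references the way pickle does,
    trading that bookkeeping (and its memory) for speed.
    """
    with open(path, "wb") as f:
        marshal.dump(blob, f)

def load_blob(path):
    with open(path, "rb") as f:
        return marshal.load(f)
```

The trade-off: marshal's format is Python-version-specific and rejects user-defined classes, which is acceptable here since the blobs are transient intermediate data.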
Obviously, I need to do more work on shrinking memory usage. I stripped out the use of SQL for HTML generation and achieved phenomenal speedups (I was sure I had to have broken something, since generation went from a minute to near-instantaneous). I probably need to move to a lightweight database solution such as LevelDB, but I don't quite have a simple (key, value) situation; mine is (file, plugin, table, key, value). Still not hard to do, but more than I want to test for a first pass.
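The usual way to fit a compound key into a flat key-value store is to pack the components into a single key with a separator that sorts them sensibly. A hypothetical sketch (the separator choice and function names are mine, and it assumes no component contains a NUL byte):

```python
SEP = b"\x00"  # assumption: no component ever contains a NUL byte

def pack_key(source_file, plugin, table, key):
    """Flatten (file, plugin, table, key) into one bytes key.

    With NUL as the separator, a lexicographically ordered store like
    LevelDB groups entries by file, then plugin, then table, so a range
    scan over one file's data is a contiguous sweep.
    """
    return SEP.join(p.encode() for p in (source_file, plugin, table, key))

def unpack_key(packed):
    return tuple(p.decode() for p in packed.split(SEP))
```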
In short, DXR indexing for mozilla-central is being recreated as I write this. Enjoy.