Wednesday, June 29, 2011

Mozilla-central and DXR

Some of you may be wondering why it is taking me so long to build a DXR index of mozilla-central. Part of the answer is that I'm waiting to bring DXR on clang back up to the original instantiation in terms of feature support. The other part of the answer is that mozilla-central is rather big to index. How big? Well, some statistics await…

I first noticed a problem when I was compiling and found myself at 4 GiB of space remaining and dwindling fast. I'm used to fairly tight space restrictions (a result of malportioning the partitions on my last laptop), so that's not normally a source of concern. Except I had been at 10 GiB of free space an hour before that, before I started the run to index mozilla-central.

The original indexing algorithm is simple: for each translation unit (i.e., C++ file), I produce a corresponding CSV file of the "interesting" things I find in that translation unit—which includes all of the header files included. As a result, I emit the data for every header file each time it is included by some compiled file. The total size of all data files produced by the mozilla-central run turned out to be around 12GiB. When I collected all of the data and removed duplicate lines, the total size turned out to be about 573MiB.

Step back and think about what this means for a moment. Since "interesting" things to DXR basically boil down to all warnings, declarations, definitions, and references (macros and templates underrepresented), this implies that every declaration, definition, and reference is parsed by the compiler, on average, about 20-25 times. Or, if you take this as a proxy for the "interesting" lines of code, the compiler must read every line of code in mozilla-central about 20-25 times.

The solution for indexing is obviously not to output that much data in the first place. Since everyone who includes, say, nscore.h does so in the same way, there's no reason to reoutput its data in all of the thousands of files that end up including it. However, a file like nsTString.h (for the uninitiated, this is part of the standard internal string code interfaces, implemented using C's versions of templates, more commonly known as "the preprocessor") can be included multiple times but produce different results: one for nsString and one for nsCString in the same place. Also, there is the question of making it work even when people compile with -jN, since I have 4 cores that are begging for the work.

It was Taras who thought up the solution. What we do is we separate out all of the CSV data by the file that it comes in. Then, we store each of the CSV data in a separate file whose name is a function of both the file it comes in and its contents (actually, it's <file>.<sha1(contents)>.csv, for the curious). This also solves the problem of multiple compilers trying to write the same file at the same time: if we open with O_CREAT | O_EXCL and the open fails because someone else created the file… we don't need to do anything because the person who opens the file will write the same data we wanted to write! Applying this technique brings the total generated CSV file data down to around 1GiB (declaration/definition mappings account for the need for duplicates), or down to about 2 times the real data size instead of 20 times. Hence why the commit message for fixing this is titled Generate much less data. MUCH LESS.

Of course, that doesn't solve all of my problems. I still have several hundred MiBs of real data that need to be processed and turned into the SQL database and HTML files. Consuming the data in python requires a steady state of about 3-3.5 GiB of resident memory, which is a problem for me since I stupidly gave my VM only 4GiB of memory and no swap. Switching the blob serialization from using cPickle (which tries to keep track of duplicate references) to marshal (which doesn't) allowed me to postpone crashing until I generate the SQL database, where SQL allocates just enough memory to push me over the edge, despite generating all of the SQL statements using iterators (knowing that duplicate processing is handled more implicitly). I also switched from using multiple processes to using multiple threads to avoid Linux failing to fork Python due to not enough memory (shouldn't it only need to copy-on-write?).

Obviously, I need to do more work on shrinking the memory usage. I stripped out the use of the SQL for HTML generation and achieved phenomenal speedups (I swore that I had to have broken something since it went from a minute to near-instantaneous). I probably need to move to a light-weight database solution, for example, LevelDB, but I don't quite have a simple (key, value) situation but a (file, plugin, table, key, value) one. Still not hard to do, but more than I want to test for a first pass.

In short, DXR indexing for mozilla-central is being recreated as I write this. Enjoy.


Anonymous said...

What analysis is DXR doing on these files? Does it matter if the meaning of a file is changed not only based on its filename and hash, but also on some preceding input? (For example, if you #define GNU_SOURCE before #including the file, or typedef something that the file uses.)

I suspect it doesn't matter for DXR, but that's the thought that immediately jumped into my mind because those are the things that bedevil precompiled headers.

Joshua Cranmer said...

Ah sorry, I should have made this clearer.

The hash of the contents is the hash of the CSV that I'm storing in the CSV file, not the hash of the original source file itself.

Zizzle said...

You can add swap without a swap partition. Just create a swap file.

[from memory]

dd if=/dev/zero of=/swap.file bs=1M count=200

mkswap /swap.file

echo "/swap.file none swap sw 0 0" >> /etc/fstab

swapon -a

Gordon P. Hemsley said...

I get an Internet Server Error when I try to search something.

And I noticed that you haven't fixed the ampersand escape issue that I mentioned on GitHub yet.