Wednesday, August 3, 2011

Not-so-random mozilla-central factoids

Back when I was looking at reducing disk usage of running DXR, I used the sizes of the generated CSV files to make a rough estimate of how many times the average line of mozilla-central needs to be parsed and compiled (the answer is around 20). Now, having just respun a build of mozilla-central with the newest version of DXR, I have a larger database with some fairly accurate statistics of its size. This all started when I wondered what our most common function name was, so I decided to make a list of factoids (it's also a chance to refresh my SQL memory). All of the statistics come from a build of Firefox today on Linux x86-64 with debug and tests disabled.

Notes on accuracy

DXR tries as hard as possible to correlate data back to the original source code, and it is very likely that it can get itself confused in some weird circumstances. I don't yet have a good idea of where all of the buggy cases are, but I am aware of some broad strokes. Pretty much all information that deals with references is likely missing large swathes of the codebase. Information about callers only counts explicit calls (i.e., call expressions). The most accurate data I have is the macro information, and the least accurate is information involving a templated class in almost any fashion. I also suspect that scope information is malformed in a non-empty set of cases, and I'm pretty sure that the number of global objects is overcounted in any count.

Type statistics

I count in mozilla-central just 33,555 distinct types. Of these, we have 13,648 typedefs, 6,523 structs, 6,470 classes, 5,147 enums, 1,410 interfaces, and 357 unions. Of all of these, 13,740 types are nested in another in some fashion, and another 4,524 types are templated. I'm not counting separate template instantiations as new types, but I am counting specializations individually.

Then there's the inheritance. I found 7,715 direct inheritance relations, all but a handful of which (around 200) are public. These relations account for 2,317 distinct base classes and 6,157 distinct subclasses. Naturally, some types are inherited much more than others. The winner, by a factor of 6, is nsISupports, having a whopping 3,156 implementations. Pickle and IPC::Message tie for second place with 501 implementations each; nsIRunnable takes 4th place at 282 implementations. The 5th place goes to nsXPCOMCycleParticipant (242). Rounding out the top 10 are nsIDOMElement (226), nsRunnable (224), nsScriptObjectTracer (209), nsSupportsWeakReference (154), and nsIObserver with 144. Subclass relationships involving templates are not counted, which may bump a few classes up into this list.

Macros

There are 42,457 distinct macro definitions, producing 38,045 distinct names. Of these names, 30,475 look like variables and 7,647 look like functions, which implies that 77 macros take both forms depending on how they get defined.

How about calling them? I count just 482,142 macro invocations, so each macro is being invoked about 12 times on average. But 20,628 of our macros (almost 48.6% of them) are never used, so the average among the macros that do get used is closer to 25 times. Of course, some macros really get used. Here are the top 5:

Count     Macro name
23,109    nsnull
22,941    PR_FALSE
20,139    NS_OK
13,154    NS_IMETHOD
13,138    NS_IMETHODIMP
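
The arithmetic behind the macro figures above can be rechecked with a short script. This is only a sketch: the post does not say whether the averages are taken over definitions or over distinct names, so both denominators are computed here as an assumption, and the function and key names are made up for illustration.

```python
def macro_stats():
    """Recompute the macro figures quoted in the post."""
    definitions = 42_457    # distinct macro definitions
    names = 38_045          # distinct macro names
    variable_like = 30_475
    function_like = 7_647
    invocations = 482_142
    unused = 20_628

    return {
        # Names that appear in both categories (inclusion-exclusion).
        "both_forms": variable_like + function_like - names,
        # Share of unused macros, assuming the base is definitions.
        "unused_share": unused / definitions,
        # Average invocations per macro, for each possible denominator.
        "avg_per_definition": invocations / definitions,
        "avg_per_name": invocations / names,
        # Same averages once the unused macros are excluded.
        "avg_per_used_definition": invocations / (definitions - unused),
        "avg_per_used_name": invocations / (names - unused),
    }


stats = macro_stats()
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

The rounded "about 12" and "closer to 25" in the text sit between the per-definition and per-name results, which is consistent with either reading.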

Functions

There are 137,903 functions in mozilla-central, with 68,867 distinct names among them. Of these, I found 33,693 in the global scope and 8,943 that were members of a class. Directly templated functions account for about 2,291 functions. Just for fun, I found that there are 774 functions named exactly "Init" and a further 1,681 whose names begin with "Init" (matched case-insensitively in the latter case).

In terms of calling these functions, I found 291,598 distinct edges in the callgraph (this is definitely an underestimate, since I am missing a large number of cases). For my usage, a callgraph is not a traditional directed graph but rather a hypergraph, where each edge goes from a single head to a set of nodes in the tail. These edges comprise 85,144 distinct callers and 67,098 distinct targets. Of the targets, I found 51,590 distinct functions being called statically, 13,274 distinct virtual functions invoked, and 2,234 distinct function pointers or pointers-to-member-functions being called. If I break it up by calls, 246,086 of the calls are static function calls, 41,770 are virtual function calls, and 3,742 are function pointer calls. I want to emphasize here that information pertaining to templates, in particular nsCOMPtr, is completely missing, so a lot of calls to nsISupports's methods are missing, which is going to horribly skew the statistics.
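
To make the hypergraph idea concrete, here is a minimal sketch. It assumes each edge pairs one caller with the set of functions the call might reach: a single function for a static call, possibly several for a virtual or function pointer call. That reading of "a set of nodes in the tail" is my assumption, and the class, field, and function names are hypothetical rather than DXR's own.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CallEdge:
    caller: str          # the single head of the hyperedge
    targets: frozenset   # the tail: every function the call may reach
    kind: str            # "static", "virtual", or "pointer"


def summarize(edges):
    """Return (distinct callers, distinct reachable functions, calls per kind)."""
    callers = {edge.caller for edge in edges}
    reachable = set().union(*(edge.targets for edge in edges))
    calls_by_kind = Counter(edge.kind for edge in edges)
    return len(callers), len(reachable), dict(calls_by_kind)


# Made-up edges, purely for illustration.
edges = [
    CallEdge("Foo::Run", frozenset({"Bar::Init"}), "static"),
    CallEdge("Foo::Run", frozenset({"A::Release", "B::Release"}), "virtual"),
    CallEdge("Baz::Go", frozenset({"OnDone", "OnError"}), "pointer"),
]

print(summarize(edges))
```

Keeping the tail as a set means a virtual call is recorded once, however many implementations it could dispatch to, so edge counts stay comparable across the three kinds of call.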

Counting the function pointers, we have 65 pointers-to-member-functions and around 1,700-1,800 function pointers (those numbers do not add up to what I should get above, but I'm not sure who's in error here). I count about 19,721 virtual functions that I generated target information for (a subset of all virtual functions), and 65,761 implementations of those virtual functions, so the average virtual function has about 3.3 implementations. In addition, I found that about 981 of these virtual functions were also called statically.

Now I'm sure, having mentioned it earlier, that you too are now wondering what the most common function name in mozilla-central is. The answer should be pretty obvious when you consider the most heavily-implemented class. And the winners are...

Count    Function name
1,687    Release
1,680    AddRef
1,600    GetIID
1,587    QueryInterface
774      Init
659      operator=
549      Log
517      Read
506      Write
396      GetType

The top 4 methods are related to XPCOM; the famed Init method takes a mere 5th place. Also of note, there are 659 assignment operators; I'm guessing some of these may be default copy-assignment operators generated for non-POD classes.
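
For anyone who wants to run this kind of query themselves, here is a rough sketch using Python's built-in sqlite3 module. The table and column names (functions, fname) are assumptions for illustration and may not match the real DXR schema, and the rows are made-up sample data rather than anything from mozilla-central.

```python
import sqlite3


def most_common_names(conn, limit=10):
    """Return (name, count) pairs for the most frequent function names."""
    return conn.execute(
        "SELECT fname, COUNT(*) AS n FROM functions "
        "GROUP BY fname ORDER BY n DESC, fname ASC LIMIT ?",
        (limit,),
    ).fetchall()


def init_counts(conn):
    """Return (exactly 'Init', other names starting with 'init')."""
    exact = conn.execute(
        "SELECT COUNT(*) FROM functions WHERE fname = 'Init'"
    ).fetchone()[0]
    # SQLite's LIKE is case-insensitive for ASCII by default, while '<>'
    # compares case-sensitively, so only the exact name is excluded here.
    prefixed = conn.execute(
        "SELECT COUNT(*) FROM functions "
        "WHERE fname LIKE 'init%' AND fname <> 'Init'"
    ).fetchone()[0]
    return exact, prefixed


# Hypothetical schema and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (fname TEXT, fqualname TEXT)")
conn.executemany(
    "INSERT INTO functions VALUES (?, ?)",
    [
        ("Release", "A::Release"), ("Release", "B::Release"),
        ("Release", "C::Release"), ("AddRef", "A::AddRef"),
        ("AddRef", "B::AddRef"), ("Init", "A::Init"), ("Init", "B::Init"),
        ("InitWithData", "A::InitWithData"), ("initialize", "initialize"),
        ("Read", "Stream::Read"),
    ],
)

print(most_common_names(conn, 3))
print(init_counts(conn))
```

Sorting ties by name keeps the output stable between runs, which helps when comparing results across rebuilds of the database.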

Variables

There's not much to say here. We have some 623,237 variables of some kind. Most of these, naturally, are local variables or parameters: 516,627, to be precise. We additionally have 82,008 members of some compound type, and 24,602 global variables. Some interesting statistics would be to compute the number of variables whose names defy our naming convention or the number of static constructors that need to be run before startup, but that data is harder to compute.

Warnings

I count 5,283 reported warnings for mozilla-central. Of these, 1,608 are warning about the use of non-virtual destructors with virtual functions. Another 940 warn about our use of mismatched enumeration types. There are 1,348 warnings of unused things. Finally, there are 1,551 warnings about use of extensions, leaving 466 "miscellaneous warnings".

Build statistics

The final set of statistics I have is mere size statistics for the build information. There are about 8,581,746 non-empty lines of text comprising some 320MiB of data. These are organized into about 51,469 files, including 1,469 generated files in the build directory. My output SQLite file was 464MiB, easily comprising around 3.5 million rows of data. We also have about 38MiB of binary files (like PNGs) in the source tree. The generated files account for almost 15MiB, comprising around 354,893 non-empty lines of text.

I think the best way to summarize this data is "mozilla-central is a massive codebase." It also goes to show you why just looking at source code without compiling is a bad idea: around 3-5% of our source code actually isn't in the source tree to begin with but is instead automatically created at compile time.
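
The "around 3-5%" estimate can be rechecked from the build statistics above. This is a small sketch only: the helper name is hypothetical, and 15MiB is treated as a rounded-up figure because the post says "almost 15MiB".

```python
def share(part, whole):
    """Fraction of the whole that the part represents."""
    return part / whole


# Generated code as a share of the tree, measured three ways.
lines_share = share(354_893, 8_581_746)  # non-empty lines
size_share = share(15, 320)              # MiB, slightly overestimated
files_share = share(1_469, 51_469)       # file count

for label, value in [
    ("lines", lines_share),
    ("size", size_share),
    ("files", files_share),
]:
    print(f"{label}: {value:.1%}")
```

By line count and by size the generated share lands inside the 3-5% range, while by file count it comes out a little lower.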

4 comments:

Anonymous said...

There is a little problem:

"Of these, we have [...], 5,147 enums, 1,410 interfaces, and 357 enums."

Is it 5,147 or 357 enums?

Joshua Cranmer said...

Gah, that was a typo I copied from another typo. The second enum should be 'unions'; this is fixed now.

Some more searching with respect to macros:
Macros which are only checked in preprocessor-land are considered unused for the purposes of macro counts right now. This includes, most importantly, include guards, so our 20K unused macros is really only about 14K unused macros if I exclude macros that have "_h_". Some perusing of more unused macros turns up NS_FORWARD_* as a common unused macro; excluding the generated files leaves us with around 9K unused macro names.

Jittering out referencing bugs leads me to believe there are a few hundred unused macros we could get rid of.

On a side note, I also see 5,823 warnings emitting 971 distinct messages.

Anonymous said...

That is really an awesome summary.
As your comment says, fixing the false positives for unused macros would be cool.

That way we would know how much of the code is unused...

kairo said...

This is awesome analysis - I take it that's just C++ etc. and not including JS code?