Saturday, June 4, 2011

DXR thinking aloud

A primary goal of DXR is to be easy to use. Like any simple statement, it is inordinately difficult to translate into design decisions, and I approach such issues best by thinking out loud. Normally, I do this in IRC channels, but today I think it would be best to think in a louder medium, since the problem is harder.

There are three distinct things that need to be made "easy to use". The first is the generation of the database and the subsequent creation of the web interface. The second is extension of DXR to new languages, while the last is the customization of DXR to provide more information. All of them have significant issues.

Building with DXR

Starting with the first one, build systems are complicated. For a simple GNU-ish utility, ./configure && make is a functioning build system. But where DXR is most useful is on the large, hydra-like projects where figuring out how to build the program is itself a nightmare: Mozilla, OpenJDK, Eclipse, etc. There are also a substantial number of non-autoconf-based systems which throw great wrenches into everything. At the very least, I know this much about DXR: I need to set environment variables before configuring and building (i.e., replace the compiler), I need to "watch" the build process (i.e., follow warning spew), and I need to do things after the build finishes (post-processing). Potential options:

  1. Tell the user to run this command before the build and after the build. On the plus side, this means that DXR needs to know absolutely nothing about how the build system works. On the down side, this requires confusing instructions: in particular, since I'm setting environment variables, the user has to source the setup script (i.e., type ". " in the shell) for the variables to take effect. Many people do not have enough shell experience to understand why that is necessary, and the workflow is different enough from their usual commands that they are liable to make mistakes.
  2. Guess what the build system looks like and try to do it all by ourselves. This is pretty much the opposite extreme, in that it foists all the work on DXR. If your program is "normal", this won't be a problem. If your program isn't... it will be a world of pain. Take also into consideration that any automated approach is likely to fail hard on Mozilla code to begin with, which effectively makes this a non-starter.
  3. Have the user input their build system to a configuration file and go from there. A step down from the previous item, but it increases the need for configuration files.
  4. Have DXR spawn a shell for the build system. Intriguing; it solves some problems but causes others.

Conclusion: well, I don't like any of those options. While the goal of essentially being able to "click" DXR and have it Just Work™ is nice, I have reservations about such an approach working in practice. I think I'll go with #1 and punt on this issue to someone with more experience.
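
To make the trade-off concrete, here is a rough Python sketch of the wrapper approach from #4, which dodges the sourcing problem in #1 by spawning the build itself. Every name in it (dxr-build, dxr-postprocess, the compiler wrapper paths) is invented for illustration, not an actual interface:

#!/usr/bin/env python
# dxr-build: hypothetical wrapper that sets up the instrumented
# compiler environment, runs the user's build command as a child
# process, and then runs DXR's post-processing step.
import os
import subprocess
import sys

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: dxr-build <build command...>")
    env = dict(os.environ)
    # Point the build at DXR's compiler wrappers (paths invented
    # for illustration).
    env["CC"] = "/usr/local/dxr/bin/dxr-cc"
    env["CXX"] = "/usr/local/dxr/bin/dxr-cxx"
    # Run whatever build command the user passed, e.g. dxr-build make -j4.
    ret = subprocess.call(sys.argv[1:], env=env)
    if ret != 0:
        sys.exit(ret)
    # Post-process whatever the instrumented build dumped out.
    subprocess.call(["dxr-postprocess"])

if __name__ == "__main__":
    main()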

Multiple languages

I could devote an entire week's worth of blog posts to this topic, I think, and I would wager that this is more complicated and nebulous than even build systems are. In the end, all we really need to worry about with build systems is replacing compilers with our versions and getting to the end; with languages, we actually need to be very introspective and invasive to do our job.

Probably the best place to start is actually laying out what needs to be done. If the end goal is to produce the source viewer, then we need to at least be able to do syntax highlighting. That by itself is difficult, but people have done it before: I think my gut preference at this point is to basically ask authors of DXR plugins to expose something akin to vim's syntax highlighting instead of asking them to write a full lexer for their language.
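
To make that concrete, here is a sketch of the sort of plugin interface I have in mind: declarative, vim-style match rules mapping regular expressions to token classes, applied one line at a time. The API here is invented for illustration; nothing like it exists yet:

import re

# Hypothetical plugin interface: instead of writing a full lexer, a
# language plugin declares vim-style match rules, and DXR applies
# them one source line at a time.
PYTHON_RULES = [
    ("keyword", re.compile(r"\b(?:def|class|import|return|if|else)\b")),
    ("string",  re.compile(r"'[^']*'|\"[^\"]*\"")),
    ("comment", re.compile(r"#.*$")),
]

def highlight_line(line, rules):
    # Return (token_class, start, end) spans for one line; resolving
    # overlapping spans is left to the renderer.
    spans = []
    for klass, pattern in rules:
        for m in pattern.finditer(line):
            spans.append((klass, m.start(), m.end()))
    return spans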

On the other end of the spectrum is generating the database. The idea is to use an instrumenting compiler, but while that works for C++ or Java, someone whose primary code is a website in Perl or a large Python utility has a hard time writing a compiler. Perhaps the best option here is just parsing the source code when we walk the tree. There is also the question of what to do with the build system: surely people might want help understanding what their Makefile.in is really doing.
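
A sketch of that tree-walking fallback, again with an invented plugin interface:

import os

# Hypothetical fallback for languages with no compiler to instrument:
# parse each recognized file directly as we walk the source tree.
def index_tree(root, plugins_by_extension):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1]
            plugin = plugins_by_extension.get(ext)
            if plugin is not None:
                plugin.index_file(os.path.join(dirpath, name))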

So what does the database look like? For standard programming languages, we appear to have a wide-ranging and clear notion of types/classes, functions, and variables, with slightly vaguer notions of inheritance, macros (in both the lexical preprocessing sense and the type-based sense of C++'s templates), and visibility. Dynamic languages like JavaScript or Python might lack some reliable information (e.g., variables don't have types, although people often still act as if they have implicit type information), but they still uphold this general contract. If you consider instead things like CSS and HTML or Makefiles in the build system, this general scheme completely fails to hold, but you can still desire information in the database: for example, it would help to be able to pinpoint which CSS rules apply to a given HTML element.
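
For the standard-language core, the tables might look something like this sqlite sketch; the column choices are illustrative, not a settled schema:

import sqlite3

conn = sqlite3.connect("dxr.sqlite")
conn.executescript("""
CREATE TABLE types (
    id INTEGER PRIMARY KEY,
    name TEXT, file TEXT, line INTEGER
);
CREATE TABLE functions (
    id INTEGER PRIMARY KEY,
    name TEXT,
    type_id INTEGER,   -- containing type, if any
    file TEXT, line INTEGER
);
CREATE TABLE variables (
    id INTEGER PRIMARY KEY,
    name TEXT,
    type TEXT,         -- may be NULL for dynamic languages
    file TEXT, line INTEGER
);
""")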

This raises the question: how does one handle multiple languages in the database? As I ponder this, I realize that there are multiple domains of knowledge: what is available in one language is not necessarily available in another. Of the languages Mozilla uses, C, assembly, C++, and Objective-C[++] all share the same ability to access any information written in the other languages; contrast this with JS code, which can only interact with native code via a subset of IDL or interactions with native functions. IDL is a third space, a middle ground between native and JS code, but insufficiently compatible with either to be lumped in with one. Options:

  1. Dump each language into the same tables with no distinction. This has problems insofar as some languages can't be shoehorned into the same models, but I think that in such cases, one is probably looking for different enough information anyway that it doesn't matter. The advantage of this is that searching for an identifier will bring it up everywhere. The disadvantage... is that it gets brought up everywhere.
  2. Similar to #1, but make an extra column for language, and let people filter by language.
  3. Going a step further, take the extra language information and build up the notion of different bindings: this foo.bar method on a Python object may be implemented by this Python_foo_bar C binding. In other words, add another table which lists this cross-pollination, and take it into account when searching or providing detailed information.
  4. Instead of the language column in tables, make different tables for every language.
  5. Instead of separate tables, use separate databases?

Hmm. I think the binding cross-reference is important. On closer thought, it's not really the languages themselves that are the issue here; it's essentially the target bindings: if the build system is doing non-trivial work that involves cross-compiling, it matters whether a given piece of code is being built for the host or for the target. Apart from that, I think right now that the best approach is to have different tables.
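
Here is what the cross-pollination table from option #3 might look like, in the same kind of sqlite sketch as before; the example row is the foo.bar/Python_foo_bar pairing from above, and the schema is purely illustrative:

import sqlite3

conn = sqlite3.connect("dxr.sqlite")
conn.executescript("""
CREATE TABLE bindings (
    from_lang TEXT, from_name TEXT,  -- e.g. the scripted side
    to_lang   TEXT, to_name   TEXT   -- e.g. the native implementation
);
""")
# The foo.bar method on a Python object, implemented by a C binding:
conn.execute("INSERT INTO bindings VALUES (?, ?, ?, ?)",
             ("python", "foo.bar", "c", "Python_foo_bar"))
conn.commit()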

Extraneous information

The previous discussion bleeds into this final one, since they both ultimately concern themselves with one thing: the database. This time, the question is how to handle generation of information beyond the "standard" set of information. Information, as I see it, comes in a few forms. There is additional information at the granularity of identifiers (this function consumes 234 bytes of space or this is the documentation for the function), lines (this line was not executed in the test suite), files (this file gets compiled to this binary library), and arguably directories or other concepts not totally mappable to constructs in the source tree (e.g., output libraries).
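
In table form, each granularity could simply hang off the matching key. An illustrative sketch, not a schema proposal:

import sqlite3

conn = sqlite3.connect("dxr.sqlite")
conn.executescript("""
-- identifier granularity, e.g. code size or documentation
CREATE TABLE func_info (func_id INTEGER, key TEXT, value TEXT);
-- line granularity, e.g. test-suite coverage
CREATE TABLE line_info (file TEXT, line INTEGER, key TEXT, value TEXT);
-- file granularity, e.g. which binary a file ends up in
CREATE TABLE file_info (file TEXT, key TEXT, value TEXT);
""")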

The main question here is not the design of the database: it's only a question of extra tables or extra columns (or both!). Instead, the real question is the design of the programmatic mechanisms. In dxr+dehydra, the simple answer is to load multiple scripts. For dxr+clang, however, the question becomes a lot more difficult, since the code is written in C++ and doesn't dynamically load modules the way dehydra does. It also raises the question of what API to expose. On the other hand, I'm not sure I know enough of the problem space to come up with actual solutions. I think I'll leave this one for later.
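
For the dxr+dehydra side, at least, "load multiple scripts" could be as simple as the following sketch; the directory layout and hook convention are invented:

import imp
import os

# Hypothetical dehydra-style loading: every analysis script in a
# directory becomes a module that registers its own hooks with the
# indexer.
def load_analysis_scripts(script_dir):
    plugins = []
    for name in sorted(os.listdir(script_dir)):
        if name.endswith(".py"):
            path = os.path.join(script_dir, name)
            plugins.append(imp.load_source(name[:-3], path))
    return plugins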

7 comments:

Anonymous said...

Yeah, first #1 is good. We can also get rid of SQL; we'll have to eventually if we want to make this sucker perform :)

I doubt dxr needs more than a fast key-value store. The db is only generated once and is read-only after that.

I think making a generic frontend is a hopeless task. We should focus on specific languages, and once someone else wants to add a language we can address that problem when we come to it. Overdesign is worse than no design.

Taras

Mook said...

For the building part, would something like what ccache does work? Make a wrapper to gcc that takes the same arguments, and call it "gcc" and stick it earlier in $PATH.

jmdesp said...

The strength of dxr is being able to do far more than just syntax-based analysis; if you limit yourself to that, there are several other tools that can do the same.
Also, understanding bindings between separate languages and being able to go across them would be a big killer feature. I think for the Mozilla code database it'd be hugely useful to have that for XPCOM-based bindings, but built on an extensible base that can easily be modified to work for other kinds of bindings too.

Pike said...

One thing that has come up out of the l10n community tons of times is being able to find strings, and how they're used.

Finding the source of a localized string just in en-US would be really helpful, and spanning all localizations even more so. And cross-referencing entity references and their definitions inside DTD files would be great. Same for properties files, though I'd have a harder time thinking up a way to find the source.

I'd be happy to help with the analysis step for our DTD and properties files, if you see a point in that.

Joshua Cranmer said...

@Taras: I don't know if I made this clear, but the literal output of the clang indexer is a (or several) giant CSV file(s) which gets plopped into the postprocessor to be turned into the SQL to insert into the database. Once I get things basically working in ship-shape form, I'll spend some time profiling everything to figure out where we're burning the CPU time.

@Mook: I don't trust gcc to be the first compiler people look for. Although I hadn't thought about mucking with $PATH; so far, it's just $CC and $CXX I've mucked with. Still has the same issue with environment paths.

@jmdesp: Insofar as dxr can be language-agnostic, I think cross-language bindings is important. However, I suspect that it's hard to generalize the finding of the bindings...

@Pike: Neat. A related idea would be to try to find usages of *.property strings as well via some static analysis.

Anonymous said...

I would also say "look at ccache", except I hate the bogus gcc wrapper mode of using it.

Look at configure.in's --with-ccache option. It just sets CC="ccache $CC" and likewise for CXX.

So you have a wrapper script named "dxr-build" or whatever that simply sets your env vars and execs the rest of its arguments:

#!/bin/sh
# Set whatever environment DXR needs, then run the build command
# given as arguments in that environment.
export DXR_DOSTUFF=1
exec "$@"

or something. Although if you're really just setting environment variables, then that's too complex. Just make a --with-dxr that does

CC="env DXR_DOSTUFF=1 $CC"
CXX=...

Pike said...

Static analysis for .properties references would be awesome.

We have an existing python library for parsing DTDs and properties with http://hg.mozilla.org/l10n/silme/, too.

Also, feel free to CC me on bugs.