There are three distinct things that need to be made "easy to use". The first is the generation of the database and the subsequent creation of the web interface. The second is extension of DXR to new languages, while the last is the customization of DXR to provide more information. All of them have significant issues.
Building with DXR
Starting with the first one, build systems are complicated. For a simple GNU-ish utility, ./configure && make is a functioning build system. But where DXR is most useful is on the large, hydra-like projects where figuring out how to build the program is itself a nightmare: Mozilla, OpenJDK, Eclipse, etc. There is also a substantial number of non-autoconf based systems which throw great wrenches in everything. At the very least, I know this much about DXR: I need to set environment variables before configuring and building (i.e., replace the compiler), I need to "watch" the build process (i.e., follow warning spew), and I need to do things after the build finishes (post-processing). Potential options:
- Tell the user to run this command before the build and after the build. On the plus side, this means that DXR needs to know absolutely nothing about how the build system works. On the down side, this requires confusing instructions: in particular, since I'm setting environment variables, the user has to specifically type ". " in the shell to get them set up properly. Many people do not have enough shell exposure to understand why that is necessary, and general usage is different enough from the commands that people are liable to make mistakes doing so.
- Guess what the build system looks like and try to do it all by ourselves. This is pretty much the opposite extreme, in that it foists all the work on DXR. If your program is "normal", this won't be a problem. If your program isn't... it will be a world of pain. Also consider that any automated approach is likely to fail hard on Mozilla code to begin with, which effectively makes this a non-starter.
- Have the user input their build system to a configuration file and go from there. A step down from the previous item, but it increases the need for configuration files.
- Have DXR spawn a shell for the build system. Intriguing, solves some problems but causes others.
Conclusion: well, I don't like any of those options. While the goal of essentially being able to "click" DXR and have it Just Work™ is nice, I have reservations about such an approach being able to work in practice. I think I'll go for "#1, and punt on this issue to someone with more experience."
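To make the three requirements above concrete (set environment variables, watch the build output, run something afterwards), here is a minimal sketch of what a wrapper around the build might look like. It is not how DXR does it: the function name `run_instrumented_build`, the `post_build` hook, and the idea of spotting warnings by the substring "warning:" are all assumptions made for illustration.

```python
import os
import subprocess


def run_instrumented_build(build_cmd, compiler_wrapper, post_build=None):
    """Run build_cmd with CC/CXX replaced, collect its output, then post-process.

    build_cmd        -- argument list for the build, e.g. ["make"] (illustrative)
    compiler_wrapper -- path to a hypothetical instrumenting compiler wrapper
    post_build       -- optional callable invoked with the warning lines
                        if the build exits successfully
    """
    # Step 1: set environment variables before building (replace the compiler).
    env = dict(os.environ)
    env["CC"] = compiler_wrapper
    env["CXX"] = compiler_wrapper

    # Step 2: "watch" the build by reading its combined stdout/stderr.
    proc = subprocess.Popen(
        build_cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = []
    for line in proc.stdout:
        lines.append(line.rstrip("\n"))
    returncode = proc.wait()

    # Assumption: warnings are recognizable by the substring "warning:".
    warnings = [line for line in lines if "warning:" in line]

    # Step 3: do things after the build finishes (post-processing).
    if returncode == 0 and post_build is not None:
        post_build(warnings)
    return returncode, lines, warnings
```

Because the environment is set inside the child process, the user never has to source anything into their own shell. The cost is that DXR now has to be told what the build command is, which is exactly the trade-off between the options listed above.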
I could devote an entire week's worth of blog posts to extending DXR to new languages, I think, and I would wager that this is more complicated and nebulous than even build systems are. In the end, all we really need to worry about with build systems is replacing compilers with our versions and getting to the end; with languages, we actually need to be very introspective and invasive to do our job.
Probably the best place to start is actually laying out what needs to be done. If the end goal is to produce the source viewer, then we need to at least be able to do syntax highlighting. That by itself is difficult, but people have done it before: I think my gut preference at this point is to basically ask authors of DXR plugins to expose something akin to vim's syntax highlighting instead of asking them to write a full lexer for their language.
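As a rough illustration of the "something akin to vim's syntax highlighting" idea, a plugin could hand over an ordered table of named regular expressions instead of a lexer. The rule table below is for a made-up toy language, and `RULES` and `highlight` are hypothetical names, not part of any DXR API.

```python
import re

# Hypothetical rule table: (token class, regex). Earlier rules win when
# several could match at the same position.
RULES = [
    ("comment", r"#[^\n]*"),
    ("string", r'"(?:\\.|[^"\\])*"'),
    ("keyword", r"\b(?:def|return|if|else)\b"),
    ("number", r"\b\d+\b"),
]


def highlight(source, rules=RULES):
    """Return (token class, start, end, text) for every span a rule matches."""
    # Combine the rules into one alternation of named groups so a single
    # left-to-right scan classifies the whole file.
    combined = re.compile(
        "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in rules)
    )
    spans = []
    for match in combined.finditer(source):
        # lastgroup is the name of the rule that produced this match.
        spans.append((match.lastgroup, match.start(), match.end(), match.group()))
    return spans
```

Text that no rule matches is simply left unclassified, which is roughly what a plugin author would expect from an editor-style highlighter.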
On the other end of the spectrum is generating the database. The idea is to use an instrumenting compiler, but while that works for C++ or Java, someone whose primary code is a website in Perl or a large Python utility has a hard time writing a compiler. Perhaps the best option here is just parsing the source code when we walk the tree. There is also the question about what to do with the build system: surely people might want help understanding what it is their Makefile.in is really doing.
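For a language like Python, "just parsing the source code when we walk the tree" can be quite cheap, because the standard library already ships a parser. The sketch below uses the `ast` module to pull out function and class definitions; the row layout and the name `index_python_source` are assumptions for illustration only.

```python
import ast


def index_python_source(source, filename="<string>"):
    """Return (kind, name, filename, line) rows for definitions in source."""
    tree = ast.parse(source, filename=filename)
    rows = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            rows.append(("function", node.name, filename, node.lineno))
        elif isinstance(node, ast.ClassDef):
            rows.append(("class", node.name, filename, node.lineno))
    # ast.walk makes no ordering promise we want to rely on, so sort by line.
    return sorted(rows, key=lambda row: row[3])
```

This only finds definitions, not references or types, so it is much weaker than what an instrumenting compiler can report.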
This raises the question: how does one handle multiple languages in the database? As I ponder this, I realize that there are multiple domains of knowledge: what is available in one language is not necessarily available in another. Of the languages Mozilla uses, C, assembly, C++, and Objective-C[++] all share the same ability to access any information written in the other languages; contrast this with JS code, which can only interact with native code via the use of a subset of IDL or interactions with native functions. IDL is a third space, which is a middle ground between native and JS code, but is insufficiently compatible with either to be lumped in with one. Options:
- Dump each language into the same tables with no distinction. This has problems in so far as some languages can't be shoehorned into the same models, but I think that in such cases, one is probably looking for different enough information anyways that it doesn't matter. The advantage of this is that searching for an identifier will bring it up everywhere. The disadvantage... is that it gets brought up everywhere.
- Similar to #1, but make an extra column for language, and let people filter by language.
- Going a step further, take the extra language information and build up the notion of different bindings: this foo.bar method on a Python object may be implemented by this Python_foo_bar C binding. In other words, add another table which lists this cross-pollination and takes it into account when searching or providing detailed information.
- Instead of the language column in tables, make different tables for every language.
- Instead of tables, use databases?
Hmm. I think the binding cross-reference is important. On closer thought, it's not really languages themselves that are the issue here, it's essentially the target bindings: if we have a system that is doing non-trivial build system work that involves cross-compiling, it matters if what we are doing is being done for the host or being done for the target. Apart from that, I think right now that the best approach is to have different tables.
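Here is one way the "different tables plus a binding cross-reference" idea might look, sketched with SQLite. The table and column names are invented for illustration and are not DXR's actual schema; the sample row reuses the foo.bar / Python_foo_bar example from the options above.

```python
import sqlite3


def build_db():
    """Create an in-memory database with one table per language plus bindings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- One table per language (hypothetical column sets).
        CREATE TABLE c_functions (
            id INTEGER PRIMARY KEY, name TEXT, file TEXT, line INTEGER);
        CREATE TABLE python_functions (
            id INTEGER PRIMARY KEY, name TEXT, file TEXT, line INTEGER);
        -- Cross-reference: an entity in one language implemented in another.
        CREATE TABLE bindings (
            from_lang TEXT, from_id INTEGER, to_lang TEXT, to_id INTEGER);
        """
    )
    return conn


def find_c_implementations(conn, python_name):
    """Follow the bindings table from a Python name to its C implementation."""
    cursor = conn.execute(
        """
        SELECT c.name, c.file, c.line
        FROM python_functions AS p
        JOIN bindings AS b ON b.from_lang = 'python' AND b.from_id = p.id
        JOIN c_functions AS c ON b.to_lang = 'c' AND b.to_id = c.id
        WHERE p.name = ?
        """,
        (python_name,),
    )
    return cursor.fetchall()
```

A host-versus-target distinction could be modeled the same way, as another column or another cross-reference table, but that is speculation on top of speculation.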
The previous discussion bleeds into this final one, since they both ultimately concern themselves with one thing: the database. This time, the question is how to handle generation of information beyond the "standard" set of information. Information, as I see it, comes in a few forms. There is additional information at the granularity of identifiers (this function consumes 234 bytes of space or this is the documentation for the function), lines (this line was not executed in the test suite), files (this file gets compiled to this binary library), and arguably directories or other concepts not totally mappable to constructs in the source tree (e.g., output libraries).
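The granularities listed above can be pictured as a small data model. This is only a sketch to make the categories concrete: `Annotation`, `AnnotationStore`, and the key names are all hypothetical, and the sample values are the ones used as examples in the paragraph above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The kinds of thing extra information can attach to, per the text above.
GRANULARITIES = ("identifier", "line", "file", "other")


@dataclass(frozen=True)
class Annotation:
    granularity: str  # one of GRANULARITIES
    target: str       # e.g. a function name, "file.c:12", or a file path
    key: str          # what kind of information this is
    value: str        # the information itself


class AnnotationStore:
    """In-memory stand-in for the extra tables or columns."""

    def __init__(self):
        self._by_target = defaultdict(list)

    def add(self, annotation):
        if annotation.granularity not in GRANULARITIES:
            raise ValueError("unknown granularity: %r" % annotation.granularity)
        self._by_target[(annotation.granularity, annotation.target)].append(annotation)

    def lookup(self, granularity, target):
        # Return a copy so callers cannot mutate the store by accident.
        return list(self._by_target.get((granularity, target), []))
```

For example, `store.add(Annotation("identifier", "some_function", "code_size_bytes", "234"))` would record the size example, and a coverage tool could add `Annotation("line", "file.c:12", "executed_in_tests", "no")` in the same way.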
The main question here is not on the design of the database: it's only a question of extra tables or extra columns (or both!). Instead, the real question is in the design of the programmatic mechanisms. In dxr+dehydra, the simple answer is to load multiple scripts. For dxr+clang, however, the question becomes a lot more difficult, since the code is written in C++ and doesn't dynamically load modules the way dehydra does. It also raises the question of the exposed API. On the other hand, I'm not sure I know enough of the problem space to be able to actually come up with solutions. I think I'll leave this one for later.