Thursday, June 2, 2011

Visualization toolkits suck

This is a rant in prelude to a blog post I expect to make tomorrow morning (or later today, depending on your current timezone). As someone who is mildly interested in data, I spend a fair amount of time thinking up things that would be interesting to just see dumped out visually. And I'm not particularly interested in small datasets: one of my smaller datasets I've been playing with is "every symbol in"; one of my larger datasets is "every news message posted to Usenet this year, excluding binaries" (note: this is 13 GiB worth of data, and that's not 100% complete).

So while I have a fair amount of data, what I need is a simple way to visualize it. If what I have is a simple bar chart or scatter plot, it's possible to put it in Excel or LibreOffice... only to watch those programs choke on a mere few thousand data points (a 120K input file caused LibreOffice to hang for about 5 minutes before producing the scatter plot). But spreadsheet programs clearly lack the power for serious information visualization; the most glaring omission is the box-and-whiskers plot. Of course, if the data I'm looking at isn't simple 1D or 2D data, then what I really need can't be satisfied by them anyway.
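(For the record, a box-and-whiskers plot only needs a five-number summary — minimum, lower quartile, median, upper quartile, maximum — which is cheap to compute even on large inputs. A minimal Python sketch; note that the quartile convention used here is Tukey's "exclude the median" hinges, and other tools may choose differently:

```python
def five_number_summary(data):
    """Compute (min, Q1, median, Q3, max) for a box-and-whiskers plot.

    Quartiles are Tukey's hinges: the medians of the lower and upper
    halves of the sorted data, excluding the overall median when the
    input length is odd.
    """
    xs = sorted(data)

    def median(v):
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

    n = len(xs)
    lower = xs[:n // 2]          # values below the overall median
    upper = xs[(n + 1) // 2:]    # values above the overall median
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])
```

matplotlib's `pyplot.boxplot` will draw one of these directly from raw data — exactly the kind of one-liner the spreadsheet programs are missing.)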

I also need to visualize large tree data, typically in a tree map. Since writing the code for squarified tree maps is more than I care to do for a simple vis project, I'd rather just use an existing toolkit. But which one? Java-based toolkits (e.g., prefuse) require me to sit down and build a full-fledged application for what should be a quick data visualization, and I don't know any Flash, so Flash toolkits (e.g., flare) are out. For JavaScript, I've tried the JavaScript InfoVis Toolkit, Google's toolkit, and protovis, all without much luck. JIT and protovis both require too much baggage for a simple "what does this data look like?", and Google's API is too inflexible to do anything more than "ooh, pretty treemap". That's why my previous foray into an application for viewing code coverage used a Java applet: it was the only thing I could get working.
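(For context, the heart of the squarified treemap layout is just a greedy heuristic: keep adding areas to the current strip as long as doing so improves the strip's worst aspect ratio, and start a new strip when it stops improving. A rough Python sketch of that decision — simplified by assuming a fixed strip length, where the full algorithm re-measures the shorter side of the remaining rectangle after each strip:

```python
def worst_aspect(row, length):
    """Worst (most elongated) aspect ratio among the rectangles
    produced by laying the areas in `row` along a side of the
    given length, as one strip."""
    s = sum(row)
    return max(max(length * length * a / (s * s),
                   (s * s) / (length * length * a))
               for a in row)

def build_rows(areas, length):
    """Greedily group areas into strips: extend the current strip
    while the worst aspect ratio does not get worse."""
    rows, row = [], []
    for a in areas:
        if not row or worst_aspect(row + [a], length) <= worst_aspect(row, length):
            row.append(a)
        else:
            rows.append(row)
            row = [a]
    rows.append(row)
    return rows
```

Not hard, but it's exactly the kind of code I'd rather get from a toolkit than rewrite for every one-off visualization.)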

What I want is a toolkit that gracefully supports large datasets, allows me to easily drop data in and play with views (preferably one that doesn't try to dynamically redraw the display on every partial options change, like most office suites attempt), supports a variety of datasets, and has a fine-grained level of customizability. Kind of like SpotFire or Tableau, just a tad more in my price range. Ideally, it would be easy to put on the web, too, although supporting crappy old IE versions isn't a feature I need. Is that really too much to ask for?


glandium said...

Maybe this can help you.

I've been meaning to do that for a while, but failed to find the time for it (or when I had time, failed to remember it and focused on something else).

Anonymous said...

Why not use a good data warehouse package?

wabik said...

Try gnuplot


Patrick Cloke said...

Why not use something such as MATLAB that's meant to handle large amounts of data? (Yes, MATLAB isn't free, but there are alternatives: GNU Octave, which uses gnuplot; FreeMat; Scilab.)


Pike said...

I played with protovis last weekend. It was one very frustrating day, and one good one.

The first day shattered all my expectations; on the second I made some bold assumptions and learned which parts of the code to copy and tweak to make it do what I want, at least somewhat.

D3 is even more low-level.

But for really large datasets, the magic is probably in figuring out what not to draw, and I guess that's something you need to do yourself in most toolkits today. I know we have folks at Mozilla who work with statistical data languages — IIRC "R" is one — but I've never had to go there myself.

Anonymous said...

Try using something that scientists actually use. E.g., Mathematica has extremely flexible and easy-to-use graphics that'll let you make completely new types of graphs in a few minutes; no toolkit made by/for *programmers* offers a comparable combination of flexibility and ease of use.

Preprocess your data to extract the relevant information, then import this into your visualization program. No easy-to-use program will ever be able to deal with overly large data. E.g., if you work with post frequencies, extract those from your 13 GB of data, and use only the few megabytes you're left with. Generally you'll need to process the large dataset sequentially to extract the important data for the task at hand; once the amount of data has been sufficiently reduced, you'll be able to load all of it into memory with your vis program.
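(The streaming extraction described above can be sketched in a few lines of Python. The `Newsgroups:` header as the field of interest is an assumption for illustration; the point is simply that one sequential pass keeps memory proportional to the number of groups, not to the 13 GB corpus:

```python
from collections import Counter

def group_frequencies(lines):
    """One sequential pass over a message dump: count posts per
    newsgroup without ever holding the full corpus in memory."""
    counts = Counter()
    for line in lines:
        if line.startswith("Newsgroups: "):
            groups = line[len("Newsgroups: "):].strip()
            # Cross-posted messages list several groups, comma-separated.
            for group in groups.split(","):
                counts[group.strip()] += 1
    return counts
```

The resulting counter is at most a few megabytes and will load into essentially any plotting tool.)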

Janet S. said...

Have you looked at Matplotlib for 2D and MayaVi2 for 3D data?

Leonardo Santagada said...

I would recommend either processing[1] or a tool like Sage[2] or Mathematica. I would first try Sage if you want common 2D plots, because Python is a lot easier than the Mathematica language.


Daniel Einspanjer said...

The Mozilla Metrics team has a few tools that we have layered on top of Protovis and such to make datavis a bit easier. If you are interested, pop into the #metrics channel and chat with us about it.

Joshua Cranmer said...

Part of the issue I have is that I don't want to write visualization code myself: I've done it before, but I have better things to do in life than sit down and write, e.g., a box plot drawing program.

I've also tried to use processing and protovis before, with less than thrilling results. Good pointer to the metrics team, though; I'll look into that.