A good idea would be to measure each iteration separately and then discard outliers by e.g. discarding those that exceed the abs diff between the mean and the stddev.
I leave it as an exercise to the reader to figure out why that's not a good idea. In any case, the last unit of the class dealt with the fun part of statistics, which is to actually evaluate whether observed data is statistically significant. The actual math involved isn't too bad, assuming you have someone spit out a cumulative distribution function for the t-distribution for you (here is Boost's code), but it's a bit convoluted to write here, so read Wikipedia's page. The correct tests are actually the two-sampled t-tests.
But I got to thinking. One thing I've wanted to see for a while is a profiling extension, one that would run some JS code snippet multiple times and produce various reports on profiling runs. Wouldn't it be nice if such an extension could compare two runs and determine if they was a statistically significant difference between them?