Tuesday, January 10, 2012

How bugs get fixed

Recently, I have had the misfortune (er, opportunity) to fix bug 695309. This bug is perhaps a good exemplar of why "obvious" bugs take weeks or months to fix; it is also a good example of why merely reporting that a bug occurs for the filer is insufficient to get it fixed. In the hope that others might find this useful, I will explain, in a fake liveblogging format, how I fixed the bug (fake because I'm writing most of these entries well after they happened).

October 20

A bug report that "Thunderbird sometimes marks entire newsgroups as unread" shows up in my email. Time to open up the bug report… reported in version 8… I'm pretty sure I've seen this once or twice before, so I don't think it's just a "the news server borked itself" issue. Time to file it away until I have more time.

November 12

Another comment reminded me that I have this tab open. I've definitely seen it a few times, but I need to remember to keep Thunderbird open with logging enabled to figure out what's going on. It's reported as a regression from version 8, when I checked in a slew of NNTP URI parsing patches, so that seems like a probable target to look at. The question is why URI parsing would cause an intermittent problem instead of a constant one.

December 3

Some NNTP logs arrive. Nothing is obviously amiss (not terribly unexpected, since logging tends either to drown you in useless information or to omit the things you really want to know). Note after the fact: the NNTP logs in fact contained the smoking gun; it's just that the poster trimmed it out.

December 6

Bienvenu comments that no developer has seen the problem; this isn't true, as I have seen it (but had just taken to avoiding it as much as possible). At some point, I figured out that the issue could be avoided by shutting down Thunderbird before putting my laptop to sleep. After being prodded over IRC, I finally sat down and attempted to debug it. The working thesis is that newsrc files are getting corrupted, so I faithfully save several copies of the newsrc for confirmation. The msf files are also generally good candidates, but since other flags are untouched, they are highly unlikely culprits in this bug.

I successfully trigger the bug once. The reports are mixed: the newsrc didn't get trashed as expected at first, but later it did. However, there is a brief window of time after the bug happens which allows you to fix it, if you shut down Thunderbird. Knowing what I now know, this is because the newsrc file is written out on a timer, so the newly-all-unread messages wouldn't have been saved to disk to show the bug. Since I always think of what to log after the fact, I try to trigger it a few more times.

None of the later tests are quite so successful as the first one. It does start to dawn on me that the bug probably has two parts. There is a first step that puts the group into a pre-bug situation; a second step actually produces the symptoms of the bug. Omniscient addendum: the first step is where the bug happens; the second step is just the delay in getting the bug to occur. I report my findings, and subsequent comments all but confirm my hypothesis for a necessary component.

December 12

This bug seems to be happening for me only when I don't want it to happen. I thought of more things to test earlier, and had them ready to copy-paste into my error console to try to find a smoking gun. This test confirms that the fault originates with the newsrcLine (so something reset that); more investigation leads to only one truly likely scenario to cause this to happen. At this point, I am all but convinced that the bug happens in two parts. All of the debugging needs to focus on when the first part happens; the symptoms (everything being marked read) are a result of Thunderbird attempting to repair the damage that had already been done. Since most people are going to try to report based on when they see the problems, I'm probably not going to get anything useful.

December 13

Hey, I found it again. Tests confirm that the high water number was first set to 0. This means either we have memory corruption or we are setting the value incorrectly initially. Looking at the code, this value is set from a call to sscanf. A simple test confirms that the output value stays 0 if the format never actually scans a number. Now all I need to do is figure out what network traffic can cause this. Time to try to trigger it with xpcshell tests. And then get frustrated because I still can't do it reliably.

December 18

With Wireshark (logging on Windows is rather annoying), I finally can track down the network traffic of interest. It turns out that we are off-by-one in terms of responses. This really should be enough to get xpcshell to report the error. However, it also means I probably should finally give up and go for the NNTP logs again to catch the error: it is painfully obvious that the problem is that the internal state is futzed up. This also means that other various newly-reported issues (more frequent auth failures, various alerts like "Not in a newsgroup", etc.) are pretty much confirmed to be the same bug.

January 9

I finally buckled down and turned my very old hackish code for NSPR log management into an extension. This means that I don't have to futz with environment variables on Windows, and it also gives me an easy way to trim down my log size (since unlike environment variables, I can stop logging). I test and finally get the NNTP log of interest.

Now that I have a log, I can work backwards and figure out how this bug happens. The failure is easily ascribable to somehow thinking we've already opened the socket when we open a socket; this much I have known for almost a month (Wireshark told me). What the NNTP log gives me is a more fine-grained ability to trace the code after the fact and figure out where the bug is.

Working backwards from the log

NNTP logs helpfully record the address of the NNTP protocol object when emitting output, so the best place to start is from the first line mentioning that object. Since the numbers are hard to read, I replace the addresses with a more obvious string like "evil." The next state was set to SEND_FIRST_NNTP_COMMAND, which clearly should have been NNTP_LOGIN_RESPONSE, so let's see why it might be skipped. The latter is chosen only if m_socketIsOpen is false, so perhaps something is opening the socket earlier than it should.

This variable is set only by nsMsgProtocol, specifically its LoadUrl method. So who might call that? Still nothing out of the ordinary is popping out at me, so it's time to return to the logs. (A particularly astute developer might be able to spot the bug without returning to the log here, but I doubt most people would come up with the final flash of insight yet.)

To figure out the problem, it's important (to me, at least) to set out the time frames. The natural sequence of events for a connection is to be created, connected to the server, and then go through several iterations of running URLs. Is this bad connection a new connection or an old one (since I started the log after TB started up, I can't guarantee that this is truly a new connection)? Thunderbird stupidly (but helpfully, in this case) tells me when the connection is created, since we get a log message of "creating" in the constructor. Since this message doesn't appear, it's being reused.

Wait: that message is there, tucked just underneath the lines that tell me LoadUrl is run. As I am often wont to do, I announce my revelation to IRC: "< jcranmer> bienvenu: please slap me". The code for the constructor is pretty simple, but there is one little function call that ends up causing the problem. If I tell you the exact revision that caused the regression (specifically, the first hunk of that patch), can you find the regression? Click me if you can't.

Reproducing the bug

Now that I know the exact sequence of events to cause the bug, I can write a test to reliably reproduce the bug. I can also explain why the symptoms occur. First, a connection's socket dies, but it doesn't fail the timeout check in mailnews code, so it remains in the cache. Next, the server schedules a URL on the connection, which promptly dies. If there are many more URLs to be run (when we're doing, say, a timed update), then we have several elements in the queue waiting to be processed. Now, we attempt to run a new URL; with no connections available in the cache, we now create a new connection and run the new URL on that. Since the queue is not empty, the SetIsBusy(false) call ends up pulling a queued URL and setting up the state (including opening up the connection) and running that. Then the constructor finishes, and the new URL for which the connection was constructed is used to initialize it. Since the connection is by now open, the code moves straight to the run-my-URL part of the state machine, which misinterprets the login response as the answer to the command it just sent. This ends up leading to all of the various badness subsequently observed.

The description of these events is pretty complicated. A condensed version is added to the test I constructed to reproduce it, which is called, fittingly enough, test_bug695309.js. I am not going to try to come up with a better name, as I think most people might agree that this is a case where naming tests after bug numbers is most desirable. Naturally, the code to actually trigger this is complicated: I started with code to trigger just the unread effect, and then switched to monitoring the highwater mark instead (one less download of messages). I need to fill up the connection cache and then figure out how to kill the connections (restarting the server seems to do an adequate job). After that, I need to run enough URLs to cause the connections to queue, and then trigger a load of the folder in question at precisely the right moment. All of this requires paying careful attention to what happens synchronously and what does not, and it all relies on knowing the precise steps that will result from each action. Any of several changes I could make in the future will probably "break" the test, in that it will cease to test anything meaningful.

Postmortem on fixing it

I'm sure some people might look at this bug and ask several questions. Let me try to anticipate some of them and answer them.

Why did the reviewers fail to notice this change?
This bug is a result of the confluence of several changes. In the beginning, SetIsBusy was mostly a no-op, so calling it from the constructor was safe. Then the URL queue was added, and SetIsBusy was used to indicate that the connection was ready to receive a new URL from the queue. This turned it into a time bomb, since the code was correct only because the server was null in the constructor and the queue should have been empty during connection construction anyways. The final change was moving initialization of the server, which triggered the time bomb. But even then, it triggered the bomb only in a rather abnormal circumstance. At no time would the code have failed any automated tests or any of the "does your patch work" tests that a reviewer normally goes through: robustness during connection interruption is both sufficiently difficult to test and rare enough that it generally doesn't get tested.
Why did this bug take so long to track down?
The delay since December 1 is truly a result of lack of time: I have had maybe one good week to work on anything, due to outside factors. There are two, maybe three, people actively participating in Mozilla who know this code well, and of those, only one was able to somewhat reliably reproduce it (specifically, the one with the least time on his hands). I could have relied on others to gather the information I needed, but my previous efforts in QA have taught me that getting people to give you the necessary information is often very difficult, especially when the information collectable at the obvious point in time is completely useless. I realize in retrospect that there were some more technically inclined people in the bug (but by then I was already devoting as much attention as I felt I could spare to it anyways). The bug is also particularly pernicious in that the most valuable information is that which is gathered before any visible symptoms occur; it was only after I figured out how to detect that the bug was going to occur that I could track it down further.
Will the fix be backported to more stable branches?
At this stage, I am looking at two fixes: the small one-liner that fixes the specific bug, and the larger patch that fixes the broader issue of "the NNTP URL connection queue is a ticking time bomb". The former I would definitely like to see backported to aurora and beta (and possibly release, if there are plans for another one before the next uplift), and I can't see any reason a small patch on the most-highly-voted Thunderbird bug filed since 2011-01-01 would get rejected (it is the newest bug to have more than 5 votes, and the runner-up has a third fewer votes and is almost twice as old).
Why is this bug so hard to reproduce?
If you think about the cause of the bug, the buggy check sits in a function that is always called. It therefore seems like this bug should trigger a majority of the time, not an undependable minority. As I alluded to earlier, the problem is that it also has to interact with a buggy pending URL queue management system. In short, most of the time this queue will be empty during the initial, buggy call; only with poor timing of events (less likely on most developers' computers, in short) will the queue be non-empty and cause the problem. Indeed, when crafting my test for reproduction, I discovered that letting the event loop spin in one particular location (after killing connections) would fail to trigger the bug. In the real world, that is one of the places where the event loop is most liable to spin the longest.