Heisenbug, noun
"A computer error that goes away every time you look at it closely, making it difficult to diagnose."

As, for example, spending the evening trying to figure out why your Internet connection has become flaky -- only to realize in the morning that the cable modem is plugged into the outlet controlled by the light switch. D'oh...

(Deleted comment)
My worst-ever bug was while working at Looking Glass, and was kind of an "anti-Heisenbug". It was the legendary "clock slowing down" bug. The symptoms were very simple and clear, but incredibly bizarre: after playing the game, over the course of the next week or so, you'd begin to lose time from the Windows system clock. If left long enough, the system would slowly grind to a near-halt (to the point where even mouse movement became sluggish) -- despite the Windows Task Manager claiming that the CPU wasn't doing *anything*.

I lost over a *month* to that goddamn bug, and never did get Microsoft to admit that it was their fault, but my online research basically came to the conclusion that I'd hit a bug deep inside the OS, specifically in the sound libraries.

I don't remember the full details, but when you started using this particular media library, it would begin a timer in the kernel that would occasionally wake up, scan an internal linked list, pop the first thing from the list and add a new one to the end. Problem was, if you killed the process without properly closing that library (because the program crashed or -- as happened all the time for us -- if you stopped it in the debugger), it would stop *removing* things from the list, but would keep *adding* to it. This only happened a few times per second, but it eventually added up to scanning a linear linked list that was *millions* of elements long, several times a second. And of course, since it was in the kernel, the OS didn't think anything at all was happening, but it would begin to miss OS events like mouse movement and clock ticks since it was spending all its time scanning the list.
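To make the growth pattern concrete, here is a toy Python simulation of the behavior as described. It is only a sketch: the names (`simulate`, `owner_alive_until`) are invented for illustration, and the real kernel code is not known. Each timer tick appends one entry, and an entry is popped only while the owning process is still alive.

```python
from collections import deque


def simulate(ticks, owner_alive_until):
    """Return the list length after each timer tick.

    While the owning process is alive, each tick pops one entry and
    appends one. Once it dies without cleanup, only the append happens.
    """
    pending = deque([0])
    lengths = []
    for tick in range(ticks):
        if tick < owner_alive_until and pending:
            pending.popleft()  # the step that stops after an unclean exit
        pending.append(tick)
        lengths.append(len(pending))
    return lengths


lengths = simulate(ticks=1000, owner_alive_until=100)
# If every tick also scans the whole list, total work is the sum of lengths,
# which grows quadratically once the pops stop.
scan_work = sum(lengths)
```

The list stays at one entry while the owner is alive, then gains one entry per tick forever, so a full scan on every tick gets steadily more expensive.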

It was an "anti-Heisenbug" in that, as I eventually figured out, it *only* happened if you were debugging. Fixing it in the shipping product was trivial: I just made sure that we had an outer exception catch, and always shut down that damned Windows library properly. But even if we hadn't fixed it, I suspect it would never have been noticed in the field. It was aggravating to realize that the bug only really showed up when debugging, and that the only solution was, really and truly, to just reboot eff'ing Windows every now and then.
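The "outer exception catch" fix might look something like the following in Python. This is a hedged sketch, not the original code: `MediaLibrary` and `run_game` are hypothetical stand-ins, and the real fix was presumably written in C or C++ against the Windows API.

```python
class MediaLibrary:
    """Hypothetical stand-in for the sound library."""

    def __init__(self):
        self.is_open = False

    def start(self):
        self.is_open = True

    def shutdown(self):
        self.is_open = False


def run_game(library, game_loop):
    """Run the game, shutting the library down even if the loop raises."""
    library.start()
    try:
        game_loop()
    finally:
        # Runs on normal exit and on any exception that unwinds through here.
        library.shutdown()
```

Note that this only helps when the process gets a chance to unwind. A process killed from the debugger never reaches the cleanup, which matches the observation that the bug only showed up while debugging.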

(Of course, this was around the time that we found out that it was physically impossible to run Windows for more than two months before an internal timer would roll over and crash the OS. Far as we could tell, nobody had ever hit this because it was nearly impossible to successfully run Windows for that long...)
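The comment doesn't say which timer was involved. Assuming it was a 32-bit counter of milliseconds since boot (an assumption, though it matches the widely reported 49.7-day uptime limit in Windows 95 and 98), the arithmetic works out as in this small sketch:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_rollover(bits):
    """Days before an unsigned millisecond counter of this width wraps."""
    return 2 ** bits / MS_PER_DAY


def tick(counter, bits=32):
    """Advance the counter by one millisecond, wrapping on overflow."""
    return (counter + 1) % (2 ** bits)


days = days_until_rollover(32)    # roughly 49.7 days
wrapped = tick(2 ** 32 - 1)       # the largest value wraps back to 0
```

Any code that assumes the counter only ever increases will misbehave at the wrap, which is one plausible way for a rollover to take the system down.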

*blink* If it was that buggy, how on earth did Windows manage to keep selling stuff? These days it's big through sheer force of inertia, but back in the day, wouldn't Linux, Windows, and Apple all have been on a reasonably equal footing?

Heh -- sorry, but sometimes you do show your youth. This was 1999: Linux barely existed yet, and nobody even believed it was possible for it to be used as a consumer-level OS. Linux (like all Unix derivatives) was used for servers and embedded systems, not consumer boxen. The geeks loved it, but far less than 1% of the population had even heard of it.

As for Apple, it was the usual story: Apple's "my way or the highway" attitude turned a lot of folks off, and the variety of hardware and software available for it was a small fraction of what you could do with Windows. It was basically a comfortable high-end niche player.

So basically, there was no competition to speak of -- indeed, that "sheer force of inertia" was much, much stronger at that point. (*Now*, Windows is in serious trouble, because iOS and Android have badly disrupted the assumptions underlying it.) Everyone knew that Windows was crappy (certainly all the programmers did), but it was what the consumers had, so we targeted it. And since all the programs ran on Windows, all the consumers bought it. It was very sweet for Microsoft, for quite a while...

(Deleted comment)
Unix, at least the version that ran on Sun-3 hardware, had a similar problem: a log file that grew continuously, for which the simple solution was to reboot every couple of months (before the system froze).

Of course, the big difference was that a lot of people knew about this problem, because otherwise you could keep the system up indefinitely.

(Deleted comment)
Ouch. Yeah, that's a pretty brutal and very classic Heisenbug. While it's not quite universal, I do usually assume that a Heisenbug involves concurrency in *some* fashion: that's probably true 95% of the time.

I confess to being morbidly curious what they were doing to allow level 2 re-entrancy but not level 3 -- that smells like somebody hacked something horribly...

(Deleted comment)
True. My very first summer job, working for my father, was to add a printer subsystem for the POS system we were building. We had a total of 16k of EEPROM, not expandable, and that was mostly already full. I pretty much had to invent refactoring from first principles (with some suggestions from Dad, of course), just to clear enough space to get the job done.

Fun times: it was my first experience of "they didn't tell me it was impossible when I got the assignment"...

You two have reminded me how much I liked listening to you and others talk programming after dance practice.

Happy to provide entertainment. War stories are fun to pass around, regardless of the context...

It was not only entertainment, but I learned a lot about how programmers interact, which meant I was highly employable as support staff, so it had actual benefits!

Plus 95% of my sweethearts have been techies...

You are reminding me of the link() system call bug I ran across years ago. It was insufficiently atomic, resulting sometimes in a caller being told it had won a race it had not.
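For context, link() is often used as a lock primitive because creating the new name is supposed to either succeed for exactly one caller or fail. One well-known defensive pattern, sketched below in Python, is to not trust the return value alone and to check the link count afterwards. The function name `try_lock` and the file names are illustrative, and nothing here is claimed to be the workaround used for the bug described above.

```python
import os
import tempfile


def try_lock(lock_path, unique_path):
    """Try to take lock_path by hard-linking a private file to it."""
    with open(unique_path, "w") as f:
        f.write(str(os.getpid()))
    try:
        os.link(unique_path, lock_path)
    except FileExistsError:
        pass  # someone else holds the lock; the check below confirms it
    # We hold the lock only if our private file now has two names.
    won = os.stat(unique_path).st_nlink == 2
    os.unlink(unique_path)
    return won


with tempfile.TemporaryDirectory() as d:
    lock = os.path.join(d, "lock")
    first = try_lock(lock, os.path.join(d, "a.tmp"))
    second = try_lock(lock, os.path.join(d, "b.tmp"))
```

Here the first caller wins and the second loses. The link-count check is what guards against being told you won when no link was actually created.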

Ouch. Good reminder of why I've become so deeply enamored of the Actor Model for -- well, pretty much anything of scale. Having an architecture where, if you follow the rules (not a trivial caveat), threading issues simply don't arise is a *huge* comfort.

But even there, I'm still dependent on the system libraries not sabotaging things from the get-go...
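As a rough illustration of the Actor Model point, here is a minimal actor in Python using only the standard library. It is a sketch rather than any particular actor framework, and `CounterActor` and its message strings are made up for the example. The "rule" is that only the actor's own thread touches its state, and everyone else communicates by sending messages.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, changed only by messages from its mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread while it runs
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1

    def stop_and_get(self):
        """Stop the actor and return its final count."""
        self.send("stop")
        self._thread.join()  # after join, reading the state is safe
        return self._count


def sender(actor, n):
    for _ in range(n):
        actor.send("increment")


actor = CounterActor()
senders = [threading.Thread(target=sender, args=(actor, 1000)) for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()
total = actor.stop_and_get()
```

Four threads send concurrently, yet no lock is needed around the counter, because the mailbox serializes every update onto one thread.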

(Deleted comment)