22

Finding bugs is part of a programmer's lifestyle. But some bugs are just plain weird, and the solutions to them are unintuitive.

Post stories that have happened to you or someone else that involve these types of bugs. If you want, you can post links to great programming bug stories.

One bug story per post, but you can post as many war stories as you want.

48

The 500-mile email bug refers to a case where users had trouble sending email to destinations more than 500 miles away. If the destination was within 500 miles, the email would be sent.

22

This file will jam your printer. The bug report was simple:

Print the attached file. The LPS-20 will jam. You'll have to open the printer to remove the scrunched up paper.

16

A couple of years ago I started work at a company that mandated Internet Explorer 6, doing ASP.NET. I set up my machine as normal, and the first time I tried to View | Source I got nothing - no error message, nothing at all. No matter what we tried (reinstall everything, wipe-and-drop etc.) no one could figure it out or fix it, and for 6 months I did web development without being able to view source HTML (a real challenge, I have to tell you).

One day I stumbled across a Usenet posting that revealed the problem: if you have a shortcut named "Notepad" on your desktop, View | Source in IE6 doesn't work. It doesn't even have to point to Notepad.exe, just be named "Notepad". I renamed my Notepad shortcut to NotepadX and was able to work normally from that point on.

I'm hoping that some current or former MS employee will read this and smack his forehead when he remembers a weird piece of code he forgot to comment out. I can't imagine why else something like this would happen.

12

Back in the Atari ST days, it was quite popular for people to make "bootsector" demos, showcasing some sort of graphical effect shoehorned into the first sector of a floppy disk (512 bytes) so that the effect would be shown when you booted from that floppy.

As you might imagine, these demos became quite elaborate over time, and it wasn't long before people started making bootsector demos that could remove all the borders on the ST (also called "full overscan"), as well as showing a nice graphical effect. Removing the borders on an ST involved exploiting a bug in the video chip and some extremely careful timing, so doing this and showing a nice graphical effect, all in 512 bytes (less, if you count the bits of the bootsector you couldn't change) was all the rage at one point.

Of course, I had to get in on the action, so I made my own "full overscan" bootsector demo, and was so proud of it that I put it on quite a few of the floppies I used regularly, and also gave it to some friends who did the same.

Then, one day, my friend phoned me and said the demo had gone a bit wonky on one of his disks (it's a bit hard to describe -- it looked like the demo, but somehow not). During the course of trying to find out what was wrong with it, he tried another disk, and it was doing the same thing. And another. And another -- all the disks he had with the bootsector on them, in fact.

So I tried my disks. Same problem. On all of them. After much messing around, getting nowhere, we removed the bootsector from most (but not all) disks, and I spent the next few days trying to find out what was wrong with it.

After a few days, I got it working again. Now, this was before I'd ever heard of things like source control, so I wasn't sure what I'd changed to fix it in those few days, just that I somehow had. Remember, this overscan effect required perfect timing, so making demos like this work involved a lot of semi-mindless fiddling with the code and instruction timings, etc.

So I phoned my friend and told him I'd fixed it. While we were talking, I tried the disk just to double check it was ok, which it was. Until I realised: this was a different disk, one I hadn't "fixed".

Not quite sure what was going on, my friend tried one of his remaining disks with the old code on it. It worked.

As it turned out, the new code was the same as the old code. My days of fiddling with instruction timings had led me back to where I started, and yet it was now working.

All the disks we had that contained the bootsector were now working again.

They continued working for some time, maybe a week or so, and then, one day, all of them stopped working again. And then they started working again, and so on. At any given time, they either all worked, or none of them did.

We never did find out why this was happening. We eventually put it down to some periodic effect at the hardware level that affected how the overscan trick behaved. We were never 100% convinced by this explanation, but it was more comforting than the prospect of all instances of the bootsector being entangled in some spooky, quantum way, which was our next best guess. :)

12

Born at the wrong time in the wrong place.

I once got a bug report from a bank saying that a customer's birth date was not accepted when trying to open an account. They'd tried and found that any date within a range of about a month in the summer of 1945 was not accepted. This was a German bank, and the application was written in Java.

I could reproduce the bug and found that the date was rejected at a very low technical level in the Calendar class (long before any domain validation happened), just as if you'd entered the 30th of February. Some debugging sessions later I found that the Calendar class calculates a lot of internal date and time fields, and the daylight savings time field contained a value of 2 hours, which was rejected by internal sanity checks.

The name of that field led me to a Java API bug report which explained everything: the Locale for Germany is "centered" on Berlin, and in the summer of 1945 Berlin and the Soviet-occupied zone of Germany actually did have a 2 hour daylight savings time (which happened to be identical to Moscow time). Some smartass in the Java development team had "corrected" the sanity check in Java 1.4 because he believed 1 hour DST to be the maximum, but Berlin is in fact not the only timezone which had a 2 hour DST at one time or another. The bug was fixed in Java 1.5.

10

I'm glad to say I don't have any personal experience with this one, but I found the Patriot Missile Failure to be an interesting read.

Quick Summary: A timing bug in the Patriot Missile Defense System required that the computers be rebooted periodically. One of the calculations was dependent on the time since system boot, and after about 100 hours of continuous operation the system would miss its targets, which were incoming Scud missiles.

I work for a defense contractor, and this sort of thing helps me stay focused.

More information here.
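
For anyone wondering how a time-since-boot calculation can drift that far, here is a rough sketch in C. It assumes the commonly cited explanation from published analyses of the failure (it is not part of the summary above): uptime was counted in tenths of a second, and the constant 0.1 was stored truncated to 23 fractional binary bits. The function names are made up for illustration.

```c
#include <stdint.h>

/* Assumption: 0.1 s is stored in fixed point with 23 fractional bits,
 * chopped (not rounded). Returns the error per tick in seconds. */
double tick_error_seconds(void) {
    const double scale = 8388608.0;                 /* 2^23 */
    uint32_t chopped = (uint32_t)(0.1 * scale);     /* truncates toward zero */
    double stored = (double)chopped / scale;        /* value actually used */
    return 0.1 - stored;
}

/* Accumulated clock drift after the given uptime, counting 10 ticks per second. */
double drift_after_hours(double hours) {
    double ticks = hours * 3600.0 * 10.0;
    return ticks * tick_error_seconds();
}
```

Under those assumptions, drift_after_hours(100.0) comes out at roughly a third of a second, which is a long time when the target is an incoming missile.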

10

I once heard a story, I think from the offices of Be, Inc. (I can't remember the full details now, nor can I find a link), about a mysterious crash that occurred every day around 6 or 7 AM. There was nothing special scheduled to run then, and it only happened on a specific machine. One day they decided to have someone get up early and see for themselves, and they found that the box was situated in such a way that for a tiny window of time every day, the sun would be beaming directly onto it through an office window, causing the hardware to overheat and malfunction.

The way they told it made for a better story. :)

8

A colleague wrote a script to remove unused user accounts from an Oracle database. It worked well. Too well. It removed the system and sys users, which are like root in Unix.

7

A few years ago we had a strange problem. We had some low-end machines, provided by the client, that were used simply to collect data through a serial port. It was that simple: a C program running on Linux. Every night the file was rolled by a cron job, which appended the current date to the filename.

It all worked well until the client asked us to start collecting data from another source and provided another machine for this. It was the same process we had been through many times, but this time the rolling process did not work. The file was always named with the date January 1, 1970, that is, time_t 0. When we checked the date on the machine, it appeared to be correct, but every day the file was overwritten with the January 1, 1970 date.

After many days of looking for a problem or bug in the rolling process, somebody ran date twice in a row in the console, and the second call showed a time a few seconds before the first one! Something surely wasn't right. In the end, the problem was that this particular IBM NetVista model had a firmware problem that gave the OS an inconsistent clock time when ACPI was turned on and a Hyper-Threading processor was present.

After this, every time a client was about to give us a machine, we asked: "Will that be an IBM NetVista?"

+1 to "I guess that will learn me for assuming the hardware is always correct."

5

During one of my internships, I hit a weird deadlock problem that could only be reproduced on our most powerful machines (reserved for senior developers, not interns), if you jiggled the mouse for about fifteen minutes. And exclusively on a release build, with traces turned off.

It ended up being a deadlock between the licencing mechanism and the DLL loading library, in someone else's code, that could only be triggered with weird event timing (hence the mouse jiggling). But I did spend about two weeks jiggling that wretched mouse, until a more senior developer helped me debug the whole thing in assembly language, and figure out the culprit.

5

I was making a platformer shooter game. One day, while testing it, a grenadier guy died, but at the same time his body started to spew grenades like mad. It was like a fountain, but of grenades. Quickly everything was exploding and the system was slowing down with so many things going on.

Then it didn't happen again for some time.

Then it happened to a medic: he died, and his body kept healing his last patient forever.

I noticed that both actions had one thing in common: they needed to switch animations. I went to the code and found that if a character died EXACTLY when he had just started an action (that is, within the same 1/60 of a second), ALL the conditions in the animation code were skipped, and this left the character stuck both dead and doing whatever he had just started to do.

Before fixing the bug (which involved just adding another "if" to check whether the character was dead and, if so, set him only as dead), I tested with all sorts of actions, and with some heavy luck and patience I could produce all sorts of bizarre behavior (like turning a pistol into a deadly machine gun in the hands of a dead guy).
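
A minimal sketch in C of that kind of fix, assuming a per-frame animation update. The Character struct, the Anim enum, and update_animation are all hypothetical names for illustration, not the game's real code. The point is only that the death check comes first and wins over everything else.

```c
#include <stdbool.h>

typedef enum { ANIM_IDLE, ANIM_ACTION, ANIM_DEAD } Anim;

typedef struct {
    int  health;            /* <= 0 means the character is dead */
    bool action_requested;  /* an action (throw, heal, shoot) started this frame */
    Anim anim;              /* animation currently playing */
} Character;

/* Called once per frame. Checking for death before anything else means a
 * character that dies on the same frame an action starts ends up only dead. */
void update_animation(Character *c) {
    if (c->health <= 0) {
        c->anim = ANIM_DEAD;
        c->action_requested = false;  /* cancel whatever was just started */
        return;
    }
    c->anim = c->action_requested ? ANIM_ACTION : ANIM_IDLE;
}
```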

4

Here's a bug I logged on Stack Overflow because it was so weird that I thought the solution would be very useful to anybody who ran into a similar bug. Basically, the bug is that the syntax for accessing an array index and the syntax for passing an argument to a function are exactly the same in VB.Net. Combine that with the fact that you don't have to include brackets when calling functions with no parameters, and, well, hilarity ensues. Read the link for more details.

4

I'm surprised no-one has mentioned the Daily WTF yet...

4

Last April, I was working on an interoperability demo to be shown, on stage, at the RSA Security Conference. On the Saturday morning immediately preceding the event, my demo system worked perfectly correctly - signed messages were being sent from a client to a web service, validated, and all was well. On returning to it on Saturday night, just to check, all messages were being rejected by the server, since their timestamp was one hour in the future.

After several hours of debugging (into the small hours of Sunday morning), I identified a very subtle issue in the Java XML and WebServices Security code - an error in calculating time zone offsets, which only surfaced during the 8 hours between 2am UTC and 2am Pacific time, on the morning of the DST changeover.

Now, this wouldn't have hit me at all on this particular night, but the JVM on the demo server missed the patch for the DST change, so the server thought it was DST changeover night. Grrr!!!

And, of course, by Sunday morning, the situation had corrected itself, so if I hadn't looked at the demo on the Saturday night, I'd have been blissfully ignorant. But the bug would have still been there in the XWSS stack, lurking until the next DST changeover...

3

I worked on a product where every time the CEO gave a demo it crashed. It only happened when he gave a demo; nobody else could make it happen. It turned out he was accessing some code that we didn't even know was on the system, and something terrible was happening with a thread.

Needless to say it made us sweat for a while until we figured it out.

3

One tricky Heisenbug that I tracked down actually turned out to be a hardware problem. I was working on an embedded system that communicated over a bus to another system. Since we were still in the prototyping stage the hardware guys just milled up an adapter board for the system (the actual board was delayed by several months so I was stuck with the milled board...).

Everything worked fine during the initial development until the software side of things was almost done. Then strange things started to pop up. I would debug the data transfer process line-by-line and everything would be fine. But when I ran the system without debugging, it would fail in strange and seemingly random places (but consistent ones, e.g. always step 3 or 7, sometimes 9).

Getting frustrated, I started to debug the values sent to the bus. I would write a value to the bus and read it straight back again and it was correct, but bulk reads or writes would be different. When I debugged the process everything worked. It all seemed very strange!

After writing lots of sanity checking code and looking at the logic analyzer traces we realised that our little milled board was getting cross talk changing the signals when operating at high speed! Debugging was fine since the bus had plenty of time to stabilize before the next value was applied. Bulk or fast transfers failed since they were too fast and the board was not grounded properly.

It seems so obvious in hindsight but at the time it drove me crazy :) I guess that will learn me for assuming the hardware is always correct.

3

Years ago (back when memory was expensive) I had a programmer that did everything he could to save space (Hey, Scott!). We had a field that held the text representation of the date & time to the second. He counted the number of characters in a maximum-length time string and that's how large he made a char[] buffer in a C routine.

The date numbers and the hour did not (for some reason) have a leading zero when < 10. C adds a 0x00 byte at the end of a string, so this buffer was one character short, but it only overflowed when the day, the month, and the hour were all >= 10. On top of that, the value on the stack that was getting clobbered sometimes would be looked at later in the routine and sometimes not. Oh, and this was in a multi-threaded / multi-process environment. Cheeerist!

It took over two weeks to find this sucker.
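
Here is a small C sketch of that off-by-one, assuming a made-up timestamp layout (month/day/year hour:min:sec, with no leading zero on month, day, and hour). The format string, the macro names, and format_stamp are hypothetical; the original routine is long gone. It sizes the buffer as the longest text plus one byte for the terminating 0x00, and uses snprintf so a too-small buffer is reported instead of clobbering the stack.

```c
#include <stdio.h>
#include <stddef.h>

/* Longest possible text, e.g. "12/31/1999 23:59:59", is 19 characters. */
#define MAX_STAMP_CHARS 19
/* The buffer needs one extra byte for the terminating 0x00. */
#define STAMP_BUF_SIZE  (MAX_STAMP_CHARS + 1)

/* Formats the timestamp into buf. Returns the text length on success,
 * or -1 if the text (plus terminator) does not fit in size bytes. */
int format_stamp(char *buf, size_t size,
                 int month, int day, int year,
                 int hour, int min, int sec) {
    int n = snprintf(buf, size, "%d/%d/%d %d:%02d:%02d",
                     month, day, year, hour, min, sec);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}
```

With a buffer of only MAX_STAMP_CHARS bytes, short dates still fit, which is why a bug like this can hide for most of the year.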

2

I recommend this very nice book: It's Not a Bug, It's a Feature!: Computer Wit and Wisdom

2

Some years ago we had a problem with one of our new customers, where a part of our (client/server) application would not work. This is not an application I developed, it was developed by another office of ours abroad.

They were using the exact release as I was and I could not reproduce it in the office at all. I tried various things all with no success. The client was getting angry and they were a big client. There was talk of contracts being cancelled.

In desperation I borrowed a new server that we had just received that was a similar spec to theirs, and did a from-scratch installation. I now had the same issue as the client. I formatted and installed onto another blank machine (same OS etc) with the same media and this time it worked.

Eventually I realised the problem. The client's server, and one of the servers I tried had a Hyperthreading CPU, and one of our components just refused to run with it. The dev team refused to believe it. We ended up buying and shipping them a Hyperthreading machine so they could see first-hand that this was indeed the case.

It was eventually fixed.

2

http://www.crunchgear.com/2008/12/31/zune-bug-explained-in-detail/

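I always find it interesting how Microsoft let this happen. On the last day of a leap year, days is exactly 366, so the inner if never fires, nothing changes, and the loop spins forever: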

year = ORIGINYEAR; /* = 1980 */

while (days > 365)
{
    if (IsLeapYear(year))
    {
        if (days > 366)
        {
            days -= 366;
            year += 1;
        }
    }
    else
    {
        days -= 365;
        year += 1;
    }
}
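
A minimal sketch in C of one way to make that loop terminate. This is not the fix that was actually shipped, just an illustration. The names is_leap_year and year_from_days are hypothetical; ORIGINYEAR and the meaning of days (1-based day count starting at January 1 of the origin year) are taken from the snippet above.

```c
#include <stdbool.h>
#include <stddef.h>

#define ORIGINYEAR 1980

static bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Converts a 1-based day count since January 1 of ORIGINYEAR into a year.
 * If day_of_year is not NULL, it receives the 1-based day within that year. */
int year_from_days(int days, int *day_of_year) {
    int year = ORIGINYEAR;
    for (;;) {
        int year_length = is_leap_year(year) ? 366 : 365;
        if (days <= year_length) {
            break;              /* day 366 of a leap year stops here */
        }
        days -= year_length;    /* every other path makes progress */
        year += 1;
    }
    if (day_of_year != NULL) {
        *day_of_year = days;
    }
    return year;
}
```

The design point is that each pass through the loop either exits or subtracts a full year, so there is no input for which it can spin without changing anything.
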

2

I was working on a system which communicated with a SQL Server, back in 1996/97. At seemingly random times, the performance of the SQL Server would degrade quite significantly. The programmers would then ask the (inexperienced) DBA to look into the problem. As soon as he did, the problem resolved itself, and performance picked back up.

To cut a long story short, we eventually figured out that it was the simple act of the DBA logging into the server hosting the RDBMS that was correcting the problem. The server was, at the time, running under Windows NT 3.5, with some default installation details. One of those defaults was the screen saver: the system was using the Pipes screen saver, which would show a randomly generated 3D image of pipelines, and the screen saver was what was consuming enough server resources to cause the performance issue.

1

Many years ago I ran across an issue where Turbo Pascal would NOT accept a line setting "c" equal to 1 on a given line in one of my programs. Move it up or down a line, change the variable name, change the number, and it would take it. Very strange...

1

I wrote self-modifying code in x86 assembly. The modified instructions were in a loop only a few instructions after the code that made the modifications. The problem was that the processor's internal instruction pipeline (it was an old, old Intel i486) could already be filled with the old instructions, so my modifications were not picked up and the old versions of the instructions were executed. The problem was complicated by the fact that it worked fine when you debugged the code: during debugging, the processor's pipeline was refilled many times with debug code.

1

One that to this day I don't know the real problem:

This was back when the 1.2 MB floppies hadn't yet displaced the 360k ones; both were in use.

We had horrible luck with our software misbehaving on one customer's machine. Several inexplicable bugs and one of the two dongles would never work. Several sets of floppies were sent out after having been tested at our location and working correctly but failing there.

Finally, a lucky break: someone who happened to be in our office overheard a conversation and mentioned a problem with certain systems misreading 360k floppies in 1.2 MB drives. As the target system was an XT (with a 360k floppy drive) networked to an AT (with the 1.2 MB floppy drive), it was quite possible to try it from the 360k drive, and, baffling as it seems, all of the craziness disappeared.

Since we always started from blank floppies and copied files on with a batch file things would always end up in the same spots and apparently the error was consistent in this case.

Of course at this point the real problem began: They had a guy there setting things up that was a total moron. As far as I could tell the real problem was that the XT machine wasn't configured to see the printer on the AT machine. If our program ran into that and barfed at being asked to print his reaction would be to reinstall--from the AT machine because it was faster. The only way to get him to install from the XT machine was to stay on the line with him while he did the whole thing--which of course did nothing about the printer problem. We had no experience with networking whatsoever and couldn't give any advice on that. I'm not sure the client EVER got things running because of that moron.

1

If you've ever played Worms Armageddon - this seems to happen on some configurations. We never managed to reproduce it...

1

After too much programming in Java, where it checks for every single issue, I got started on writing some code in C. The program didn't work. Upon close inspection of the code, I noticed I had this:

a=0;
for (i=0;i< n;i+=sizeof(mystruct))
if (a< complex_formula_involving_i_and_the_structure)
    a=complex_formula_involving_i_and_the_structure;

I was puzzled how a could get negative values at the end of that loop. I put a few printfs to check the values right before the if, and it was basically telling me that (0< -1)==true...

Finally after hours of being puzzled by this, I put a cast to int before that formula, and it worked. Still, it reminded me why I hate C.
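
For anyone who hasn't been bitten by this yet, here is a small C sketch of what was most likely going on. The struct and function names are hypothetical, and the formula is a stand-in for the real one. Because sizeof yields an unsigned size_t, mixing it into the formula makes the whole comparison unsigned, so a "negative" result becomes a huge positive number. The fixed version casts the size to int before doing the arithmetic, a slightly stricter variant of the cast described above.

```c
#include <stddef.h>

struct record {
    int    id;
    double value;
};

/* Buggy: (i - 1) is converted to size_t before the multiply, and a is
 * converted to size_t for the comparison, so the test is done unsigned. */
int less_than_buggy(int a, int i) {
    return a < (i - 1) * sizeof(struct record);
}

/* Fixed: convert the size to int first so the arithmetic stays signed. */
int less_than_fixed(int a, int i) {
    return a < (i - 1) * (int)sizeof(struct record);
}
```

With a = 0 and i = 0 the buggy version reports that 0 is less than a negative value, which matches the (0 < -1) == true puzzle.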

1

In a platformer game that I was making, out of laziness, instead of asking the artists to fix the collision boxes for some AI characters that were getting stuck in the ground, I coded them to jump if they got stuck. It worked mostly fine.

Until someone changed the boxes and made the behavior worse. One character jumped non-stop; the dev team quickly started calling him "Happy Deer", and that name stuck.

Another character jumped non-stop when climbing stairs or ramps. Unfortunately, since he was often near stairs or ramps, this resulted in him jumping into minefields, enemies, or something even worse.

This character got nicknamed "stupid" (it was a big character, with the face of a stupid guy, and it held the biggest weapon in the game, so the name fit him perfectly).

People enjoyed "stupid"'s behavior so much (because he was your ally, and after some tweaking of the level design, instead of killing himself stupidly, he started to look like some wannabe hero, saving you even when you didn't need it) that I did not fix the bug; I only worked with the artists to fix the collision box of the constantly jumping character.

The funniest part was that many people praised my AI coding skills because of "stupid"'s behavior, even with me insisting that his AI was the SAME as that of the other allied characters (all allied characters just called the same function, which did the AI calculations and returned simulated inputs).

1

While experimenting with a new feature for loading graphics tiles at runtime that had recently been added to BYOND, I discovered that it didn't work if the tile in question was mostly black except for the top left quadrant, which was white. The BYOND developer who picked up the bug report found it difficult to believe, describing it as "the craziest thing he had ever heard". It turned out that the heuristic for detecting that a file was text was too high in the pecking order for determining the file type; the black tile with a white top left quadrant happened to pass the text file test, so the function that loaded the file returned a string instead of a graphics tile.

0

Here is an interesting collection of bugs

0

Can you spot the typo?

if (condition) {
//executed sometimes
} else if (condition2); {
//executed all the time.
}

0

I can't find a reference to it now, but I do recall reading years ago about a bug Microsoft had identified in an older version of Excel. Apparently, cells in the spreadsheet would sometimes not autocalc. This would only occur in a cell which had a formula, and the formula was based on another cell in an earlier row, and the cell with the formula was in a row evenly divisible by 8...plus 1.

Always wondered how the hell anyone could track that down.

0

Surprised nobody has mentioned the Android auto-focus bug from last year.

There's a rounding-error bug in the camera driver's autofocus routine (which uses a timestamp) that causes autofocus to behave poorly on a 24.5-day cycle. That is, it'll work for 24.5 days, then have poor performance for 24.5 days, then work again.

The 17th is the start of a new "works correctly" cycle, so the devices will be fine for a while. A permanent fix is in the works.