This is a letter that I would not show to a programmer in a real-life situation. I’ve often thought of bits of it at a time, and those bits come up in conversation occasionally, but not all at once.
This is based on an observation of the chat window in Skype 126.96.36.199.
I discovered a bug today. I’ll tell you how I found it. It’s pretty easy to reproduce. There’s this input field in our program. I didn’t know what the intended limit was. It was documented somewhere, but that part of the spec got deleted when the CM system went down last week. I could have asked you, but you were downstairs getting another latte.
Plus, it’s really quick and easy to find out empirically; quicker than looking it up, quicker than asking you, even if you were here. There’s this tool called PerlClip that allows me to create strings that look like this
As you’ll notice, the string itself tells you about its own length. The number to the left of each asterisk tells you the offset position of that asterisk in the string. (You can use whatever character you like for a delimiter, including letters and numbers, so that you can test fields that filter unwanted characters.)
It takes a handful of keystrokes to generate a string of tremendous length, millions of characters. The tool automatically copies it to the Windows clipboard, whereupon you can paste it into an input field. Right away, you get to see the apparent limit of the field; find an asterisk, and you can figure out in a moment exactly how many characters it accepts. It makes it easy to produce all kinds of strings using Perl syntax, which saves you having to write a line of Perl script to do it and another few lines to get it into the clipboard. In fact, you can give PerlClip to a less-experienced tester that doesn’t know Perl syntax at all (yet), show them a few examples and the online help, and they can get plenty of bang for the buck. They get to learn something about Perl, too. This little tool is like a keychain version of a Swiss Army knife for data generation. It’s dead handy for analyzing input constraints. It allows you to create all kinds of cool patterns, or data that describes itself, and you can store the output wherever you can paste from the clipboard. Oh, and it’s free.
You can get a copy of PerlClip here, by the way. It was written by James Bach and Danny Faught. The idea started with a Perl one-liner by Danny, and they build on each other’s ideas for it. I don’t think it took them very long to write it. Once you’ve had the idea, it’s a pretty trivial program to implement. But still, kind of a cool idea, don’t you think?
So anyway, I created a string a million characters long, and I pasted it into the chat window input field. I saw that the input field apparently accepted 32768 characters before it truncated the rest of the input. So I guess your limit is 32768 characters.
Then I pressed “Send”, and the text appeared in the output field. Well, not all of it. I saw the first 29996 characters, and then two periods, and then nothing else. The rest of the text had vanished.
That’s weird. It doesn’t seem like a big deal, does it? Yet there’s this thing called representativeness bias. It’s a critical thinking error, the phenomenon that causes us to believe that a big problem always looks big from every angle, and that an observation of a problem with little manifestations always has little consequences.
Our biases are influenced by our world views. For example, last week when that tester found that crash in that critical routine, everyone else panicked, but you realized that it was only a one-byte fix and we were back in business within a few minutes. It also goes the other way, though: something that looks trivial or harmless can have dire and shocking consequences, made all the more risky because of the trivial nature of the symptom. If we think symptoms and problems and fixes are all alike in terms of significance, when we see a trivial symptom, no one bothers to investigate the problem. It’s only a little rounding error, and it only happens on one transaction in ten, and it only costs half a cent at most. When that rounding error is multiplied over hundreds of transactions a minute, tens of thousands an hour… well you get the point.
I’m well aware that, as a test, this is a toy. It’s like a security check where you rattle the doorknob. It’s like testing a car by kicking the tires. And the result that I’m seeing is like the doorknob falling off, or the door opening, or a tire suddenly hissing. For a tester, this is a mere bagatelle. It’s a trivial test. Yet when a trivial test reveals something that we can’t explain immediately, it might be good idea to seek an explanation.
A few things occurred to me as possibilities.
- The first one is that someone, somewhere, is missing some kind of internal check in the code. Maybe it’s you; maybe it’s the guy who wrote the parser downstream, maybe it’s the guy that’s writing the display engine. But it seems to me as though you figured that you could send 32768 bytes, someone else has a limit of 29998 bytes. Or 29996, probably. Well, maybe.
- Maybe one of you isn’t aware of the published limits of the third-party toolkits you’re using. That wouldn’t be the first time. It wouldn’t necessarily be negligence on your part, either—the docs for those toolkits are terrible, I know.
- Maybe the published limit is available, but there’s simply a bug in one of those toolkits. In that case, maybe there isn’t a big problem here, but there’s a much bigger problem that the toolkit causes elsewhere in the code.
- Maybe you’re not using third-party toolkits. Maybe they’re toolkits that we developed here. Mind you, that’s exactly the same as the last problem; if you’re not aware of the limits, or if there’s a bug, who produced the code has no bearing on the behaviour of the code.
- Maybe you’re not using toolkits at all, for any given function. Mind you, that doesn’t change the nature of the problems above either.
- Maybe some downstream guy is truncating everything over 29996 bytes, placing those two dots at the end, and ignoring everything else, and and he’s not sending a return value to you to let you know that he’s doing it.
- Maybe he is sending you a return value, but the wrong one.
- Maybe he’s sending you a return value, and you’re ignoring it.
- Maybe he’s sending you a return value, and you are paying attention to it, but there’s some confusion about what it means and how it should be handled.
- Maybe you’re truncating the last two and a half kilobytes or so of data before you send it on, and we’re not telling the user about it. Maybe that’s your intention. Seems a little rude to me to do that, but to you, it works as designed. To some user, it doesn’t work—as designed.
- Maybe there’s no one else involved, and it’s just you working on all those bits of the code, but the program has now become sufficiently complex that you’re unable to keep everything in your head. That stands to reason; it is a complicated program, with lots of bits and pieces.
- Maybe you’re depending on unit tests to tell you if anything is wrong with the individual functions or objects. But maybe nothing is wrong with any particular one of them in isolation; maybe it’s the interaction between them that’s problemmatic.
- Maybe you don’t have any unit tests at all.
- Maybe you do have unit tests for this stuff. From right here, I can’t tell. If you do have them, I can’t tell whether your checks are really great and you just missed one this time, or if you missed a few, or if you missed a bunch of them, or whether there’s a ton of them and they’re all really lousy.
Any of the above explanations could be in play, many of them simultaneously. No matter what, though, all your unit tests could pass, and you’d never know about the problem until we took out all the mocks and hooked everything up in the real system. Or deployed into the field. (Actually, by now they’re not unit tests; they’re just unit checks, since it’s a while since this part of the code was last looked at and we’ve been seeing green bars for the last few months.)
For any one of the cases above, since it’s so easy to test and check for these things, I would think that if you or anyone else knew about this problem, your sense of professionalism and craftsmanship would tell you to do some testing, write some checks, and fix it. After all, as Uncle Bob Martin said, you guys don’t want us to find any bugs, right?
But it’s not my place to say that. All that stuff is up to you. I don’t tell you how to do your work; I tell you what I observe, in this case entirely from the outside. Plus it’s only one test. I’ll have to do a few more tests to find out if there’s a more general problem. Maybe this is an aberration.
Now, I know you’re fond of saying, “No user would ever do that.” I think what you really mean is no user that you’ve thought of, and that you like, would do that on purpose. But it might be a thought to consider users that you haven’t thought of, however unlikely they and their task might be to you. It could be a good idea to think of users that neither one of us like, such as hackers or identity thieves. It could also be important to think of users that you do like who would do things by accident. People make mistakes all the time. In fact, by accident, I pasted the text of this message into another program, just a second ago.
So far, I’ve only talked about the source of the problem and the trigger for it. I haven’t talked much about possible consequences, or risks. Let’s consider some of those.
- A customer could lose up to 2770 bytes of data. That actually sounds like a low-risk thing, to me. It seems pretty unlikely that someone would type or paste that much data in any kind of routine way. Still, I did hear from one person that they like to paste stack traces into a chat window. You responded rather dismissively to that. It does sound like a corner case.
- Maybe you don’t report truncated data as a matter of course, and there are tons of other problems like this in the code, in places that I’m not yet aware of or that are invisible from the black box. Not this problem, but a problem with the same kind of cause could lead to a much more serious problem than this unlikely scenario.
- Maybe there is a consistent pattern of user interface problems where the internals of the code handle problems but don’t alert the user, even though the user might like to know about them.
- Maybe there’s a buffer overrun. That worries me more—a lot more—than the stack trace thing above. You remember that this kind of problem used to be dismissed as a “corner case” back when we worked at Microsoft—and then how Microsoft shut down new product development spent two months on investigating these kinds of problems, back in the spring of 2002? Hundreds of worms and viruses and denial of service attacks stem from problems whose outward manifestation looked exactly as trivial as this problem. There are variations on it.
- Maybe there’s a buffer overrun that would allow other users to view a conversation that my contact and I would like to keep between ourselves.
- Maybe an appropriately crafted string could allow hackers to get at some of my account information.
- Maybe an appropriately crafted string could allow hackers to get at everyone‘s account information.
- Maybe there’s a vulnerability that allows access to system files, as the Blaster worm did.
- Maybe the product is now unstable, and there’s a crash about to happen that hasn’t yet manifested itself. We never know for sure if a test is finished.
- Here’s something that I think is more troubling, and perhaps the biggest risk of all. Maybe, by blowing off this report, you’ll discourage testers from reporting a similarly trivial symptom of a much more serious problem. In a meeing a couple of weeks ago, the last time a tester reported something like this, you castigated her in public for the apparently trivial nature of the problem. She was embarrassed and intimidated. These days she doesn’t report anything except symptoms that she thinks you’ll consider sufficiently dramatic. In fact, just yesterday she saw something that she thought to be a pretty serious performance issue, but she’s keeping mum about it. Some time several weeks from now, when we start to do thousands or millions of transactions, you may find yourself wishing that she had felt okay about speaking up today. Or who knows; maybe you’ll just ask her why she didn’t find that bug.
NASA calls this last problem “the normalization of deviance”. In fact, this tiny little inconsistency reminds me of the Challenger problem. Remember that? There were these O-rings that were supposed to keep two chambers of highly-pressurized gases separate from each other. It turns out that on seven of the shuttle flights that preceded the Challenger, these O-rings burned through a bit and some gases leaked (they called this “erosion” and “blow-by”). Various managers managed to convince themselves that it wasn’t a problem, because it only happened on about a third of the flights, and the rings, at most, only burned a third of the way through. Because these “little” problems didn’t result in catastrophe the first seven times, NASA managers used this as evidence for safety. Every successful flight that had the problem was taken as reassurance that NASA could get away with it. In that sense, it was like Nassim Nicholas Taleb’s turkey, who increases his belief in the benevolence of the farmer every day… until some time in the week before Thanksgiving.
Richard Feynman, in his Appendix to the Rogers Commission Report on the Space Shuttle Challenger Accident, nailed the issue:
The phenomenon of accepting for flight, seals that had shown erosion and blow-by in previous flights, is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next.
That’s the problem with any evidence of any bug, at first observation; we only know about a symptom, not the cause, and not the consequences. When the system is in an unpredicted state, it’s in an unpredictable state.
Software is wonderfully deterministic, in that it does exactly what we tell it to do. But, as you know, there’s sometimes a big difference between what we tell it to do and what we meant to tell it to do. When software does what we tell it to do instead of what we meant, we find ourselves off the map that we drew for ourselves. And once we’re off the map, we don’t know where we are.
According to Wikipedia,
Feynman’s investigations also revealed that there had been many serious doubts raised about the O-ring seals by engineers at Morton Thiokol, which made the solid fuel boosters, but communication failures had led to their concerns being ignored by NASA management. He found similar failures in procedure in many other areas at NASA, but singled out its software development for praise due to its rigorous and highly effective quality control procedures – then under threat from NASA management, which wished to reduce testing to save money given that the tests had always been passed.
At NASA, back then, the software people realized that just because their checks were passing, it didn’t mean that they should relax their diligence. They realized that what really reduced risk on the project was appropriate testing, lots of tests, and paying attention to seemingly inconsequential failures.
I know we’re not sending people to the moon here. Even though we don’t know the consequences of this inconsistency, it’s hard to conceive of anyone dying because of it. So let’s make it clear: I’m not saying that the sky is falling, and I’m not making a value judgment as to whether we should fix it. That stuff is for you and the project managers to decide upon. It’s simply my role to observe it, to investigate it, and to report it.
I think it might be important, though, for us to understand why the problem is there in the first place. That’s because I don’t know whether the problem that I’m seeing is a big deal. And the thing is, until you’ve looked at the code, neither do you.
As always, it’s your call. And as usual, I’m happy to assist you in running whatever tests you’d like me to run on your behalf. I’ll also poke around and see if I can find any other surprises.
P.S. I did run a second test. This time, I used PerlClip to craft a string of 100000 instances of :). That pair of characters, in normal circumstances, results in a smiley-face emoticon. It seemed as though the input field accepted the characters literally, and then converted them to the graphical smiley face. It took a long, long time for the input field to render this. I thought that my chat window had crashed, but it hadn’t. Eventually it finished processing, and displayed what it had parsed from this odd input. I didn’t see 32768 smileys, nor 29996, nor 16384, nor 14998. I saw exactly two dots. Weird, huh?