John D. Cook recently issued a blog post, How many errors are left to find?, in which he introduces yet another silly quantitative model for estimating the number of bugs left in a program.
The Lincoln Index, as Mr. Cook refers to it here, was used as a model for evaluating typographical errors, and was based on a method for estimating the population of a given species of animal. There are several terrible problems with this analysis.
First, reification error. Bugs are relationships, not things in the world. A bug is a perception of a problem in the product; a problem is a difference between what is perceived and what is desired by some person. There are at least four ways to make a problem into a non-problem: 1) Change the perception. 2) Change the product. 3) Change the desire. 4) Ignore the person who perceives the problem. Any time a product owner can say, “That? Nah, that’s not a bug,” the basic unit of the system of measurement is invalidated.
Second, even if we suspended the reification problem, the model is inappropriate. Bugs cannot be usefully modelled as a single kind of problem or a single population. Typographical errors are not the only problems in writing; a perfectly spelled and syntactically correct piece of writing is not necessarily a good piece of writing. Nor are plaice the only species of fish in the fjords, nor are fish the only form of life in the sea, nor do we consider all life forms as equivalently meaningful, significant, benign, or threatening. Bugs have many different manifestations, from capability problems to reliability problems to compatibility problems to performance problems and so forth. Some of those problems don’t have anything to do with coding errors (which themselves could be like typos or grammatical errors or statements that can interpreted ambiguously). Problems in the product may include misunderstood requirements, design problems, valid but misunderstood implementation of the design, and so forth. If you want to compare estimating bugs in a program to a population estimate, it would be more appropriate to compare it to estimating the number of all biological organisms in a given place. Imagine some of the problems in doing that, and you may get some insight into the problem of estimating bugs.
Third, there’s Djikstra’s notion that testing can show the presence of problems, but not their absence. That’s a way of noting that testing is subject to the Halting Problem. Since you can’t tell if you’ve found the last problem in the product, you can’t estimate how many are left in it.
Fourth, the Ludic Fallacy (part one). Discovery and analysis of problems in a product is not a probabilistic game, but a non-linear, organic system of exploration, discovery, investigation, and learning. Problems are discovered at neither a steady nor a random rate. Indeed, discoveries often happen in clusters as the tester learns about the program and things that might threaten its value. The Lincoln Index, focused on typos—a highly decidable and easily understood problem that could largely be accomplished by checking—doesn’t fit for software testing.
Fifth, the Ludic Fallacy (part two). Mr. Cook’s analysis implies that all problems are of equal value. Those of us who have done testing and studied it for a long time know that, from one time to another, some testers find a bunch of problems, and others find relatively few. Yet those few problems might be of critical significance, and the many of lesser significance. It’s an error to think in terms of a probabilistic model without thinking in terms of the payoff. Related to that is the idea that the number of bugs remaining in the product may not be that big a deal. All the little problems might pale in significance next to the one terrible problem; the one terrible problem might be easily fixable while the little problems grind down the users’ will to live.
Sixth, measurement-induced distortion. Whenever you measure a self-aware system, you are likely to introduce distortion (at best) and dysfunction (at worst), as the system adapts itself to optimize the thing that’s being measured. Count bugs, and your testers will report more bugs—but finding more bugs can get in the way of finding more important bugs. That’s at least in part because of…
Seventh, the Lumping Problem (or more formally, Assimiliation Bias). Testing is not a single activity; it’s a collection of activities that includes (at least) setup, investigation and reporting, and design and execution. Setup and investigation and reporting take time away from test coverage. When a tester finds a problem, she investigates reports it. That time is time that she can’t spend finding other problems. The irony here is that the more problems you find, the fewer problems you have time to find. The quality of testing work also involves the quality of the report. Reporting time, since it isn’t taken into account in the model, will distort the perception of the number of bugs remaining.
Eighth, estimating the number of problems remaining in the product takes time away from sensible, productive activities. Considering that the number of problems remaining is subjective, open-ended, and unprovable, one might be inclined to think that counting how many problems are left is a waste of time better spent on searching for other bad ones.
I don’t think I’ve found the last remaining problem with this model.
But it does remind me that when people see bugs as units and testing as piecework, rather than the complex, non-linear, cognitive process that it is, they start inventing all these weird, baseless, silly quantitative models that are at best unhelpful and that, more likely, threaten the quality of testing on the project.