Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.
The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.
(William E. Howden, “A Survey of Dynamic Analysis Methods”, in Software Validation and Testing Techniques, IEEE Computer Society, 1981)
While we have a great deal of respect for the work of testing pioneers like Prof. Howden, there are some problems with this description of testing and its focus on correctness.
- Correct output from a computer program is not an absolute; an outcome is only correct or incorrect relative to some model, theory, or principle.
Trivial example: Even the mathematical rule “one divided by two equals one-half” is a heuristic for dividing things. In most domains, it’s true, but as in George Carlin’s joke, when you cut a crumb in two, you don’t have two half-crumbs; you have two crumbs.
- A product can produce a result that is functionally correct, and yet still be deeply unsatisfactory to its user.
Trivial example: a calculator returns the value “4” from the function “2 + 2”, and displays the result in white on a white background.
- Conversely, a product can produce an incorrect result and still be quite acceptable.
Trivial example: a computer desktop clock’s internal state and second hand drift a few tenths of a second each second, but the program resets itself to be consistent with an atomic clock at the top of every minute. The desktop clock almost never shows the right time precisely, but the human observer doesn’t notice and doesn’t really care.
Another trivial example: a product might return a calculation inconsistent with its oracle in the tenth decimal place, when only the first two or three decimal places really matter.
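One way to encode "only the first few decimal places matter" in a check is to compare against the oracle within a tolerance rather than for exact equality. The sketch below is illustrative only: the function name `agrees_with_oracle`, the sample values, and the tolerance of 0.001 are assumptions chosen for this example, not part of any particular tool.

```python
import math

def agrees_with_oracle(observed: float, expected: float, abs_tol: float = 1e-3) -> bool:
    """Return True when observed is within abs_tol of the oracle's value."""
    # rel_tol=0.0 so that only the absolute tolerance decides the outcome.
    return math.isclose(observed, expected, rel_tol=0.0, abs_tol=abs_tol)

oracle_value = 3.1415926535   # hypothetical value supplied by the oracle
product_value = 3.1415926536  # hypothetical product output; differs near the tenth decimal place

exact_match = product_value == oracle_value
close_enough = agrees_with_oracle(product_value, oracle_value)
print(exact_match, close_enough)  # False True
```

Choosing the tolerance is itself a judgment about what matters to the user; the check cannot make that decision for us.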
- The correct outcome of a program or function is not always known in advance.
Some development and testing work, like some science, is done in an attempt to discover something new; to establish what a correct answer might look like; to explore a mathematical model; to learn about the limitations of a novel system. In such cases, our ideas of correctness or acceptability are not clear from the outset, and must be developed. (See Collins and Pinch’s The Golem books, which discuss the messiness and confusion of controversial science.)
Trivial example: in benchmarking, correctness is not at issue. Comparison between one system and another (or versions of the same system at different times) is the mission of testing here.
- As we’re developing and testing a product, we may observe things that are unexpected, under-described or completely undescribed.
In order to program a machine to make an observation, we must anticipate that observation and encode it. The machine doesn’t imagine, invent, or learn, and a machine cannot produce an unanticipated oracle in response to an observation. By contrast, human observers continually learn and refine their ideas on what to observe. Sometimes we observe a problem without having anticipated it. Sometimes we become aware that we’re making a new observation—one that may or may not represent a problem.
Distinct from checking, testing continually affords new things to observe. Testing prompts us to decide when new observations represent problems, and testing informs decisions about what to do about them.
- An oracle may be in error, or irrelevant.
Trivial examples: a program that checks the output of another program may have its own bugs. A reference document may be outdated. A subject matter expert who is usually a reliable source of information may have forgotten something.
- Oracles might be inconsistent with each other.
Even though we have some powerful models for it, temperature measurement in climatology is inherently uncertain. What is the “correct” temperature outdoors? In the sunlight? In the shade? When the thermometer is near a building or farther away? Over grass, or over pavement? Some of the issues are described in this remarkable article (read the comments, too).
- Although we can demonstrate incorrectness in a program, we cannot prove a program to be correct.
As Dijkstra put it, testing can only show the presence of errors, not their absence; and to go even deeper, Popper pointed out that theories can only be falsified, and not proven.
Trivial example: No matter how many tests we run on that calculator, we can never know that it will always return 4 given the inputs 2 + 2; we can only infer that it will do so through induction, and induction can be deeply problematic. In Nassim Taleb’s example (cribbed from Bertrand Russell and David Hume), every day the turkey uses induction to reinforce his belief in the farmer’s devotion to the desires and interests of turkeys. Then, a few days before Thanksgiving, the turkey receives a very sudden, unpleasant, and (alas for the turkey) momentary flash of insight.
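To make the induction problem concrete, here is a deliberately contrived sketch. The `add` function and its hidden defect are invented for illustration; real defects are rarely this tidy, but the logic is the same: a million agreeing observations do not rule out a disagreeing one.

```python
def add(a: int, b: int) -> int:
    """Hypothetical buggy adder: wrong for exactly one pair of inputs."""
    if (a, b) == (7919, 104729):  # hidden defect, invented for this sketch
        return 0
    return a + b

# A million "passing" checks over small inputs: a million white swans.
passes = sum(1 for a in range(1000) for b in range(1000) if add(a, b) == a + b)
print(passes)  # 1000000

# One further observation falsifies the theory that add() is correct.
print(add(7919, 104729) == 7919 + 104729)  # False
```

Nothing in the first million results hints at the failing input; only a test that happens to reach it reveals the problem.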
- Sometimes we don’t need to know the correct result to know that the observed result is wrong.
Trivial example: the range of the cosine function is from -1 to 1. I don’t need to know the correct value for cos(72) to know that an output of 4.2 is wrong. Elaine Weyuker discusses this in her paper “On Testing Nontestable Programs” (Department of Computer Science, Courant Institute of Mathematical Sciences, New York University): “Frequently the tester is able to state with assurance that a result is incorrect without actually knowing the correct answer.”
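The cosine case can be sketched as a partial oracle: a check that recognizes some wrong answers without knowing the right one. The function name `cosine_output_is_plausible` is hypothetical, chosen for this example.

```python
import math

def cosine_output_is_plausible(value: float) -> bool:
    """Partial oracle: flags impossible values without knowing the correct one."""
    # Every real cosine lies in [-1, 1]; anything outside is certainly wrong.
    return -1.0 <= value <= 1.0

print(cosine_output_is_plausible(4.2))           # False: cannot be a cosine
print(cosine_output_is_plausible(math.cos(72)))  # True: plausible, not proven correct
```

Note the asymmetry: a `False` result identifies a problem, while a `True` result only means this particular oracle found nothing to object to. An output of 0.5 would also pass, whether or not it is the right value for cos(72).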
Checking for correctness, especially when the test output is observed and evaluated mechanically or indirectly, is a risky business. All oracles are fallible. A “passing” test based on comparison with a fallible oracle cannot prove correctness, and no number of “passing” tests can do that. In this, a test is like a scientific experiment: an experiment’s outcome can falsify one theory while supporting another, but an experiment cannot prove a theory to be true.
A million observations of white swans say nothing about the possibility that there might be black swans; a million passing tests, a million observations of correct behaviour, cannot eliminate the possibility that there might be swarms of bugs. At best, a passing test is essentially the observation of one more white swan. We urge those who rely on passing acceptance tests to remember this.
A check can suggest the presence of a problem, or can at best provide support for the idea that the program can work. But no matter what oracle we might use, a test cannot prove that a program is working correctly, or that the program will work. So what can oracles actually do for us?
If we invert the focus on correctness, we can produce a more robust heuristic. We can’t logically use an oracle to prove that a system is behaving correctly or that it will behave correctly, but we can use an oracle to help falsify the theory that it is behaving correctly.
This is why, in Rapid Software Testing, we say that an oracle is a means by which we recognize a problem when we encounter one in testing.
This post was updated 2023-08-11 for formatting and a minor but important change to the last sentence.
Note that the formatting, as of this writing, is still off. I bet you can see that too. Did you apply a correctness criterion before you recognized the problem?