Oracles Are About Problems, Not Correctness

As James Bach and I have been refining our ideas of testing, we’ve been refining our ideas about oracles. In a recent post, I referred to this passage:

Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.

The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.

—William E. Howden, A Survey of Dynamic Analysis Methods, in Software Validation and Testing Techniques, IEEE Computer Society, 1981

While we have a great deal of respect for the work of testing pioneers like Prof. Howden, there are some problems with this description of testing and its focus on correctness.

  • Correct output from a computer program is not an absolute; an outcome is only correct or incorrect relative to some model, theory, or principle.
    Trivial example: Even the mathematical rule “one divided by two equals one-half” is a heuristic for dividing things. In most domains, it’s true, but as in George Carlin’s joke, when you cut a crumb in two, you don’t have two half-crumbs; you have two crumbs.
  • A product can produce a result that is functionally correct, and yet still be deeply unsatisfactory to its user.
    Trivial example: a calculator returns the value “4” from the function “2 + 2”, and displays the result in white on a white background.
  • Conversely, a product can produce an incorrect result and still be quite acceptable.
    Trivial example: a computer desktop clock’s internal state and second hand drift a few tenths of a second each second, but the program resets itself to be consistent with an atomic clock at the top of every minute. The desktop clock almost never shows the right time precisely, but the human observer doesn’t notice and doesn’t really care.

    Another trivial example: a product might return a calculation inconsistent with its oracle in the tenth decimal place, when only the first two or three decimal places really matter.
  • The correct outcome of a program or function is not always known in advance.
    Some development and testing work, like some science, is done in an attempt to discover something new; to establish what a correct answer might look like; to explore a mathematical model; to learn about the limitations of a novel system. In such cases, our ideas of correctness or acceptability are not clear from the outset, and must be developed. (See Collins and Pinch’s The Golem books, which discuss the messiness and confusion of controversial science.)

    Trivial example: in benchmarking, correctness is not at issue. Comparison between one system and another (or versions of the same system at different times) is the mission of testing here.
  • As we’re developing and testing a product, we may observe things that are unexpected, under-described or completely undescribed.
    In order to program a machine to make an observation, we must anticipate that observation and encode it. The machine doesn’t imagine, invent, or learn, and a machine cannot produce an unanticipated oracle in response to an observation. By contrast, human observers continually learn and refine their ideas on what to observe. Sometimes we observe a problem without having anticipated it. Sometimes we become aware that we’re making a new observation—one that may or may not represent a problem.

    Distinct from checking, testing continually affords new things to observe. Testing prompts us to decide when new observations represent problems, and testing informs decisions about what to do about them.
  • An oracle may be in error, or irrelevant.
    Trivial examples: a program that checks the output of another program may have its own bugs. A reference document may be outdated. A subject matter expert who is usually a reliable source of information may have forgotten something.
  • Oracles might be inconsistent with each other.
    Even though we have some powerful models for it, temperature measurement in climatology is inherently uncertain. What is the “correct” temperature outdoors? In the sunlight? In the shade? When the thermometer is near a building or farther away? Over grass, or over pavement? Some of the issues are described in this remarkable article (read the comments, too).
  • Although we can demonstrate incorrectness in a program, we cannot prove a program to be correct.
    As Dijkstra put it, testing can only show the presence of errors, not their absence; and to go even deeper, Popper pointed out that theories can only be falsified, and not proven.

    Trivial example: No matter how many tests we run on that calculator, we can never know that it will always return 4 given the inputs 2 + 2; we can only infer that it will do so through induction, and induction can be deeply problematic. In Nassim Taleb’s example (cribbed from Bertrand Russell and David Hume), every day the turkey uses induction to reinforce his belief in the farmer’s devotion to the desires and interests of turkeys, until a few days before Thanksgiving, when the turkey receives a very sudden, unpleasant, and (alas for the turkey) momentary flash of insight.
  • Sometimes we don’t need to know the correct result to know that the observed result is wrong.
    Trivial example: the range of the cosine function is from -1 to 1. I don’t need to know the correct value for cos(72) to know that an output of 4.2 is wrong. Elaine Weyuker discusses this in a paper called “On Testing Nontestable Programs” (Department of Computer Science, Courant Institute of Mathematical Sciences, New York University): “Frequently the tester is able to state with assurance that a result is incorrect without actually knowing the correct answer.”
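
The cosine example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `suspect_cos` is a hypothetical stand-in for a buggy product, and `cosine_range_problems` is a name invented here. The oracle knows nothing about the right answer for cos(72); it only knows that cosine values lie between -1 and 1.

```python
import math
from typing import List


def cosine_range_problems(value: float) -> List[str]:
    """Return the problems this oracle recognizes in a claimed cosine value.

    An empty list means "no problem recognized", not "correct".
    """
    problems = []
    if math.isnan(value):
        problems.append("result is NaN")
    elif not -1.0 <= value <= 1.0:
        problems.append(f"{value} is outside [-1, 1]")
    return problems


def suspect_cos(x: float) -> float:
    # Hypothetical buggy product under test: it always returns 4.2.
    return 4.2


print(cosine_range_problems(suspect_cos(72)))  # ['4.2 is outside [-1, 1]']
print(cosine_range_problems(0.5))              # [] although 0.5 is not cos(72)
```

Note the second call: the oracle stays silent about 0.5, which is within range but is not the cosine of 72 in either degrees or radians. Silence from this oracle tells us only that it did not recognize a problem.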

Checking for correctness, especially when the test output is observed and evaluated mechanically or indirectly, is a risky business. All oracles are fallible. A “passing” test based on comparison with a fallible oracle cannot prove correctness, and no number of “passing” tests can do that. In this, a test is like a scientific experiment: an experiment’s outcome can falsify one theory while supporting another, but an experiment cannot prove a theory to be true.
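
To make the fallibility concrete, here is a minimal Python sketch of a mechanical check whose oracle has its own bug. Everything in it is hypothetical: `product_is_leap` stands in for a product under test (here it simply delegates to the standard library's `calendar.isleap`), and `naive_oracle_is_leap` is a deliberately flawed reference that forgets the century rule.

```python
import calendar


def product_is_leap(year: int) -> bool:
    # Hypothetical product under test; it follows the Gregorian rules.
    return calendar.isleap(year)


def naive_oracle_is_leap(year: int) -> bool:
    # Hypothetical oracle with a bug: it ignores the century rule.
    return year % 4 == 0


def check(year: int) -> bool:
    """Return True when the product and the oracle agree for this year."""
    return product_is_leap(year) == naive_oracle_is_leap(year)


print(check(2024))  # True: product and oracle agree
print(check(1900))  # False: they disagree, and here the oracle is the one in error
```

The "failing" check for 1900 points at a problem, but the problem is in the oracle. And had the product shared the oracle's bug, the check for 1900 would have "passed" while both were wrong.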

A million observations of white swans say nothing about the possibility that there might be black swans; a million passing tests, a million observations of correct behaviour, cannot eliminate the possibility that there might be swarms of bugs. At best, a passing test is essentially the observation of one more white swan. We urge those who rely on passing acceptance tests to remember this.

A check can suggest the presence of a problem, or can at best provide support for the idea that the program can work. But no matter what oracle we might use, a test cannot prove that a program is working correctly, or that the program will work. So what can oracles actually do for us?

If we invert the focus on correctness, we can produce a more robust heuristic. We can’t logically use an oracle to prove that a system is behaving correctly or that it will behave correctly, but we can use an oracle to help falsify the theory that it is behaving correctly.
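
One way to picture the inversion is a check that can report a problem but never reports "correct". The Python sketch below is an assumption-laden illustration, not a prescribed technique: the function name `recognize_problem` and the parameter `places_that_matter` are invented here. It picks up the earlier example of a calculation that differs from its oracle in the tenth decimal place when only the first few places matter.

```python
from typing import Optional


def recognize_problem(observed: float, reference: float,
                      places_that_matter: int = 3) -> Optional[str]:
    """Return a description of a recognized problem, or None.

    None means "this oracle recognized no problem", which is weaker
    than "the observed value is correct".
    """
    # Differences smaller than half a unit in the last place that
    # matters are treated as not worth reporting.
    tolerance = 0.5 * 10 ** -places_that_matter
    if abs(observed - reference) > tolerance:
        return (f"observed {observed} differs from reference {reference} "
                f"by more than {tolerance}")
    return None


print(recognize_problem(0.3333333333, 0.3333333334))  # None
print(recognize_problem(0.34, 0.3333) is not None)    # True
```

The choice of `places_that_matter` is a judgment about what matters to someone, which is why it is a parameter rather than something the code can decide for itself.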

This is why, in Rapid Software Testing, we say that an oracle is a means by which we recognize a problem when we encounter one in testing.

Further reading about oracles starting here.

This post was updated 2023-08-11 for formatting and a minor but important change to the last sentence.

Note that the formatting, as of this writing, is still off. I bet you can see that too. Did you apply a correctness criterion before you recognized the problem?

6 replies to “Oracles Are About Problems, Not Correctness”

  1. Link is not working in 7th dot “this remarkable article” – page not found

  2. I just recently encountered this Oracle definition “problem”; “Sometimes we don’t need to know the correct result to know that the observed result is wrong.” I was testing a tool that did complex rule based calculations based on complex algorithms based on particular input. A subject matter expert needed to provide the expected results which did not happen too quickly. In the meantime I was still able to run tests on the implementation based on reasonable assumptions and isolation and found at least two areas where the calculation was definitely incorrect (i.e. value increased but should have decreased or totals weren’t including all necessary inputs correctly). In both of these cases I did not know what correct result should be. By the time the Subject Matter Expert had provided their analysis where we had some result examples the programmers had already solved the two problems that I noticed. So, in my experience this is true. So, for my situation I had an oracle according to the RST definition but not too much by the Howden definition.

    On another note, would Howden’s definition be okay if you just deleted the two words “for correctness”?

    Michael replies: Howden’s definition wouldn’t fit with ours even with that change. First, to take a page from Boris Beizer, it’s not just about the output, but about the outcomes (note the plural) and all the things that happened along the way. Things have moved on from the old days when computers received input on punch cards and produced output on fanfold paper. Modern computers don’t simply produce output; you could say that they produce experiences, of which the output is a part. (The product produced the correct result, as checked by our automated oracle; but it took 19 seconds to do it, it had a bunch of superfluous digits after the decimal point, it came back in white on a white background, and it silently sent our username and password to Then there’s Howden’s odd phrasing “the use of testing requires the existence of an external mechanism…”; what’s with “the use of”? Howden couldn’t have anticipated that we would draw a sharp distinction between testing and checking; nonetheless, oracles are used in all testing contexts, not only for checking.

    Finally, Howden seems to suggest that an oracle must be a mechanism. For several years, James and I held that an oracle was “a principle or mechanism by which we recognize a problem”. Later we realized that a person (or a person’s opinion, or a person’s feelings) could allow us to recognize a problem. That is, some oracles might be applied immediately, internally and tacitly on the part of the tester; other oracles are media, getting in between the observer and the thing being observed. People aren’t principles, and we were uncomfortable with conflating people and mechanisms. We could have extended the list (principle, mechanism, person, opinion, feeling, reference document, comparable product…), but we settled for generalizing (“an oracle is a means of recognizing a problem when we encounter one during testing”) and then explicating the means.

    Thanks for the story. It’s a nice example that helps to emphasize that we might not need to know about correctness to recognize a problem.

  3. Oracles often operate separately from the system under test. Method post-conditions are commonly used as automated oracles in automated class testing. The oracle problem is often much harder than it seems, and involves solving problems related to controllability and observability.
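
As a rough illustration of a method post-condition used as an automated oracle, here is a minimal Python sketch. The `Stack` class and the helper `push_and_check` are hypothetical names made up for this example; the post-conditions checked (size grows by one, the pushed item is on top) are assumptions about what a stack's `push` ought to guarantee.

```python
from typing import Any, List


class Stack:
    """Hypothetical class under test."""

    def __init__(self) -> None:
        self._items: List[Any] = []

    def push(self, item: Any) -> None:
        self._items.append(item)

    def peek(self) -> Any:
        return self._items[-1]

    def size(self) -> int:
        return len(self._items)


def push_and_check(stack: Stack, item: Any) -> List[str]:
    """Push an item, then report any post-condition of push that is violated."""
    old_size = stack.size()
    stack.push(item)
    problems = []
    if stack.size() != old_size + 1:
        problems.append(f"size went from {old_size} to {stack.size()}")
    if stack.peek() is not item:
        problems.append("pushed item is not on top of the stack")
    return problems


print(push_and_check(Stack(), "a"))  # []
```

The check relies on `size` and `peek` to observe the stack's state, which hints at the observability issue mentioned above: a post-condition can only speak about what the class lets us see.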
