
Breaking the Test Case Addiction (Part 11)

In the previous post in this series, I made these claims about the audience for test reports:

  • They almost certainly don’t want to know about test case counts (although they might think they do).
  • They almost certainly don’t want to know about pass-fail ratios (although they might think they do).
  • They almost certainly don’t want to know about when the testing is going to be done (although they might think they do).

It’s far more likely that they want an answer to these questions:

What is the actual status of the product? Are there problems that threaten the value of the product? How do you—the tester—know? Do these problems threaten the on-time, successful completion of our work?

In this post, I’ll address the first two claims; I’ll leave the latter claim for next time.

They almost certainly don’t want to know about test case counts (although they might think they do).

Imagine asking a tester to test a cheap pocket calculator for you. We will call him “Eccles” (in honour of The Goon Show). You tell him your intentions for it: you would like to use it mostly to help you divide the bill for a group of friends at a restaurant, and for other everyday tasks.

Eccles disappears, and returns a few minutes later. You ask him if he has found any problems. He says No. You ask to see his results, and he shows you his two test cases:

Input: 1 + 1 Result: 2 (Pass)
Input: 2 + 2 Result: 4 (Pass)

You quite reasonably believe that Eccles’ testing is inadequate. You tell him that you want more test cases. He listens, appears to understand the problem, and nods. He disappears again, and considerably later he returns, telling you that he has run 100 test cases—50 times more than the first time! And he has carefully documented the results:

Input: 1 + 1 Result: 2 (Pass)
Input: 2 + 2 Result: 4 (Pass)
Input: 3 + 3 Result: 6 (Pass)
Input: 4 + 4 Result: 8 (Pass)
Input: 5 + 5 Result: 10 (Pass)
Input: 6 + 6 Result: 12 (Pass)
Input: 7 + 7 Result: 14 (Pass)
Input: 8 + 8 Result: 16 (Pass)
Input: 9 + 9 Result: 18 (Pass)
Input: 10 + 10 Result: 20 (Pass)
Input: 11 + 11 Result: 22 (Pass)
…
Input: 99 + 99 Result: 198 (Pass)
Input: 100 + 100 Result: 200 (Pass)

To the degree that more is better here, it’s not very much better.
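If it helps to make that concrete, here’s a minimal sketch in Python (using pytest; calculator_add is a hypothetical stand-in for pressing the calculator’s buttons, since we can’t drive the real device from code) showing that Eccles’ hundred documented cases amount to a single check run a hundred times with different numbers:

    import pytest

    # Hypothetical stand-in for the pocket calculator's addition.
    def calculator_add(a, b):
        return a + b

    # One check, run a hundred times: "100 test cases" by the usual accounting.
    @pytest.mark.parametrize("n", range(1, 101))
    def test_doubling(n):
        assert calculator_add(n, n) == 2 * n

The case count went up fiftyfold; what we stand to learn barely changed.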

The trouble, of course, is that a count doesn’t mean anything without context. What aspects of the product are being tested? Has the testing been limited to only mathematical functions within the product? If so, has the tester at least given some coverage to all of them—and if not, which ones has the tester not covered—and why not? Has the tester considered other things that could diminish, damage, or destroy the value of the product? Has the tester considered performance and reliability? Has the tester considered the different people who might use the product, and the ways in which they might use the product in the real world?

Testing is the process of evaluating a product by learning about it through experiencing, exploring, and experimenting, which includes to some degree questioning, studying, modeling, observation, inference, sensemaking, risk analysis, critical thinking—and many other things too. A test is an instance of testing. Not all tests are equal in terms of effort, time, skill, scope, risk focus,…

Test cases tend to represent things that are easy to describe about a test: directly observable behaviour that can be described or encoded explicitly, and observable and describable outputs. Test cases both assume and ignore tacit knowledge.

But neither tests nor test cases are commensurate—that is, they cannot be counted as though they were equivalent units—so “test case” is not a valid unit of measurement.

  • From one case to another, test cases vary widely in scope, in coverage, in cost, in risk focus, and in value.
  • The design of a test case is subjective, based at least to some degree on the mental models and mindset of individual testers.
  • Test cases involve different test techniques.
  • Test cases are not independent; the outcome of one might influence the outcome of another.
  • Test cases are not interchangeable. They’re different, depending on the feature, function, data, and product in front of us.
  • Test cases do not—and cannot—capture all the testing work that occurs, such as learning, conjecture, discoveries, bug investigation, and so forth.
  • Test cases don’t even capture the work of designing the test cases, nor of analyzing the results!
  • And finally… testers often don’t follow the test cases anyway—and certainly not in the same way every time! A test is a performance, and a test case is like a script and stage directions for that performance. As with actors working from a script, the performance will vary from tester to tester, and from time to time.

Note that none of these things is necessarily a problem. Indeed, in testing, there’s considerable value in variation and variability. Bugs aren’t all the same, and they’re not always in the same place. There is a big problem in trying to treat test cases as equivalent for the purposes of counting them. (I’ve talked about that many times before, including here, and here.)

Now, there is at least one argument in favour of test cases:

Perhaps someone wants to verify that a specific procedure can be followed, with specific preconditions and specific inputs, in order to show that the procedure and inputs will produce a specific result. And, in fact, perhaps that procedure — or some part of it at least — can be automated.

That’s okay, although there are at least two problems to consider. First, all that specification tends to take time and effort, which can be costly, and which can swamp the value of what we might learn from following the procedure. Second, demonstrating that something can work based on specific procedures and inputs doesn’t mean that it will work. A variation in the procedure, or the conditions, or the inputs will result in different output. And even holding the conditions and the procedure steady, obtaining the correct output might result in an outcome that is terribly wrong in some sense.

Perhaps someone wants certain conditions to be identified and covered. If that’s true, identify those conditions and cover them. There are plenty of ways to do that without over-formalizing or over-proceduralizing the testing work.

Consider

  • noting those conditions in guidance for human interaction with the product;
  • reviewing existing logs or records to see if those conditions have been covered, and if not, cover them; or
  • creating automated low- or middle-level checks for those conditions.
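For that last option, here’s a minimal sketch of what one such low-level check might look like, in Python with pytest; the divide function and its error behaviour are hypothetical, for illustration only:

    import pytest

    # Hypothetical low-level function whose divide-by-zero behaviour someone wants covered.
    def divide(a, b):
        if b == 0:
            raise ZeroDivisionError("cannot divide by zero")
        return a / b

    # A low-level check for the identified condition: dividing by zero
    # should raise an error rather than return a bogus value.
    def test_divide_by_zero_is_reported():
        with pytest.raises(ZeroDivisionError):
            divide(12, 0)

The point is to cover the identified condition cheaply and close to the code, not to wrap it in a heavyweight, formally scripted test case.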

Over 50 years ago, Jerry Weinberg wrote this passage:

One of the lessons to be learned from such experiences is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test really does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.

Leeds and Weinberg, Computer Programming Fundamentals: Based on the IBM System/360, 1970

So, consider thinking in terms of testing, rather than test cases. And if you are applying test cases, please don’t count them. And if you count them, please don’t believe that the count means anything.

They almost certainly don’t want to know about pass-fail ratios (although they might think they do).

If a test case count is not a valid measure of test coverage, then a ratio derived from that count is invalid too, whether it’s used to evaluate the quality of the product or the quality of the testing. I’ve heard tell of organizations that have a policy that says “when 97% of the test cases pass, the product is ready for shipping”. It shouldn’t take long to see the foolishness of this policy; it’s like a doctor saying that when 97% of the data points in your medical checkup indicate no problem, you’re healthy.

Just as “the sheer number of tests performed is of little significance in itself”, the ratio of passing tests to failing ones is both insignificant and easy to game. Insignificant, because a product can be passing all of the tests that we’ve performed so far and still have terrible problems. Also insignificant, because a product can fail to pass hundreds of tests—but if those tests are outdated, inconsequential, overly precise, or otherwise irrelevant, there’s no problem. Easy to game, because if you want to make the product look better than it is, it’s a simple matter to perform more passing tests.
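(To make the “easy to game” part concrete, with made-up numbers: 118 failures out of 5,163 checks is a 97.7% pass rate; add 2,000 trivially passing checks, fix nothing, and the rate climbs to about 98.4%.)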

The point of testing is not to provide a pat on the head for the product; the point is to evaluate its true status, and to identify problems that threaten the value of the product to people who matter—to the users or customers of the software, or to anyone affected by it; to the support organization; to the operations people; and, ultimately, to the business.

Several years ago, a participant in one of my Rapid Software Testing classes approached me after I had mentioned this 97% pass rate business (which I’ll call 97PR henceforth). He said, “It’s funny you should mention it. I’ve worked at two companies where they used that measure to decide when to ship.”

“Really?” I replied. “Do you mind me asking—which ones?”

“Well,” he said. “One was Nortel.” I winced; Nortel was a huge Canadian success story until all of a sudden it wasn’t. “The other,” he said, “was RIM—Research in Motion. The Blackberry people.” I winced again.

Was 97PR responsible for the demise of these two companies? Probably not—certainly not directly. But to me, the 97PR suggests a company where engineering has been reduced to scorekeeping. If you want to fool people about something, providing numbers without context is a great way to do it. And if you want other people to fool you, ask for numbers without context.

For the calculator example above, what would a better test report look like? Here’s what I might offer:

“I’ve tested the calculator for basic math operations that seem likely to be important in calculating restaurant cheques: addition, multiplication, subtraction, and division. I imagined that you would want to do this for groups of up to a dozen people. I did a handful of variations of each math operation, up to the limits of what the calculator’s display supports, including stuff like dividing by zero. Beware, because if you do that by accident, you’ll lose what you’ve entered so far. (Aside: Windows Calculator loses the operations before a divide-by-zero too.) I took notes, if you want to see them.”

The client, of course, could stop me at any time. What if she didn’t? What would a deeper test report look like? Given some time, I might offer this:

“I tested the memory-store and memory-recall functions, too, and didn’t observe any problems. Even though they’re present as buttons on the calculator, I didn’t bother to test the higher-order math functions like squares, square roots, and trigonometric functions, since I reckoned you wouldn’t need those for restaurant bills and I didn’t want to waste your time by testing them. But if you want me to, I can.

“The buttons provide haptic feedback, so it’s easy to tell when they’ve been pressed, and there’s no key-repeat function, so it’s easier to avoid accidental double keypresses on this calculator than it is on others. I looked at it in low-light conditions; its LCD screen may be a little hard to see in a dark restaurant. It’s solar-powered, and there’s a feature that turns it off after five minutes. In that case, it forgets whatever data you’ve entered.

“I dumped some water on the keypad, and it continued to perform without any problems. After I immersed it in a glass of water, though, I had to let it dry for a couple of days before it started working again, but it now seems to be working just fine.”

Yes; all that takes quite a bit longer to say—or to write—than “We’ve run 5163 tests, and of those, 118 are failing, for a pass rate of 97.7 per cent.” It’s also more informative—by a country mile—about the quality of the product and the quality of the testing.

So what do you do when a manager asks for test case counts or pass-fail ratios? Here’s a reply from James Bach: “I’m sorry, but misleading you is not a service that I offer.” Consider offering a three-part testing story instead.

We’ll get to that last claim about a test report’s audience (they almost certainly don’t want to know about when the testing is going to be done (although they might think they do)) in the next and final post in this all-too-long series.
