I Might Be Wrong (But Not For Me)

Jerry Weinberg tells a story (yes, it’s me; I’m telling yet another Jerry Weinberg story) of meeting an old friend who looked distraught.

“What’s the matter?” Jerry asked.

The fellow replied, “Well, I’m kind of shellshocked. My wife just left me.”

“Was that a surprise?”

“Yes, it really was,” the fellow said. “I mean, we had had some problems, but I thought they were all settled.”

Jerry paused for a moment. Then he said, “Nothing is ever settled.”

Several years after hearing that story I recognized its power as a general systems law. Obviously, I didn’t discover it, but I did name it. I call it “The Unsettling Rule”: Nothing is ever settled.

In Lessons Learned in Software Testing by Kaner, Bach, and Pettichord, Lesson 145 is “Use the IEEE Standard 829 for Test Documentation”. Lesson 146, on the facing page, is “Don’t Use the IEEE Standard 829”. When the book was published, some reviewers said “What’s the problem with these guys? They can’t even get it together to tell a consistent story!” Others, including me, thought that this pair of pages in particular was wonderful. It underscored the degree to which issues in the world of software testing are not settled, the degree to which our craft is a long dialogue in which there are many voices to be heard, many options to be discussed, and many contexts to be considered.

The difference between the context-driven school (or approach; there’s now apparently disagreement about whether it’s a school or an approach!) and other schools/approaches is that these disagreements can get aired in public. There are some fundamental principles on which we agree, and there are some other things on which we don’t agree. Whatever else happens, in this community, we try to make sure that there’s no fake consensus. This is alarming and disturbing, sometimes, to some people, and it can be stressful to the participants. But when it comes up, it’s a hallmark of our community that we try to deal with it. It helps to keep us sharp, and it helps to keep us honest.

Recently I wrote a blog post in which I took the position that the often-used pass-vs.-fail ratio is an invalid and misleading measurement. To summarize the post, I said, “At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional… The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.”

I recognize that, coming from someone who claims to be context-driven, that’s pretty extreme stuff. Yet, in its form, it’s consistent with one of those pages or the other in Lessons Learned in Software Testing (with some omissions, which I’ll address shortly). It is also consistent with a set of principles that James Bach and I espouse as part of our Rapid Software Testing class:

We will not knowingly or negligently mislead our clients and colleagues. This ethical premise drives a lot of the structure of Rapid Software Testing. Testers are frequently the target of well-meaning but unreasonable or ignorant requests by their clients. We may be asked to suppress bad news, to create test documentation that we have no intention of using, or to produce invalid metrics to measure progress. We must politely but firmly resist such requests unless, in our judgment, they serve the better interests of our clients. At minimum we must advise our clients of the impact of any task or mode of working that prevents us from testing, or creates a false impression of the testing.

To me, that statement is both in tension with and consistent with several of the principles of the context-driven school, the first and second (“The value of any practice depends on its context” and “There are good practices in context, but there are no best practices”) and the seventh (“Only through judgment and skill, exercised cooperatively throughout the entire project, are we able to do the right things at the right times to effectively test our products.”)

Pass-vs.-fail ratios, to me, fly in the face of one of the “principles in action” listed at “Metrics that are not valid are dangerous.”

Cem Kaner disagrees with the position expressed in my post. It seems to me that Cem’s disagreement hangs on the degree of danger and our reactions to it. I hold that in practical contexts, pass-vs.-fail ratios are so dangerous that for almost all cases, they cross over the line into “unethical”, like giving the car keys to someone who is obviously drunk, or like planting land mines near a community well, even though in some rare contexts, such things could be done in good faith and without harm. Cem’s position seems to be (and I welcome correction, if it’s warranted) that although pass-vs.-fail ratios are exemplary of dangerous metrics, they’re not unethical.

Let’s start with two points that I’d like to make about the “unethical” label. One is that my ethical sense is personal, and so are the views posted on my blog. Although I’m happy when other people share them, unless otherwise stated, I don’t represent the view of any community, including my own. I don’t make claims to universal ethics. Second, Cem refers to “using the accusation of unethical as a way of shutting down discussion of whether an idea (unethical!) was any good or not.” I’m not using it that way. I have no intention whatsoever of shutting down debate (as if I could in any case!). Unless claimed otherwise, I am stating personal principles; not Right and Wrong, but right and wrong for me. I don’t know of any agency (other than society) who can make claims of Right or Wrong, and even then claims seem always context-specific.

Whether pass-vs.-fail ratios are wrong or Wrong, they’re certainly wrong for me, wrong enough that I’m uncomfortable with using them on the job. I’m sufficiently uncomfortable that I’m usually going to decline to provide them, just as I would not accept a job in which I was obliged to shoot people. Other people might choose to become mercenaries or to go to war for their countries; I’d be a conscientious objector. That wrongness is relative too, of course. It’s subject to the Relative Rule: any abstract X is X to some person, at some time. I can only warrant my own ethical stance for the moment. My position on some issues has changed over the years, courtesy of some pleasant and unpleasant experiences. I’m not currently aware of things that might cause my stand to change in the future, but I have to leave the possibility open.

So, is providing pass vs. fail rates unethical? On reflection, I have to say reluctantly, yeah, I think so; not absolutely, but in most practical circumstances. For me, the crucial test is in the last of Cem’s questions about ethics: “Are you helping someone else lie, cheat, steal, intimidate, or cause harm?” My answer is that I see a great deal of risk (and admittedly risk is only potential harm) that I will be aiding the client in some form of oppression or deception, either to himself or to his superiors. (The latter is a situation that I have been in before, with pass-vs.-fail ratios at the centre of the story in a project associated with a $33 million loss.) Most of the time, providing pass-vs.-fail ratios is a test activity that I would stop immediately, using the “mission rejected” stopping heuristic (one that I hadn’t noted until Cem himself pointed it out).

Cem doesn’t provide any contexts in which pass-vs.-fail ratios might be useful, but as a context-driven tester, it’s my obligation to accept his critique and his challenge, and consider some contexts in which I might use them. (This is the omission from my post that I mentioned above, and it’s the way that the controversy was handled in Lessons Learned: with a serving of context.) I present them in order from the least plausible to the most plausible.

“Your daughter will die” or “we’ll shoot this dog.” If someone employs a threat of harm to some person or being or something of value, I have to evaluate the relative damage afforded by providing the measure or not.

When mandated by force of law. If I were on the witness stand, and a lawyer asked me, “What were the pass-vs.-fail ratios at release time for this project,” I’d be required by law to respond. I can imagine a likely way it would play out, too: “92.7%, but I’d also like to make it clear that—” “No further questions, Your Honour.”

If I provided the data with all of the appropriate disclaimers AND I were sure that the disclaimer would be heard. If the client (and the client’s client, and so forth) were to relay the data and the disclaimer reliably to the point where the data would be used, I might be persuaded to provide the data. But I’d have to weigh that against the risk that I was wrong about the disclaimer being heard. Moreover, in my professional judgement, it would be wasting my client(s)’s time.

As a placebo. I might give a pass vs. fail ratio long enough to convince my client that it’s not helpful or necessary, while doing other things to test well and provide her with other forms of reliable information. I’d remain pretty uncomfortable with dispensing the sugar pills, though, and would work at ways of getting around it.

In the course of demonstrating that pass-vs.-fail ratios are a bad idea. In some contexts, pass-vs.-fail ratios provide what Kirk and Miller call quixotic reliability. That is, the measurement seems to correlate with other measurements of the state of the project. I might provide pass-vs.-fail ratios long enough to show a divergence between that data and other measures of project or product health.

If I were aware that the person receiving the data was in possession of all the contextual information that I believe they needed to put it to appropriate and non-harmful use. We use this in one of the exercises in our class, based on a bug from an actual product. We present a very specific set of tests that are the same in every material way but for two variables. The total domain space to put these variables in combination is a set with 2304 elements. When used in a test that covers all of these elements, 510 provide a “fail” result. All of the test cases are of the same kind, and our students know that those test cases are comparable for the purposes that they’re considering. In that case, that kind of ratio in that kind of context has some value in describing that kind of coverage. So there might be some pedagogical or rhetorical value to reporting a pass-vs.-fail ratio there. Interestingly, the root of the problem is a data type problem in a single line of code. That helps to illuminate the discussion of “one bug or 510?” which in turn illuminates how bug counts and failure counts aren’t well correlated. It also helps to illuminate opportunity cost in paying overmuch attention to this problem when there are many other things that we might test.
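
As a rough illustration of the arithmetic in that exercise, here is a minimal Python sketch. The figures (2304 cases, 510 failures, one root cause) come from the description above; the helper name `pass_fail_summary` and its output format are hypothetical, invented only for this example, and are not part of the class materials.

```python
from fractions import Fraction

# Figures taken from the exercise described above.
TOTAL_CASES = 2304    # every combination of the two variables
FAILING_CASES = 510   # combinations that produced a "fail" result
ROOT_CAUSES = 1       # one data type problem in a single line of code


def pass_fail_summary(total: int, failed: int) -> dict:
    """Hypothetical helper: summarize comparable test cases as counts and ratios.

    Assumes failed > 0; a suite with no failures has no pass-to-fail ratio.
    """
    passed = total - failed
    return {
        "passed": passed,
        "failed": failed,
        "fail_rate": failed / total,
        "pass_to_fail": Fraction(passed, failed),
    }


summary = pass_fail_summary(TOTAL_CASES, FAILING_CASES)
print(f"{summary['passed']} passed, {summary['failed']} failed")
print(f"fail rate: {summary['fail_rate']:.1%}")
print(f"failing cases per root cause: {FAILING_CASES // ROOT_CAUSES}")
```

The ratio is only interpretable here because all 2304 cases are of the same kind and the reader knows it. The last line makes the “one bug or 510?” point: the failure count alone says nothing about how many problems sit behind it.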

To me, the real challenge is in coming up with a case in which this invalid, dangerous metric in its most common applications might be used for good. In the contexts where they’re commonly discussed and used—overwhelmingly commonly, in my view—pass-vs.-fail ratios are used to express the quality of testing, the health of the project, or the readiness of the product. In those contexts, the risk of misuse, whether intentional or inadvertent, is high—like placing a loaded gun with the safety off in a crowded subway car. As I’ve heard Cem say before, “I’d like to call them an Industry Worst Practice, but being context-driven, I can’t.” Once again, Cem has reminded me of why I can’t commit to the “unethical” charge absolutely and in all cases. He’s provided me with a challenge and an opportunity to sharpen my analysis, and I thank him for that.

Postscript, March 28, 2012: In private correspondence and conversation, Cem suggested a different interpretation of a paragraph from this post that I quoted above to provide context for this post. In order to ward off that interpretation, here’s how I might write that paragraph today:

“The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to this invalid number without additional information means reducing the product’s relationship with people to this invalid number. By extension when this invalid number is being used to evaluate people, that means reducing people to this invalid number too. So to irresponsible, unethical, and unprofessional, in this case we could add unscientific and inhumane.”

To be clear: these two posts have not been a blanket condemnation of all measurement, but of a particular metric that fails spectacularly when subjected to the tests of construct validity and reasonable and foreseeable side effects in Kaner and Bond’s Software Engineering Metrics: What Do They Measure and How Do We Know? Pass vs. fail is not an imperfect metric; this is a metric that has no discernible construct validity to me (or even to Cem). I’ve both experienced and seen pain and systematic deception with this metric at the centre of it. In this, it’s not like imperfect financial figures that are generated by legitimate companies subject to scrutiny by regulators, by auditors, by shareholders, and by markets. It’s more like financial forecasting data dreamed up by Bernie Madoff. I don’t mind dealing with imperfect but plausibly valid information; that’s all a tester ever gets to do, really. But if Bernie Madoff were to ask me to lend my credibility to his models, data, or business practices, I’d feel personally bound to decline that particular request.

7 replies to “I Might Be Wrong (But Not For Me)”

  1. Hi Michael

    Appreciate the “To me, the real challenge is in coming up with a case in which this invalid, dangerous metric in its most common applications might be used for good.” It is indeed a challenge to me as well.

    Can you elaborate based on “There are good practices in context, but there are no best practices” – is there for a given (testing) practice, a context/community where the practice is good (enough for them)?

    Michael replies: If I understand your question, there appear to be a number of people on the LinkedIn forum “Software Testing and Quality Assurance” who say that the practice is good enough for them.

    I’m at theoretically yes, but realistically maybe – but I might be wrong

    Me too. 🙂

  2. Michael,

    What resonated with me on this post was your last sentence:

    “He’s provided me with a challenge and an opportunity to sharpen my analysis, and I thank him for that”

    Recently I wrote a post about ‘jumping to conclusions’ where I talk about questions and conclusions. You posted a comment suggesting inferences was a better standpoint.

    I’m glad you wrote that, you made a valid point and it helped my understanding better.

    I’m glad there’s people out there willing to test and challenge my ideas and my thinking and help me create a deeper understanding of what I do.

    Michael replies: Me too. 🙂 I believe we should cultivate relationships with people like that.

  3. Interesting post, but I’d like to pose the question – couldn’t a pass-fail ratio be used along with a number of other metrics to support a particular snapshot of project / testing / whatever progress or state?

    Without doubt, using pass-fail as a single or even key criteria to evaluate how a project is faring is not a particularly bright thing to do (speaking from experience). But using such data to provide a certain angle may hold some value, however large or small.

    I guess this kind of falls under your condition where the stakeholder has all of the contextual information available to use pass-fail accordingly, but I’d like to think that many (if not most) “people in power” do use this data with a lot of other facts at their finger-tips and that the risk of stakeholders “doing themselves a harm” is reasonably low as a result. That said, I could be somewhat overly optimistic about that…

  4. Hi Michael,

    I’m a software tester, and we use pass/fail statistics on our project a lot. There are 2 main flaws I see in them: they don’t evaluate the importance of the scenarios in relation to one another, and they don’t say anything about scenarios that should be covered, but are not. Do you agree with this? Are there any additional flaws you see?

