What actually happens when a check returns a “red” result?
Some people might reflexively say “Easy: we get a red; we fix the bug.” Yet that statement is too simplistic, concealing a good deal of what really goes on. The other day, James Bach and I transpected on the process. Although it’s not the same in every case, we think that for responsible testers, the process actually goes something more like this:
First, we ask, “Is the check really returning a red?” The check provides us with a result which signals some kind of information, but by design the check hides lots of information too. The key here is that we want to see the problem for ourselves and apply human sensemaking to the result and to the possibility of a real problem.
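Before going further, it may help to make “check” concrete. Here is a minimal sketch of the kind of thing we mean: a machine-decidable rule that compresses an observation into a binary verdict. The function name and the 200 ms limit are invented for illustration, not taken from any particular product.

```python
# A "check" in the sense used here: a machine-decidable rule that compresses
# an observation into a binary verdict. Names and the 200 ms limit are
# hypothetical, chosen only to illustrate the idea.

def response_time_check(measured_ms, limit_ms=200.0):
    """Return 'green' if the measured response time is within the limit, else 'red'."""
    return "green" if measured_ms <= limit_ms else "red"

# The verdict alone doesn't say how far over the limit we were, what the
# system was doing at the time, or whether the measurement itself was sound;
# by design, the check hides all of that.
print(response_time_check(187.3))  # green
print(response_time_check(212.9))  # red -- but red for what reason?
```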
Sensemaking is not a trivial subject. Karl Weick, in Sensemaking in Organizations, identifies seven elements of sensemaking, saying it is:
- grounded in identity construction (which means that making sense of something is embedded in a set of “who-am-I-and-what-am-I-doing here?” questions);
- social (meaning that “human thinking and social functioning are essential aspects of each other”, and that making sense of something tends to be oriented towards sharing the meanings);
- ongoing (meaning that it’s happening all the time, continuously; yet it’s…)
- retrospective (meaning that it’s based on “what happened” or “what just happened?”; even though it’s happening in the present, it’s about things that have happened in the past, however recent that might be);
- enactive of sensible environments (meaning that sensemaking is part of a process in which we try to make the world a more understandable place);
- based on plausibility, rather than accuracy (meaning that when people make sense of something, they tend to rely on heuristics, rather than things that are absolutely 100% guaranteed to be correct);
- focused on extracted cues (extracted cues are simple, familiar bits of information that lead to a larger sense of what is occurring, like “Flimsy!->Won’t last!” or “Shouting, with furrowed brow!->Angry!” or “Check returns red!->Problem!”).
The reason that we need to apply sensemaking is that it’s never clear that a check is signaling an actual problem in the product. Maybe there’s a problem in the instrumentation, or a mistake in the programming of the check. So when we see a “red” result, we try to make sense of it by seeking more information (or examining other extracted cues, as Weick might say).
If we’re convinced that the check is really red, we then ask “where is the trouble?” The trouble might be in the product or in the check.
- We might inspect the state of our instrumentation, to make sure that all of the equipment is in place and set up correctly.
- We might work our way back through the records produced by the check, tracing through log files for indications of behaviours and changes of state, and possible causes for them.
- We might perform the check slowly, step by step, observing more closely to see where things went awry. We might step through the code in the debugger, or perform a procedure interactively instead of delegating the activity to the machinery.
- We might perform the check with different values, to assess the extents or limits of the problem.
- We might perform the check using different pacing or different sequences of actions to see if time is a factor.
- We might perform the check on other platforms, to see if the check is revealing a problem of narrow or more general scope. (Qualitative researchers might call this a search for synchronic reliability: could the same thing happen at the same time in different places?)
- Maybe the check is archaic, checking for some condition that is no longer relevant, and we don’t need it any more.
- Maybe the check is one of several that are still relevant, but this specific check is wrong in some respect. Perhaps something that used to be permitted is now forbidden, or vice versa.
- When the check returns a binary result based on a range of possible results, we might ask “is the result within a tolerable range?” To do that, we might have to revisit our notions of what is tolerable. Perhaps the output deviated from the range insignificantly, or momentarily; that is, the check may be too restrictive or too fussy. (See the sketch after this list.)
- Maybe the check has not been set up with explicit pass/fail criteria, but to alert us about some potentially interesting condition that is not necessarily a failure. In this case, the check doesn’t represent a problem per se, but rather a trigger for investigation.
- We might look outside of the narrow scope of the check to see if there’s something important that the check has overlooked. We might do this interactively, or by applying different checks.
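A couple of the items above (trying the check with different values, and revisiting what counts as a tolerable range) are easy to picture in code. The sketch below is hypothetical; the product behaviour, names, and tolerances are all invented for illustration.

```python
# Hypothetical sketch: probing a red result by re-running the check over a
# range of input values, and by treating "pass" as a tolerance band rather
# than a single hard threshold. All names and numbers are invented.

def measure(value):
    # Stand-in for whatever the product actually does; purely illustrative.
    return value * 1.02  # pretend the product overshoots by two percent

def check_with_tolerance(expected, actual, tolerance=0.05):
    """Green if actual is within +/- tolerance (proportional) of expected."""
    return "green" if abs(actual - expected) <= tolerance * abs(expected) else "red"

# Sweeping a range of inputs helps show whether the problem is narrow or
# general; adjusting the tolerance helps show whether the deviation matters
# or whether the check is simply too fussy.
for expected in (10, 100, 1000, 10000):
    actual = measure(expected)
    print(expected, round(actual, 2), check_with_tolerance(expected, actual, tolerance=0.01))
```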
Next, if the check appears to be producing a result that makes sense—the check is accurately identifying a condition that we programmed it to identify—it might be easy to conclude that there’s a bug, and now it’s time to fix it. But we’re not done, because although the check is pointing to an inconsistency between the actual state of the product and some programmed result, there’s yet another decision to be made: is that inconsistency a problem with respect to something that someone desires? In other words, does that inconsistency matter?
To put it another way: after making an observation and concluding that it fits the facts, we might choose to apply our tacit and explicit oracles to make a different sense of the outcome. Rather than concluding “The product doesn’t work the way we wanted it to”, we may realize that we didn’t want the product to do that after all. Or we might repair the outcome (as Harry Collins would put it) by saying, “That check sometimes appears to fail when it doesn’t; just ignore it” or “Oh… well, probably this thing happened… I bet that’s what it was… don’t bother to investigate.”
In the process of developing the check, we were testing (evaluating the product by learning about it through exploration and experimentation). The check itself happens mechanically, algorithmically. As it does so, it projects a complex, multi-dimensional space down to a single-dimensional result, “red” or “green”. In order to make good use of that result, we must unpack the projection. After a red result, the check turns into the centre of a test as we hover over it and examine it. In other words, the red check result typically prompts us to start testing again.
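One way to make that unpacking easier is to have the check record the richer observation alongside the one-bit verdict, so that a red result arrives with some of its context attached. This is a sketch under assumed names (the check, the log file, and the example data are all invented), not a prescription.

```python
# Sketch of a check that records the observation it is projecting down to
# red/green, so that a red result arrives with some of its context attached.
# The names, the log file, and the example check are all illustrative.
import json
import time

def run_check(name, observe, passes):
    observation = observe()                    # the rich, multi-dimensional part
    verdict = "green" if passes(observation) else "red"
    record = {"check": name, "time": time.time(),
              "verdict": verdict, "observation": observation}
    with open("check_log.jsonl", "a") as log:  # keep the context, not just the bit
        log.write(json.dumps(record) + "\n")
    return verdict

# The verdict compresses the observation; the log preserves it for unpacking.
verdict = run_check(
    "login_latency",
    observe=lambda: {"latency_ms": 231, "http_status": 200, "retries": 1},
    passes=lambda obs: obs["latency_ms"] <= 200 and obs["http_status"] == 200,
)
print(verdict)  # red, with the supporting observation recorded in the log
```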
That’s what usually happens when a check returns a “red” result. What happens when it returns nothing but “green” results?
Of course an all green test result means that the product manifests its theoretical maximum value to all affected users. The product has 0 bugs. If a million expert testers spent a million years investigating the product they would not, could not, find a single bug.
Michael replies: Indeed. Well, we’ll get to that in another exciting installment.
All green test results means that the semaphore is broken =)
Michael replies: That’s true of checks that are falsely running green—isn’t that what you mean?