Testing Deep and Shallow (2): “Shallow” is a feature, not an insult!

When we talk about deep and shallow testing in the Rapid Software Testing namespace, some people might assume that we mean “deep testing” is good and decent and honourable, and that we mean “shallow” to be an insult. But we don’t. “Shallow” is not an insult.

Depth and shallowness are ways of talking about the thoroughness of testing, but they’re not assessments of its value. The value or quality or appropriateness of thoroughness can only be decided in context. Shallow testing can be ideal for some purposes, and deep testing can be pathological. How so? Let’s start by getting clear on what we do mean.

Shallow testing is testing that has a chance of finding every easy bug.

“Shallow testing” is not an insult! Shallow doesn’t mean “slapdash”, and shallow doesn’t mean “sloppy”.

Both shallow testing and finding easy bugs are good things. We want to find bugs—especially easy bugs—as quickly and as efficiently as possible, and shallow testing has a chance of finding them. Shallow testing affords some coverage, typically in specific areas of the product. In lots of contexts, the fact that shallow testing isn’t deep is a feature, not a bug.

Here’s a form of shallow testing: TDD-style checks. When developers design and implement TDD checks, the goal is not to test the product deeply. The goal is to make efficient, incremental progress in building a function or a feature. Each new check provides a quick indication that the new code does what the programmer intended it to do. Re-running the existing suite of checks provides a developer with some degree of confidence that the new code hasn’t introduced easy-to-find problems.

TDD makes rapid progress possible by focusing the programmer on experimenting with the design and writing code efficiently. That effort is backed with simple, quick, first-order output checks. For the purpose of getting a new feature built, that’s perfectly reasonable and responsible.
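To make that concrete, here is a minimal sketch of the kind of first-order output checks a developer might write while test-driving a small function. The function `apply_discount` and its rules are hypothetical, invented purely for illustration. The checks are written as plain `test_` functions, on the assumption that a runner such as pytest would collect them.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under development: reduce price by percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_ten_percent_off():
    # Quick check of the main intended behaviour.
    assert apply_discount(200.0, 10) == 180.0


def test_zero_percent_leaves_price_unchanged():
    assert apply_discount(59.99, 0) == 59.99


def test_rejects_percent_over_100():
    # The function should refuse an out-of-range percentage.
    try:
        apply_discount(10.0, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")
```

Checks like these run in milliseconds and say nothing about currency rounding rules, concurrency, or performance under load. That narrowness is what makes them cheap enough to run on every change.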

When I’m writing code, I don’t want to do challenging, long-sequence, thorough experiments that probe a lot of different coverage areas every time I change the danged code. Neither do you. TDD checks aren’t typically targeted towards testing for security and usability and performance and compatibility and installability risks. If they were, TDD would be intolerably slow and ponderous, and running the checks would take ages.

Checking of this nature is appropriately and responsibly quick, inexpensive, and just thorough enough, allowing the developers to make reliable progress without disrupting development work too much. The idea is to find easy bugs at the coal face, applying relatively little effort that affords maximum momentum. That speed and ease is absolutely a feature of shallow testing. And not a bug.

Shallow testing is also something that testers must do in their early encounters with the product, because there is no means to teleport a tester to deep testing right away.

A developer builds her mental models of the product as part of the process of building it. The tester doesn’t have that insider’s, builder’s perspective. The absence of that perspective is both a feature and a bug. It’s a feature because the tester is seeing the product with fresh eyes, which can be helpful for identifying problems and risks. It’s a bug, because the tester must go through a stage of learning, necessary confusion, and bootstrapping to learn about the product.

The Bootstrap Conjecture suggests that any process that is eventually done well and efficiently started off by being done poorly and inefficiently; any process focused on trying to get things right the first time will be successful only if it’s trivial or lucky.

In early encounters with a product, a tester performs shallow testing—testing that has a chance of finding every easy bug. That affords the opportunity to learn the product, while absolving the tester of an obligation to try to get to deep testing too early.

So what is deep testing?

Deep testing is testing that maximizes the chance of finding every elusive bug that matters.

That needs some unpacking.

First, “maximize”. No testing, and no form of testing, can guarantee that we’ll find every bug. (Note that in Rapid Software Testing, a bug is anything about the product that might threaten its value to some person who matters.)

It’s a commonplace maxim that complete testing is impossible: we can’t enter every possible set of inputs; examine every possible set of outputs; exercise every function in the product, in every possible sequence, with every possible variation of timing, on every possible platform, in every possible machine state that we can’t completely control anyway.

Given that we’re dealing with an infinite, intractable, multi-dimensional test space, testing skill matters, but some degree of luck inevitably plays a role. We can only strive to maximize our chances of finding bugs, because bugs are to some degree elusive. Bugs can be subtle, hidden, rare, intermittent, or emergent.

Some bugs are subtle, based on poorly understood aspects of programming languages, or surprising behavior of technologies.

Some bugs are hidden in complex or obscure or old code. Some bugs are hidden in code that we didn’t write, but that we’re calling in a library or framework or operating system.

Some bugs are rare, dependent on specific sets of unusual conditions, or triggered by code encountering particular data, or exclusive to specific platforms.

Some bugs are intermittent, only manifesting infrequently, when the system is in a particular state.

Perhaps most significantly, some bugs are emergent. All of the components in a product might be fine in isolation, but the overall system has problems when elements of it are combined. A shared library, developed internally, that supports one product might clobber functions in another. A product that renders fine on one browser might run afoul of different implementations of standards on another.

Just today, I got mail from a Mac user friend that I’m sure looked fine on his machine; it doesn’t get rendered properly under Windows Outlook. A product that performs fine in the lab can be subject to weird timing problems when network latency comes into play, or when lots of people are using the system at the same time.

Time can be a factor, too. One classic case is the Y2K problem; storing the year component of a date in a two-digit field wouldn’t have looked like much of a problem in 1970, when storage was expensive and people didn’t foresee that the system might still be in use a generation later. Programs that ran just fine on single-tasking 8086 processors encountered problems when run in the supposedly compatible virtual 8086 mode on 80386 and later processors.
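The two-digit year problem can be sketched in a few lines. The function `years_between` is hypothetical and is not taken from any real system; it simply assumes that only the last two digits of each year are stored.

```python
def years_between(start_yy: int, end_yy: int) -> int:
    """Elapsed years, given only the last two digits of each year."""
    return end_yy - start_yy


# An account opened in 1970 and examined in 1999: the answer looks right.
print(years_between(70, 99))  # 29

# The same account examined in 2000, stored as 00: the answer is nonsense.
print(years_between(70, 0))   # -70
```

Every check run against this code before 2000, with realistic data, would have passed. The bug was not hard to understand; it was waiting for a condition that had not yet arrived.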

(This sort of stuff is all over the place. As of this writing, there seems to be some kind of latent bug on my Web site that only manifests when I try to update PHP, and that probably happens thanks to stricter checking by the newer PHP interpreter. It wasn’t a problem when I put the site together, years ago, and for now I’m in upgrade jail until I sort it all out. Sigh.)

Bugs that are elusive can evade even a highly disciplined development process, and can also evade deep testing. Again, there are no guarantees, but the idea behind deep testing is to maximize the chance of finding elusive bugs.

How do you know that a bug is, or was, elusive? When an elusive bug is found in development, before release, qualified people on the team will say things like, “Wow… it would have been really hard for me to notice that bug. Good thing you found it.”

When a bug in our product is found in the field, by definition it eluded us, but was it an elusive bug?

Elusiveness isn’t a property of a bug, but a social judgment—a relationship between the bug, people, and context. If a bug found in the field was elusive, our social group will tend to agree, “Maybe we could have caught that, but it would have been really, really hard.” If a bug wasn’t elusive, our social group will say “Given the time and resources available to us, we really should have caught that.” In either case, responsible people will say, “We can learn something from this bug.”

That suggests, accurately, that both elusiveness and depth are subjective and socially constructed. A bug that might have been easy to find for a developer—shallow from her perspective—might have become buried by the time it gets to the tester. When a bug has been buried under layers of code, such that it’s hard to reach from the surface of the product, finding that bug deliberately requires deep testing.

A tester who is capable of analyzing and modeling risk and writing code to generate rich test data is likely to find deeper, more elusive data-related bugs than a tester who is missing one of those skills.
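As a rough illustration of what writing code to generate test data can look like, here is a small sketch that builds candidate inputs for a hypothetical text field with a length limit. The risk ideas it encodes (boundary lengths, non-ASCII characters, whitespace, a few awkward literals) are my assumptions about what might matter; a real risk analysis would drive a different and richer list.

```python
import itertools


def risky_strings(max_len: int) -> list[str]:
    """Build candidate inputs for a hypothetical length-limited text field."""
    # Lengths at and around the boundaries of the limit.
    lengths = [0, 1, max_len - 1, max_len, max_len + 1]
    # One sample character from several classes that often behave differently.
    samples = ["a", "é", "名", " "]
    cases = [ch * n for ch, n in itertools.product(samples, lengths)]
    # A few hand-picked values that have a history of causing trouble.
    cases += ["O'Brien", "Robert'); --", "\t\n", "null"]
    # Remove duplicates (such as repeated empty strings), keeping the order.
    return list(dict.fromkeys(cases))


cases = risky_strings(20)
print(len(cases), "candidate inputs")
```

The code is trivial. The skill lies in choosing which dimensions to vary, which is where risk analysis and modeling come in.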

A bug that is easy for a domain expert to notice might easily get past non-experts. Developing expertise in the product domain is an element of deeper testing.

A tester with a rich, diversified set of models for covering the product might find bugs she considers relatively easy to find, but which a developer without those models might consider to be a deep bug.

Deep testing is, in general, far more expensive and time-consuming than shallow testing. For that reason, we don’t want to perform deep testing

  • too often
  • prematurely
  • in a way oblivious to its cost
  • when it’s not valuable
  • when the feature in question and its relationship to the rest of the product is already well-understood
  • when risk is low
  • when shallow testing will do

We probably don’t need to perform deep testing when we’ve already done plenty of deep testing, and all we want to do is check the status of the build before release. We probably don’t need deep testing when a change is small, and simple, and well-contained, and both the change and its effects have been thoroughly checked. Such testing could easily be obsessive-compulsively, pathologically deep.

So, once again, the issue is not that shallow testing is bad and deep testing is good. In some contexts, shallow testing is just the thing we need, where deep testing would be overkill, expensive and unnecessary. The key is to consider the context, and the risk gap: the gap between what we can reasonably say we know and what we need to know in order to make good decisions about the product.
