
Testing Deep and Shallow (2): “Shallow” is a feature, not an insult!

When we talk about deep and shallow testing in the Rapid Software Testing namespace, some people might assume that we mean “deep testing” is good and decent and honourable and pure, and that we mean “shallow” to be an insult, based on some kind of moral judgement. But we don’t. “Shallow” is not an insult. It’s a description.

Depth and shallowness are ways of talking about the thoroughness of testing, but they’re not assessments of its value. The value or quality or appropriateness of thoroughness can only be decided in context. Shallow testing can be ideal for some purposes, and deep testing can be pathological. How so? Let’s start by getting clear on what we do mean.

Shallow testing is testing that has a chance of finding every easy bug.

“Shallow testing” is not an insult! Shallow doesn’t mean “slapdash”, and shallow doesn’t mean “sloppy”. Shallow means “close to where we are right now; near the current surface”.

Both shallow testing and finding easy bugs are good things. We want to find bugs—especially easy bugs—as quickly and as efficiently as possible, and shallow testing has a chance of finding them. Shallow testing affords some coverage, typically in specific areas of the product. In lots of contexts, the fact that shallow testing isn’t deep is a feature, not a bug.

Here’s one form of shallow testing: TDD-style checks. When developers design and implement TDD checks, the goal is not to test the product deeply. The goal is to make efficient, incremental progress in building a function or a feature. Each new check provides a quick indication that the new code does what the programmer intended it to do. Re-running the existing suite of checks provides a developer with some degree of confidence that the new code hasn’t introduced easy-to-find problems.
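To make that concrete, here’s a minimal sketch of what such a check might look like; the function and test names are hypothetical, not taken from any particular project:

```python
# A minimal, hypothetical sketch of a TDD-style check (pytest style).
# word_count is a made-up function; the point is the shape of the check:
# quick, first-order, and aimed at the programmer's immediate intention.

def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())

def test_word_count_simple_sentence():
    assert word_count("testing deep and shallow") == 4

def test_word_count_empty_string():
    assert word_count("") == 0
```

Checks like these run in a blink after every change, which is exactly what makes them sustainable.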

Those who like it say TDD makes rapid, disciplined progress possible by focusing the programmer on experimenting with the design and writing code efficiently. That effort is backed with simple, quick, first-order output checks. For the purpose of getting a new feature built, that’s perfectly reasonable and responsible.

When I’m writing code, I don’t want to do challenging, long-sequence, thorough experiments that probe a lot of different coverage areas every time I change the danged code. Neither do you. TDD checks aren’t typically targeted towards testing for security and usability and performance and compatibility and installability risks. If they were, TDD would be intolerably slow and ponderous, and running the checks would take ages.

Checking of this nature is appropriately and responsibly quick, inexpensive, and just thorough enough, allowing the developers to make reliable progress without disrupting development work too much. The idea is to find easy bugs at the coal face, applying relatively little effort that affords maximum momentum. That speed and ease is absolutely a feature of shallow testing. And not a bug.

There’s another kind of shallow testing that’s important to testers. A developer builds her mental models of the product as part of the process of building it. The tester doesn’t have that insider’s, builder’s perspective. The absence of that perspective is both a feature and a bug. It’s a feature because the tester is seeing the product with fresh eyes, which can be helpful for identifying problems and risks. It’s a bug because the tester must go through a stage of learning, necessary confusion, and bootstrapping to learn about the product.

The Bootstrap Conjecture suggests that any process that is eventually done well and efficiently started off by being done poorly and inefficiently; any process focused on trying to get things right the first time will be successful only if it’s trivial or lucky.

Even if we’re well informed about the product, its goodness is only theoretical until it’s built. As testers, we build and refine our mental models — how to understand the product, how to test it, and how to find problems in it — through encounters with the built product.

There is no means to teleport anyone, whether a tester or a developer, to deep testing right away. To some degree, we must learn to test deeply by trying to test the product in more shallow ways. Shallow testing affords the opportunity to learn the product, while absolving the tester of an obligation to try to get to deep testing too early. As a side benefit, this shallower, learning-focused testing offers us a chance of finding every easy bug.

Diversified shallow testing, in which we examine how to cover the product from lots of different perspectives, using lots of different techniques, to find lots of different threats to quality, can improve our chances of finding easy bugs. Overfocusing on any one coverage criterion, technique, or risk (as procedurally structured, repeated-by-rote, function-targeted output checks tend to do) limits those chances.
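As a hypothetical sketch of that diversity, here are three differently-angled shallow checks on the same made-up function; each angle has a chance of catching easy bugs the others would miss:

```python
# Hypothetical sketches of diversified shallow checks on one function.
# normalize_username is a made-up example.

def normalize_username(name: str) -> str:
    return name.strip().lower()

def test_example_value():
    # Angle 1: a simple example-based output check.
    assert normalize_username("  Alice ") == "alice"

def test_boundary_empty():
    # Angle 2: a boundary-value check.
    assert normalize_username("") == ""

def test_idempotence_property():
    # Angle 3: a property check; normalizing twice should give
    # the same result as normalizing once.
    for name in ["Bob", " CAROL ", "dave\t", "Éve"]:
        once = normalize_username(name)
        assert normalize_username(once) == once
```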

If we find lots of easy-to-notice bugs, management might want to pause and consider problems associated with the complexity of the product, or with maintaining a sustainable pace of development. That is, shallow testing might reveal a product or a project that’s in immediate trouble. Finding a few bugs with shallow testing can be an indication of risk lower down, and a pointer to where we might need deep testing to find the less obvious ones.

So what is deep testing, then?

Deep testing is testing that maximizes the chance of finding every elusive bug that matters.

That needs some unpacking.

First, “maximize”. No testing, and no form of testing, can guarantee that we’ll find every bug. (Note that in Rapid Software Testing, a bug is anything about the product that might threaten its value to some person who matters.)

It’s a commonplace maxim that complete testing is impossible: we can’t enter every possible set of inputs; examine every possible set of outputs; exercise every function in the product, in every possible sequence, with every possible variation of timing, on every possible platform, in every possible machine state that we can’t completely control anyway.
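A little arithmetic makes the point. Even a trivial function taking two 64-bit integers has more input pairs than we could check in any plausible lifetime:

```python
# Back-of-envelope: exhaustive input coverage of a function taking
# two 64-bit integers, at an optimistic billion checks per second.
pairs = 2 ** 128                       # every possible input pair
checks_per_second = 10 ** 9
seconds_per_year = 60 * 60 * 24 * 365
print(pairs / (checks_per_second * seconds_per_year))  # about 1.1e22 years
```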

Given that we’re dealing with an infinite, intractable, multi-dimensional test space, testing skill matters, but some degree of luck inevitably plays a role. We can only strive to maximize our chances of finding bugs, because bugs are to some degree elusive. Bugs can be subtle, hidden, rare, intermittent, or emergent.

Some bugs are subtle, based on poorly-understood aspects of programming languages, or surprising behavior of technologies.
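One well-known illustration from Python: default argument values are evaluated once, at function definition time, so a mutable default is shared across calls:

```python
# A classic subtle bug rooted in a poorly-understood language feature:
# the default list below is created once and shared across calls.

def append_item(item, items=[]):
    items.append(item)
    return items

print(append_item("a"))  # ['a']
print(append_item("b"))  # ['a', 'b'] -- state leaks between calls
```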

Some bugs are hidden in complex or obscure or old code. Some bugs are hidden in code that we didn’t write, but that we’re calling in a library or framework or operating system.

Some bugs are rare, dependent on specific sets of unusual conditions, or triggered by code encountering particular data, or exclusive to specific platforms.

Some bugs are intermittent, only manifesting infrequently, when the system is in a particular state.

Perhaps most significantly, some bugs are emergent. All of the components in a product might be fine in isolation, but the overall system has problems when elements of it are combined. A shared library, developed internally, that supports one product might clobber functions in another. A product that renders fine on one browser might run afoul of different implementations of standards on another.

And some bugs are hard to find with one kind of shallow testing, and easy to find with others. For instance, just today, I got mail from a Mac-using friend that I’m sure looked fine on his machine; it isn’t rendered properly under Outlook on Windows. It’s easy for people with eyes and conscious minds to notice that bug when we encounter it. It’s a good deal harder to anticipate risks associated with display issues and to create automated checks for them in advance of our encounter.

A module that performs fine on a developer’s machine can be subject to weird timing problems when it is folded into the rest of a product, or when network latency comes into play, or when lots of people are using the system at the same time. There are all kinds of checks we can make as we build the product, but it often takes experimentation to find out whether our checks covered the risks.
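Here’s a hypothetical sketch of one such timing problem: a read-modify-write that behaves perfectly in a single-threaded check, but can lose updates when threads interleave. Whether it fails on any given run depends on the interpreter and the scheduler, which is exactly what makes bugs like this hard to pin down:

```python
# Hypothetical sketch of an intermittent timing bug. counter += 1
# is a separate read, add, and write; a thread switch between the
# read and the write silently loses an update.
import threading

counter = 0

def work(iterations):
    global counter
    for _ in range(iterations):
        counter += 1  # not atomic

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A single-threaded check always sees 400000; under real contention,
# on some interpreters and schedulers, the result is sometimes less.
print(counter)
```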

Time can be a factor, too. One classic case is the Y2K problem; storing the year component of a date in a two-digit field wouldn’t have looked like much of a problem in 1970, when storage was expensive and people didn’t foresee that the system might still be in use a generation later. Programs that ran just fine on single-tasking 8086 processors encountered problems when run in the supposedly compatible virtual 8086 mode of 80386 and later processors.
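Python’s own date parsing shows the guess that a two-digit year forces. Its %y directive follows the documented POSIX convention: 69 through 99 map to 1969 through 1999, and 00 through 68 map to 2000 through 2068:

```python
# Two-digit years force the parser to guess the century.
from datetime import datetime

print(datetime.strptime("31/12/70", "%d/%m/%y").year)  # 1970
print(datetime.strptime("01/01/30", "%d/%m/%y").year)  # 2030
```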

(This sort of stuff is all over the place. As of this writing, there seems to be some kind of latent bug on my Web site that only manifests when I try to update PHP, and that probably happens thanks to stricter checking by the newer PHP interpreter. It wasn’t a problem when I put the site together, years ago, and for now I’m in upgrade jail until I sort it all out. Sigh.)

Bugs that are elusive can evade even a highly disciplined development process, can evade shallow testing, and can even evade deep testing. Again, there are no guarantees, but the idea behind deep testing is to maximize the chance of finding elusive bugs.

How do you know that a bug is, or was, elusive? When an elusive bug is found in development, before release, qualified people on the team will say things like, “Wow… it would have been really hard for me to notice that bug. Good thing you found it.”

When a bug in our product is found in the field, by definition it eluded us, but was it an elusive bug? We might say that, since it eluded us, it’s an elusive bug by definition, but hold on.

Elusiveness isn’t a property of a bug. Elusiveness is based on a social judgment—a relationship between the bug, people, and context. If a bug found in the field was elusive, our social group will tend to agree: “Maybe we could have caught that, but it would have been really, really hard.” If a bug wasn’t elusive, our social group will say “Given the time and resources available to us, we really should have caught that.” In either case, responsible people will say, “We can learn something from this bug.”

That suggests, accurately, that both elusiveness and depth are subjective and socially constructed. A bug that might have been easy to find for a developer—shallow from her perspective—might have become buried by the time it gets to the tester. When a bug has been buried under layers of code, such that it’s hard to reach from the surface of the product, finding that bug deliberately requires deep testing.

A tester who is capable of analyzing and modeling risk and writing code to generate rich test data is likely to find deeper, more elusive data-related bugs than a tester who is missing one of those skills.
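As a hypothetical sketch of what generating rich test data can look like (property-based tools like Hypothesis industrialize the idea):

```python
# Hypothetical sketch: a small generator of risk-targeted test data.
# Each value aims at a different class of data-related bug.
import random

RISKY_STRINGS = [
    "",               # empty input
    "   ",            # whitespace only
    "O'Brien",        # embedded quote: escaping and injection risk
    "null",           # looks like a keyword to some parsers
    "café",           # non-ASCII
    "a" * 10_000,     # very long input
]

def risky_string():
    """Return a string chosen to probe a data-related risk."""
    return random.choice(RISKY_STRINGS)
```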

A bug that is easy for a domain expert to notice might easily get past non-experts. Developing expertise in the product domain is an element of deeper testing.

A tester with a rich, diversified set of models for covering the product might find bugs she considers relatively easy to find, but which a developer without those models might consider to be a deep bug.

Deep testing is, in general, far more expensive and time-consuming than shallow testing. This represents a conflict: as developers, our ambitions and our incentives are oriented towards getting the product built quickly. Deep testing is almost certain to disrupt development; to slow down a developer’s progress. For that reason, especially as developers, we don’t want to perform deep testing

  • prematurely
  • too often
  • in a way oblivious to its cost
  • when it’s not valuable
  • when the feature in question and its relationship to the rest of the product is already well-understood
  • when risk is low, and deep bugs don’t matter
  • when shallow testing will do
  • if we can get someone else to help and reduce the amount of disruptively deep testing we must do

On the other hand, especially as testers, we want to perform deep testing

  • as early as possible (recognizing that we’ll need to start with shallow testing)
  • continuously
  • recognizing that it’s usually more expensive than shallow testing (so it had better be worth the opportunity cost)
  • conscious of and advocating for its value, when it is valuable
  • when the effects and side effects of the feature are not well known
  • when risk is substantial and material — and when we don’t know whether the deep, yet-to-be-found bugs matter or not
  • when shallow testing isn’t enough
  • when we can help to accelerate development by reducing the amount of deep testing the developers might otherwise be obliged to do

As an organization, we probably don’t need to perform more deep testing when we’ve already done plenty of deep testing, and all we want to do is check the status of the build before release. We probably don’t need deep testing when a change is small, and simple, and well-contained, and both the change and its effects have been thoroughly checked. Deep testing under those conditions could be obsessive-compulsively, pathologically deep.

On the other hand, if we haven’t been doing deep testing in parallel with the rest of development; if it’s a complex product with lots of dependencies; if there’s lots of churn; if we haven’t been looking at the whole product all the way along; if money, health, safety, social concerns, or reputation are on the line… In such conditions, our testing could be recklessly shallow.

So, once again, the issue is not that shallow testing is bad and deep testing is good. In some contexts, plenty of diversified shallow testing is just the thing we need, where deep testing would be overkill: expensive and unnecessary. The key is to consider the context, and the risk gap: the gap between what we can reasonably say we know and what we need to know in order to make good decisions about the product.

This post was lightly edited 2024-06-04.
