
Flaky Testing

The expression “flaky tests” is evidence of flaky testing. No scientist refers to “flaky experimental results”. Scientists who observe inconsistency don’t dismiss it. They pay close attention to it, and probe it. They redesign their experiments or put better controls on them.

When someone refers to an automated check (or a suite of them) as a “flaky test”, the suggestion is that it represents an unreliable experiment. That assumption is misplaced. In fact, the experiment reliably shows that someone’s models of the product, check code, test environment, outcomes, theory, and the relationships between them are misaligned.

That’s not a “flaky experiment”. It’s an excellent experiment. The experiment is telling you something crucial: there’s something you don’t know. In science, a surprising, perplexing, or inconsistent result prompts scientists to begin an investigation. By contrast, in software, an inconsistent result prompts some people to shrug and ignore what the experiment is trying to tell them. Then they do weird stuff like calculating a “flakiness score”.

Of course, it’s very tempting psychologically to dismiss results that you can’t explain as “noise”, annoying pieces of red junk on your otherwise lovely all-green lawn. But a green lawn is not the goal. Understanding what the junk is, where it is, and how it gets there is the goal. It might be litter, or it might be a leaking container of toxic waste.

It’s not a great idea to perform a test that you don’t understand, unless your goal is to understand it and its relationship to the product. But it’s an even worse idea to carelessly dismiss a test outcome that you don’t understand. For a tester, that’s the epitome of “flaky”.

Now, on top of all that, there’s something even worse. Suppose you and your team have a suite of 100,000 automated checks that you proudly run on every build. Suppose that, of these, 100 run red. So you troubleshoot. It turns out that your product has problems indicated by 90 of the checks, but ten of the red results represent errors in the check code. No problem. You can fix those, now that you’re aware of the problems in them.

Thanks to the scrutiny that red checks receive, you have become aware that 10% of the outcomes you’re examining are falsely signalling failure when the product is actually fine. That’s only 10 “flaky” checks out of 100,000. Hurrah! But remember: there are 99,900 checks that you haven’t scrutinized. And you probably haven’t looked at them for a while.

Suppose you’re on a team of 10 people, responsible for 100,000 checks. Reviewing those annually means that each person, working solo, reviews 10,000 checks a year. At roughly 200 working days a year, that’s 50 per person (or 100 per pair) every working day of the year. Does your working day include that?
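Here’s a minimal sketch of that arithmetic, using the numbers above; the 200 working days per year is an assumption, not a universal figure.

```python
# Back-of-the-envelope review burden, using the numbers from this example.
TOTAL_CHECKS = 100_000
TEAM_SIZE = 10
WORKING_DAYS_PER_YEAR = 200  # assumption; adjust for your own calendar

per_person_per_year = TOTAL_CHECKS / TEAM_SIZE                     # 10,000 checks
per_person_per_day = per_person_per_year / WORKING_DAYS_PER_YEAR   # 50 checks
per_pair_per_day = per_person_per_day * 2                          # 100 checks

print(f"{per_person_per_day:.0f} checks per person per working day")
print(f"{per_pair_per_day:.0f} checks per pair per working day")
```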

Here’s a question worth asking, then: if 10% of 100 red checks are misleadingly signalling a problem, what percentage of 99,900 green checks are misleadingly signalling “no problem”? They’re running green, so no one looks at them. They’re probably okay. But even if your unreviewed green checks are ten times more reliable than the red checks that got your attention (because they’re red), that’s 1%. That’s 999 misleadingly green checks.
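The same kind of back-of-the-envelope estimate; the tenfold reliability factor is, as above, nothing more than a generous assumption.

```python
# Estimating misleadingly green checks from the observed false-red rate.
red_checks = 100
false_reds = 10                       # red results caused by errors in the check code
green_checks = 100_000 - red_checks   # 99,900 unscrutinized green checks

false_red_rate = false_reds / red_checks   # 0.10
# Generous assumption: unreviewed green checks are ten times more reliable
# than the red checks that actually received scrutiny.
false_green_rate = false_red_rate / 10     # 0.01

misleading_greens = round(green_checks * false_green_rate)  # 999
print(f"~{misleading_greens} checks may be running green while hiding problems")
```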

Real testing requires intention and attention. It’s okay for a suite of checks to run unattended most of the time. But to be worth anything, they require periodic attention and review—or else they’re like smoke detectors, scattered throughout enormous buildings, whose batteries and states of repair are uncertain. And as Jerry Weinberg said, “most of the time, a nonfunctioning smoke alarm is behaviorally indistinguishable from one that works. Sadly, the most common reminder to replace the batteries is a fire.”

And after all this, it’s important to remember that most checks, as typically conceived, are about confirming the programmers’ intentions. In general, they represent an attempt to detect coding problems and thereby keep programmers from committing (pun intended) easily avoidable errors. This is a fine and good thing—mostly when the effort is targeted towards lower-level, machine-friendly interfaces.

Typical GUI checks, instrumented with machinery, are touted as “simulating the user”. They don’t really do any such thing. They simulate behaviours (physical keypresses and mouse clicks), which are only the visible aspects of using the product—and of testing. GUI checks do not represent users’ actions, which in the parlance of Harry Collins and Martin Kusch are behaviours plus intentions. Significantly, no one reduces programming or management to scripted and unmotivated keystrokes, yet people call automated GUI checks “simulating the user” or “automated testing”.

Such automated checks tell us almost nothing about how people will experience the product directly. They won’t tell us how the product supports the user’s goals and tasks—or where people might have problems getting what they want from the product. Automated checks will not tell us about people’s confusion or frustration or irritation with the product. And automated checks will not question themselves to raise concern about deeper, hidden risk.

More worrisome still: people who are sufficiently overfocused on, even fixated on, writing and troubleshooting and maintaining automated checks won’t raise those concerns either. That’s because programming automated GUI checks is hard, as all programming is hard. But programming a machine to simulate human behaviours via complex, ever-changing interfaces designed for humans instead of machines is especially hard. The effort easily displaces risk analysis, studying the business domain, learning about users’ problems, and critical thinking about all of that.

Testers: how much time and effort are you spending on the care and feeding of scripts, effort that represents a distraction from interacting with the product and searching for problems that matter? How much more valuable would your coding be if it helped you examine, explore, and experiment with the product and its data? If you’re a manager, how much “testing” time is actually coding and fixing time, in which your testers are being asked to fuss with making the checks run green, and adapting them to ongoing changes in the product?

So the issue is not flaky tests, but flaky testing talk, and flaky test strategy. It’s amplified by referring to “flaky understanding” and “flaky explanation” and “flaky investigation” as “flaky tests”.

Some will object. “But that’s what people say! We can’t just change the language!” I agree. But if we don’t change the way we speak, and the way we think along with it, we won’t address the real flakiness: the flakiness in our systems, and the flakiness in our understanding and explanations of those systems. With determination and skill and perseverance, we can change this. We can help our clients to understand the systems they’ve got, so that they can decide whether those are the systems they want.

Learn how to focus on fast, inexpensive, powerful testing strategies to find problems that matter. Register for classes here.
