Last night, my wife was out on an errand in our car. She parked it, entered the store, and came out again. She tried to start the car. It wouldn’t start.
She called home to consult with me. We tried a few things over the phone and considered a couple of possible problems. From what I could tell, the starter motor wasn’t engaging. That wasn’t exactly a surprise, because I had noticed that the car had been a little reluctant to start over the last little while. We decided to call the auto club to have the car towed to our mechanic’s shop, close to our home.
When the tow truck driver arrived, he tried to start the car. It started. The tow was free, thanks to the auto club, so we decided to have it towed anyway.
I walked down to meet the driver at the mechanic’s shop, to pick up the key—the shop was closed, and there isn’t a drop box. The driver unloaded the car from the flatbed, started it, and parked it. He left the car running. I got in, turned it off, and then started it. Once again, the car seemed a little cranky. Nonetheless, it started.
We’re going to Montreal in a couple of days—a six-hour drive. We’ll need the car to get there, to get around the city, to drop in on some friends on the way back, and to get home. We’ve established that the car can work. Does that mean everything is okay, or would you have the mechanic take a look at the car?
The other day, I was on LinkedIn. I noticed an interesting behaviour: account names were suddenly missing from my notifications (they usually appear beside the picture).
I wrote a brief post about the issue, noting that a few minutes later the names were back.
One correspondent replied that this was “pretty impressive fault tolerance and recovery if you think of it that way.”
Well… maybe. But as a tester, I don’t think that way. As a tester, I must focus on trouble; on risk; on the idea that I’m seeing evidence of a deeper problem.
As testers, we must remain alert to any symptom, anything that seems out of place: a hissing sound, a grinding sound, an inconsistency, missing text, or a starter that sometimes doesn’t work.
Today, reviewing my email backlog, I saw the problem again: a notification from a different day, with the user data missing.
That seems to be evidence of a more systemic problem. To a tester, the fact that some message notifications seem to be displayed properly is unremarkable. The important thing is that some don’t.
All this reminds me of Richard Feynman (the patron saint of software testers) and the appendix he wrote to the Report of the Presidential Commission on the Space Shuttle Challenger Accident.
Feynman noted that NASA officials drew exactly the wrong conclusion from problems that had been observed on previous shuttle flights: erosion and blow-by. (Erosion refers to the degradation of the rubber O-rings that sealed the joints between segments of the solid rocket boosters; blow-by refers to hot gases escaping past those degraded seals.)
The phenomenon of accepting for flight, seals that had shown erosion and blow-by in previous flights, is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next. The origin and consequences of the erosion and blow-by were not understood. They did not occur equally on all flights and all joints; sometimes more, and sometimes less. Why not sometime, when whatever conditions determined it were right, still more leading to catastrophe?
In spite of these variations from case to case, officials behaved as if they understood it, giving apparently logical arguments to each other often depending on the “success” of previous flights.
Report of the Presidential Commission on the Space Shuttle Challenger Accident
Appendix F: Personal Observations on the Reliability of the Shuttle
Here, Feynman is looking beyond the technical problem and identifying a social and psychological problem. As Diane Vaughan puts it in her splendid analysis, the problem was the normalization of deviance:
The decision [to launch] was not explained by amoral, calculating managers who violated rules in pursuit of organizational goals, but was a mistake based on conformity — conformity to cultural beliefs, organizational rules and norms, and NASA’s bureaucratic, political, and technical culture.
Diane Vaughan, The Challenger Launch Decision
I worry that the software business hasn’t learned from NASA’s experience.
Dear testers, and dear developers: can work does not mean does work, and seems to work now does not mean will work later. A problem that appears briefly and seems to go away is evidence that we have an inconsistent system. That’s not an invitation to shrug; it’s a motivation to investigate.
The car’s at the shop. We’ll be getting it back tomorrow.