I’d like to warn you that you’re about to get “promoted” — and doubtless without a pay increase — to a kind of management job that I bet you don’t want.
Hold on a second; pardon me. Let me introduce myself first.
I’m a tester. As a tester, it’s my job to point out problems and risks that you might not have recognized.
A brief, informal survey of some of my programming friends suggests that they haven’t considered a problem that I’ve noticed looming. I want to alert you to that problem here; I suspect it might be a problem for you too.
Your job as a developer, as I think we can agree, is to link the world of humans and machines by developing code, programs, and systems that help to end people’s troubles.
My job as a tester is to help you and our managers, by linking the world of humans and software systems with the world of doubt and risks. My role is to be professionally unsure that the product you’re making is one that you, and the business, and our customers want.
For that reason, I draw your attention to problems that I observe as you’re building the product, so that you can address problems with the design. Then I set out to get some experience with the product you’ve built. I explore it and examine it and experiment with it, with a special focus on learning about problems that threaten its value to people who matter. That is: it’s my job to find trouble wherever I look.
My testing job is socially challenging in two ways. First, I’m challenging the product and assumptions about it in a social context. Second, people generally prefer not to think about problems, so pointing them out is, to some degree, socially awkward.
I take on this difficult role to provide a service to you and to the business. You want to make progress building the product at top speed, and that’s what the business wants from you. You’re writing code, and doing quick, mostly low-level testing to be confident that what you’re building is reasonably close to what you’re intending to build. You’re probably writing some checks to find avoidable, easy bugs before they get buried, and to help prevent regression when the code changes.
Deeper testing would interrupt your flow. My job is to test the product more deeply so you don’t have to. I’m here to help you recognize when the troubles haven’t been ended, or when new trouble has been introduced. If I’m doing a good job, you are less likely to be oblivious to problems and risks. You can at least be aware of the ones I’ve found, and address them as you and the managers see fit.
Here’s the thing: I’m not sure if you’re aware of it, but there’s a good chance your job is about to change dramatically, and in a way that you might not like.
You see, I observe that some people in the software community are very excited about tools and bots based on Large Language Models (LLMs), and how they’re going to aid developers in programming. No question about it: LLMs may be able to assist you in several ways. They can help you to get started or unstuck; to create a bunch of prototype code quickly; to reduce your typing time, especially for commonplace coding tasks; to provide some quick answers to quandaries in a way you might like better than Stack Overflow. This is cool, because in principle these things can let you spend more time focusing on possible solutions to problems, and less time grinding out code.
I’ve also observed some responsible people in the development community trying to emphasize that the output from products based on LLMs can be unreliable. That makes sense, because LLMs don’t understand things; they produce strings. LLMs perform calculations to produce output that is statistically plausible, such that people can feel like they’re talking to a human.
That output is not required to be correct, nor can it be trusted to solve the problems you’re trying to solve. LLMs are incurious; they tend not to ask clarifying questions the way you do. Sometimes LLMs fail to address part of the task that you’ve prompted them to fulfill. Sometimes they hallucinate. Sometimes they forget things. There are patterns of problems that we’ve seen while testing them. You can read more about all that here. Notice, by the way, that it’s really hard to talk about LLMs without anthropomorphizing them.
If you’re using an LLM to generate code, one obvious risk is that the generated output will contain relatively straightforward coding errors of some kind. When you review the code, you’ll probably spot some of them; others may be more subtle. The probability of missing coding errors generally increases as the request becomes larger or more complex, and as the volume of output grows.
Some people equate coding errors with “bugs”. In Rapid Software Testing, we don’t limit the notion of “bugs” to that. For us, a bug is anything about the product that threatens its value to someone who matters. Many bugs come not from coding errors, but from design problems, requirement oversights, platform or component incompatibilities, unexpected data, timing dependencies… Bugs may appear in a system even though the units or components have been tested and seem to be just fine on their own. Problems happen due to unforeseen conditions, interactions, interruptions, or inadvertent misuse. That is, many bugs are emergent.
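A minimal sketch of an emergent bug, assuming a made-up pair of units: both function names and the timeout scenario are hypothetical, invented purely for illustration. Each unit passes its own check, yet the combination misbehaves because the two authors assumed different time units.

```python
def configured_timeout() -> int:
    """Return the request timeout. This author was thinking in seconds."""
    return 30


def should_give_up(elapsed_ms: int, timeout_ms: int) -> bool:
    """Return True once elapsed time exceeds the timeout, both in milliseconds."""
    return elapsed_ms > timeout_ms


# Each unit looks fine when checked on its own.
assert configured_timeout() == 30
assert should_give_up(elapsed_ms=31, timeout_ms=30)

# Combined, a request is abandoned after 31 milliseconds,
# although the intent was to wait 30 seconds.
gave_up_early = should_give_up(elapsed_ms=31, timeout_ms=configured_timeout())
```

Neither function contains a coding error in isolation; the trouble only appears when the pieces meet, which is why checks at the unit level would not reveal it.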
Another real possibility is that the generated code will omit features, or things like error checking and exception handling, even if you prompt for them. Unlike you, the LLM won’t ponder the request before it starts typing. The LLM will start spitting out output immediately. An LLM doesn’t reflect on the ideas it has, because it doesn’t have ideas; it processes tokens. For a sufficiently complex product, the LLM may run out of room in its context window, rendering it forgetful about earlier prompts or responses, or code generated earlier in the session.
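A minimal sketch of what omitted error checking can look like, assuming a hypothetical order-quantity parser. Both function names are invented for illustration, and the first function is not actual LLM output; it simply shows the kind of happy-path code that looks plausible on a quick read.

```python
def parse_quantity_naive(text: str) -> int:
    # Happy path only: "3" works, "-5" is silently accepted,
    # and "three" raises a generic ValueError from int().
    return int(text)


def parse_quantity_checked(text: str) -> int:
    # Same job, with the checks a reviewer would probably ask for.
    try:
        quantity = int(text)
    except ValueError:
        raise ValueError(f"quantity must be a whole number, got {text!r}") from None
    if quantity < 1:
        raise ValueError(f"quantity must be at least 1, got {quantity}")
    return quantity
```

Both versions behave identically on ordinary input such as "3", which is exactly why the missing checks are easy to overlook in review.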
Remember, the LLM doesn’t understand things; it calculates probable next words or strings. It doesn’t deal at all with the essential human parts of software development. It doesn’t socialize, manage politics, refine requirements, or negotiate the mission in the ways you’ve learned to do. It doesn’t respond to a puzzled look.
As a tester, I’m naturally concerned about the possibility of problems, bugs, and risks in the product we’re developing. I’m definitely concerned that LLMs aren’t as technically competent as you are. After all, even though they have access to more data and more programs than we could ever have, they regularly screw up at simple tasks associated with math, data generation, text analysis, reasoning about puzzles, and even describing themselves.
Since it doesn’t participate in human society, an LLM doesn’t have your social competence either. People make stereotyped jokes about programmers. (Q. How many programmers does it take to change a light bulb? A. Let me tell you why it’s not a good idea to change the light bulb…) But programmers act like programmers because human beings have skills and tacit knowledge that affect their missions. An LLM doesn’t have a conscience, or empathy. It doesn’t deal with controversy between stakeholders. It doesn’t obtain real human experience with your product — or with any product — so it can’t recognize many things that real people might like or dislike about it.
Some people who tout LLMs say that they’re helpful with test design. I hope for everyone’s sake that LLMs are better at coding than they are at testing tasks, because they’re definitely not very good at those. (More on that here, and here too.) That’s explicable; LLMs are trained on the internet, where there’s a stonking amount of bad writing about testing, and even worse advice about it.
(My colleague James Bach points out that while there has been enormous effort and some success associated with suppressing prejudice, hate speech, racism, and sexism in LLM output, there’s apparently been no push to reduce the volume of bad testing advice.)
Since LLMs are not technically savvy, nor social agents, nor socially competent, it would be a mistake to trust LLMs for most kinds of testing tasks today, or at any time in the foreseeable future. They might be helpful when there’s not very much at stake, and when the tester remains in charge, but any output that matters from an LLM will need scrutiny for a good while yet. Moreover, the volume of stuff being produced can be overwhelming. Reviewing LLM output carefully is like looking for unexploded bombs after an avalanche.
Now, there are some skilled testers who are ready and eager right now to help with testing of LLMs — or at least getting ready to help. It’s true; there aren’t nearly enough of us yet. We’re going to need lots more, and we’re all going to have to practice and talk about what we’re learning. The overarching theme of testing — getting experience with products and analyzing them with a focus on spotting trouble before it’s too late — doesn’t change. In Rapid Software Testing, at least, we’re always working on ideas on how to help people test new things, think critically about them, and analyze risk. We’re super happy to work with developers on that, too.
Meanwhile, here’s a list of things I don’t hear my developer friends saying very often:
“I LOVE testing!”
“I can’t wait to find hidden, rare, subtle, intermittent, emergent bugs in this product!”
“Reading through piles of unreliable code is great!”
“Putting this thing through its paces to find deep problems? When can I start?!”
“There’s a possibility that this thing might spill people’s confidential data; I’m goin’ a-huntin’ for security holes!”
“Fixing bugs in code written by others is my idea of a good time!”
“Supervising junior programmers who don’t learn from experience is fun!”
And this is what I’m warning you about, Dear Developer: development processes and tools that incorporate Large Language Models will produce enormous piles of code very quickly. You won’t be writing that code, but if you’re responsible for it, you’ll be responsible for examining it, reviewing it, finding problems in it, and fixing it.
That is: if you are using LLMs, or if you’re directed to do so, you’re about to turn from an author of your own code into a reviewer and maintenance programmer — and maybe a tester, too — for a barely competent programmer producing mountains of mediocre and unreliable code.
Are you ready for that?