Dear Developer…
I’d like to warn you that you’re about to get “promoted” — and doubtless without a pay increase — to a kind of management job that I bet you don’t want.
Hold on a second; pardon me. Let me introduce myself first.
I’m a tester. As a tester, it’s my job to point out problems and risks that you might not have recognized.
A brief, informal survey of some of my programming friends suggests that they haven’t considered a problem that I’ve noticed looming. I want to alert you to that problem here; I suspect it might be a problem for you too.
Your job as a developer, as I think we can agree, is to link the world of humans and machines by developing code, programs, and systems that help to end people’s troubles.
My job as a tester is to help you and our managers, by linking the world of humans and software systems with the world of doubt and risks. My role is to be professionally unsure that the product you’re making is one that you, and the business, and our customers want.
For that reason, I draw your attention to problems that I observe as you’re building the product, so that you can address problems with the design. Then I set out to get some experience with the product you’ve built. I explore it and examine it and experiment with it, with a special focus on learning about problems that threaten its value to people who matter. That is: it’s my job to find trouble wherever I look.
My testing job is socially challenging in two ways. First, I’m challenging the product and assumptions about it in a social context. My job is also socially challenging in the sense that people generally prefer not to think about problems. Pointing them out is, to some degree, socially awkward.
I take on this difficult role to provide a service to you and to the business. You want to make progress building the product at top speed, and that’s what the business wants from you. You’re writing code, and doing quick, mostly low-level testing to be confident that what you’re building is reasonably close to what you’re intending to build. You’re probably writing some checks to find avoidable, easy bugs before they get buried, and to help prevent regression when the code changes.
Deeper testing would interrupt your flow. My job is to test the product more deeply so you don’t have to. I’m here to help you recognize when the troubles haven’t been ended, or when new trouble has been introduced. If I’m doing a good job, you are less likely to be oblivious to problems and risks. You can at least be aware of the ones I’ve found, and address them as you and the managers see fit.
Here’s the thing: I’m not sure if you’re aware of it, but there’s a good chance your job is about to change dramatically, and in a way that you might not like.
You see, I observe that some people in the software community are very excited about tools and bots based on Large Language Models (LLMs), and how they’re going to aid developers in programming. No question about it: LLMs may be able to assist you in several ways. They can help you to get started or unstuck; to create a bunch of prototype code quickly; to reduce your typing time, especially for commonplace coding tasks; to provide some quick answers to quandaries in a way you might like better than Stack Overflow. This is cool, because in principle these things can let you spend more time focusing on possible solutions to problems, and less time grinding out code.
I’ve also observed some responsible people in the development community trying to emphasize that the output from products based on LLMs can be unreliable. That makes sense, because LLMs don’t understand things; they produce strings. LLMs perform calculations to produce output that is statistically reasonable; plausible, such that people can feel like they’re talking to a human.
That output is not required to be correct, nor can it be trusted to solve the problems you’re trying to solve. LLMs are incurious; they tend not to ask clarifying questions the way you do. Sometimes LLMs fail to address part of the task that you’ve prompted them to fulfill. Sometimes they hallucinate. Sometimes they forget things. There are patterns of problems that we’ve seen while testing them. You can read more about all that here. Notice, by the way, that it’s really hard to talk about LLMs without anthropomorphizing them.
If you’re using an LLM to generate code, one obvious risk is that the generated output will contain relatively straightforward coding errors of some kind. When you review the code, you’ll probably spot some of them; others may be more subtle. The possibility of missing coding errors generally increases as the request becomes larger or more complex, and when there’s more output.
Some people equate coding errors with “bugs”. In Rapid Software Testing, we don’t limit the notion of “bugs” to that. For us, a bug is anything about the product that threatens its value to someone who matters. Many bugs come not from coding errors, but from design problems, requirement oversights, platform or component incompatibilities, unexpected data, timing dependencies… Bugs may appear in a system even though the units or components have been tested and seem to be just fine on their own. Problems happen due to unforeseen conditions, interactions, interruptions, or inadvertent misuse. That is, many bugs are emergent.
Another real possibility is that the generated code will omit features, or things like error checking and exception handling, even if you prompt for them. Unlike you, the LLM won’t ponder the request before it starts typing. The LLM will start spitting output immediately. An LLM doesn’t reflect on the ideas it has, because it doesn’t have ideas; it processes tokens. For a sufficiently complex product, the LLM may run out of tokens in its context window, rendering it forgetful about earlier prompts or responses, or code generated earlier in the session.
Remember, the LLM doesn’t understand things; it calculates probable next words or strings. It doesn’t deal at all with the essential human parts of software development. It doesn’t socialize, manage politics, refine requirements, or negotiate the mission in the ways you’ve learned to do. It doesn’t respond to a puzzled look.
As a tester, I’m naturally concerned about the possibility of problems, bugs, and risks about the product we’re developing. I’m definitely concerned that LLMs aren’t as technically competent as you are. After all, even though they have access to more data and more programs than we could ever have, they regularly screw up at simple tasks associated with math, data generation, text analysis, reasoning about puzzles, and even describing themselves.
Since it doesn’t participate in human society, an LLM doesn’t have your social competence either. People make stereotyped jokes about programmers. (Q. How many programmers does it take to change a light bulb? A. Let me tell you why it’s not a good idea to change the light bulb…) But programmers act like programmers because human beings have skills and tacit knowledge that affect their missions. An LLM doesn’t have a conscience, or empathy. It doesn’t deal with controversy between stakeholders. It doesn’t obtain real human experience with your product — or with any product — so it can’t recognize many things that real people might like or dislike about it.
Some people who tout LLMs say that they’re helpful with test design. I hope for everyone’s sake that LLMs are better at coding than they are at testing tasks, because they’re definitely not very good at those. (More on that here, and here too.) That’s explicable; LLMs are trained on the internet, where there’s a stonking amount of bad writing about testing, and even worse advice about it.
(My colleague James Bach points out that while there has been enormous effort and some success associated with suppressing prejudice, hate speech, racism, and sexism in LLM output, there’s apparently been no push to reduce the volume of bad testing advice.)
Since LLMs are not technically savvy, nor social agents, nor socially competent, it would be a mistake to trust LLMs for most kinds of testing tasks today, or at any time in the foreseeable future. They might be helpful when there’s not very much at stake, and when the tester remains in charge, but any output that matters from an LLM will need scrutiny for a good while yet. Moreover, the volume of stuff being produced can be overwhelming. Reviewing LLM output carefully is like looking for unexploded bombs after an avalanche.
Now, there are some skilled testers who are ready and eager right now to help with testing of LLMs — or at least getting ready to help. It’s true; there aren’t nearly enough of us yet. We’re going to need lots more, and we’re all going to have to practice and talk about what we’re learning. The overarching theme of testing — getting experience with products and analyzing them with a focus on spotting trouble before it’s too late — doesn’t change. In Rapid Software Testing, at least, we’re always working on ideas on how to help people test new things, think critically about them, and analyze risk. We’re super happy to work with developers on that, too.
Meanwhile, here’s a list of things I don’t hear my developer friends saying very often:
“I LOVE testing!”
“I can’t wait to find hidden, rare, subtle, intermittent, emergent bugs in this product!”
“Reading through piles of unreliable code is great!”
“Putting this thing through its paces to find deep problems? When can I start?!”
“There’s a possibility that this thing might spill people’s confidential data; I’m goin’ a-huntin’ for security holes!”
“Fixing bugs in code written by others is my idea of a good time!”
“Supervising junior programmers that don’t learn from experience is fun!”
And this is what I’m warning you about, Dear Developer: development processes and tools that incorporate Large Language Models will produce enormous piles of code very quickly. You won’t be writing that code, but if you’re responsible for it, you’ll be responsible for examining it, reviewing it, finding problems in it, and fixing it.
That is: if you are using LLMs, or if you’re directed to do so, you’re about to turn from an author of your own code into a reviewer and maintenance programmer — and maybe a tester, too — for a barely competent programmer producing mountains of mediocre and unreliable code.
Are you ready for that?
No matter what, we testers are here to help.
Your friend,
Michael B.
Tester
Update, 2024-06-06: Read this superb post by Chelsea Troy.
Great read, I especially enjoyed the “phrases I don’t hear developers say often.” I’ve heard quite a few of those fairly often — but only under the warm blanket of sarcasm.
Developers have always joked about how they normally just steal code from the internet (StackOverflow) instead of trying to figure it out on their own, so I think using LLMs is done in the same vein.
You could argue that code found on the internet and written by a fellow human being is *likely* tested code, but we all know from testing experience that just because code was tested in some other application doesn’t mean that it’s tested in our application. Code from the internet in general should be tested and understood before being executed or included in a project.
So since we’re grabbing code written by other people on the internet AND grabbing code from LLMs… and we will need to validate them before utilizing them in the current project… I would argue that developers don’t need to heed this warning because they already SHOULD be.
There is some experience to gain from using an LLM, sending its output to QA, receiving testing feedback, fixing the mistakes, and repeating until it goes through. Maybe working through the mistakes in AI code will help you identify a problem that the LLM is prone to creating, or one that you’ve been creating as well. Using LLMs is a double-sided sword (that would be a sword as a hilt and a sword as a sword; wield with confidence), but so are many tools in the software world.
From a testing perspective, I generally use LLMs to bridge the gap in shoddy framework documentation to develop utility functions or create custom assertions. Generally the output helps me understand what I need to be doing to achieve my desired outcome, but the code itself doesn’t work.
Working with LLMs (and other code from the internet) means learning not to trust what you’re given. To the developers who have had a wake up call from this article… there’s a long road ahead of you.
Seamus,
I’m not sure about a few things you wrote, but perhaps they were phrased the wrong way. When you say “…developers don’t need to heed this warning because they already SHOULD be…”, to me that reads that they DO need to heed this warning, regardless of whether or not they already are.
To Michael’s original point, LLMs generate voluminous amounts of text, so I think it isn’t a great idea to use LLMs and feed their output through testing, fixing mistakes, and repeating like you suggest, rather than the conventional alternative.
Also, I think it is a mistake for you or anyone to make a blanket comparison between going to StackOverflow for technical help on a problem vs ‘asking’ an LLM.
StackOverflow and similar technical forums are places where real people (some experts, some not) discuss, debate, vote, opine etc. on technical topics. There is an amount of vetting and voting, but more importantly you typically have to do a lot of synthesis and analysis in your search as you look over various cases and decide if the problem/solution applies to your specific context, or if not, you ask interactive questions yourself with other people.
This is not the same as the illusion/performance of a ‘conversation’ with an LLM, which will ‘magically’ generate something that has the appearance of a solution specific to your problem, but by way of algorithms that are only designed to approximate coherent language.
Unfortunately it seems, again to the original point, far too many people will make this same mistake.
To the Tester: Embracing the Shift with Open Arms
February 18, 2024
Dear Tester,
As someone deeply embedded in the fabric of software development, I find myself compelled to engage with your reflections on the impending “promotion” of developers in the age of Large Language Models (LLMs) like GPT and its kin. Allow me to offer an alternative perspective, hoping to illuminate the other side of the coin you’ve so meticulously examined.
Firstly, let’s acknowledge the ever-changing landscape of technology, a realm where evolution is the only constant. The advent of LLMs and AI in software development, while daunting, is not a detour but a continuation of this journey. Historical shifts – from the birth of compilers to the rise of integrated development environments (IDEs) – have invariably nudged developers towards more creative shores, freeing us from the tedium of manual labor to engage in more complex and inventive problem-solving.
Viewing LLMs as tools that augment rather than replace our capabilities is essential. These models promise to liberate us further, automating the mundane and allowing us to dedicate our intellect and creativity to higher-order design challenges and solutions. This is not a demotion but an elevation of our role, pushing us towards innovation and creative thinking.
The concern that LLMs will churn out “mountains of mediocre and unreliable code” is to overlook the iterative nature of technology and the symbiotic relationship between human intelligence and artificial assistance. Through continuous feedback and collaborative refinement between developers and AI, we can enhance the quality of code generated, steering these tools towards reliability and efficiency.
Moreover, the integration of AI into our workflows does not signal a reduction in our responsibilities but a shift towards more strategic and impactful endeavors. Architecture design, system integration, user experience – these are domains ripe for human ingenuity, areas where the nuanced understanding of human needs and technological capabilities converge. Testers, too, will find their roles enriched, tasked with navigating the complexities of AI-generated code to safeguard the alignment of our tools with our values and needs.
The prospect of overseeing “barely competent programmers” in the form of AI, as you put it, underscores not a future of frustration but one of opportunity. It invites us to engage in continuous learning and to adapt our skills and methodologies to harness the potential of AI effectively. This is a call to elevate our understanding and to shape the tools at our disposal to better suit our aims.
The ethical deployment of AI in software development is a shared responsibility, requiring us to ensure that these tools are developed and utilized in ways that are transparent, equitable, and aligned with societal values. Addressing biases, ensuring privacy, and preventing misuse are challenges we must meet head-on, guiding our technological evolution with a moral compass.
Rather than viewing the integration of LLMs into software development with apprehension, let us embrace it as an opportunity for growth, innovation, and the redefinition of our roles. Far from relegating us to mere overseers of code, AI offers us the keys to new kingdoms of creativity and problem-solving. By welcoming innovation, championing continuous learning, and adhering to ethical principles, we can navigate the challenges and seize the opportunities that lie ahead.
With respect and optimism for our shared future,
A Fellow Tester (with a development mindset)
It looks as though this post was written by ChatGPT. “The ever-changing landscape of technology, a realm where evolution is the only constant” is something of a tell.
“Viewing LLMs as tools that augment rather than replace our capabilities is essential. These models promise to liberate us further, automating the mundane and allowing us to dedicate our intellect and creativity to higher-order design challenges and solutions.”
Well… really? I mean: do you consider replying here to be a mundane task? And what is the current status of that promise?
“The concern that LLMs will churn out ‘mountains of mediocre and unreliable code’ is to overlook the iterative nature of technology and the symbiotic relationship between human intelligence and artificial assistance.”
No, it is not. To be concerned is the opposite of overlooking something.
“The prospect of overseeing ‘barely competent programmers’ in the form of AI, as you put it, underscores not a future of frustration but one of opportunity.”
This is some strange meaning of “opportunity” that I hadn’t previously been familiar with.
“The ethical deployment of AI in software development is a shared responsibility, requiring us to ensure that these tools are developed and utilized in ways that are transparent, equitable, and aligned with societal values.”
That’s right. And that requires critique, not just cheerleading — and especially not just cheerleading written by a bot.
It’s hard for me to take this reply seriously, MJB, because I don’t believe that you wrote it. It looks like you got an LLM to write it, and you merely approved it. The message that you’re sending, implicitly, is that you’re not really willing to take the time and effort to engage your own mind; let the bot do the work. What if I were to reply with a bot-generated reply of my own? Would you read it? Would you take it seriously? Is it a good idea to let the bots fight it out? Would either one of us learn anything from that?
I recommend this article from Cory Doctorow to help you understand why I think that approach is a problem.
I wonder what the response would be if the LLM was instructed to write a “pessimistic” reply. Can it?
My stab at it would be something like “In a world where everyone can write a symphony, most people still won’t, and in fact writing a symphony will look less rewarding than it does now.” Or to put it another way: television, and then the Internet, *did* unlock a Cambrian explosion of human creativity and ingenuity, just not the kind you expected and, increasingly, not the kind you wanted.
“I wonder what the response would be if the LLM was instructed to write a ‘pessimistic’ reply. Can it?”
Oh, probably. And, as we might anticipate, it will be a fairly banal and uninspiring pessimistic reply.
“television, and then the Internet, *did* unlock a Cambrian explosion of human creativity and ingenuity, just not the kind you expected and, increasingly, not the kind you wanted.”
Yes. I don’t think I could match the succinctness and clarity of Paul Virilio’s observation: “When you invent the ship, you also invent the shipwreck; when you invent the plane you also invent the plane crash; and when you invent electricity, you invent electrocution… Every technology carries its own negativity, which is invented at the same time as technical progress.”