A little while back, on LinkedIn, Jason Arbon posted a long article that included a lengthy conversation he had with ChatGPT. The teaser for the article is “A little humility and curiosity will keep you one step ahead of the competition — and the machines.” The title of the article is “Testing: Bolt-on AI” and in Jason’s post linking to it, I’m tagged, along with my Rapid Software Testing colleague James Bach. You can read all the way through Jason’s article here, if you have a spare 41 minutes (according to a LinkedIn algorithm).
It didn’t take us very long to get the feeling that we wanted to dismiss Jason’s post; that feeling arrived more or less instantly with the title and the tagging, which we perceived as baiting or trolling. In Rapid Software Testing, we use our feelings as powerful pointers to trouble, but when organizations’ products and reputation are on the line, sometimes we must think critically and investigate more deeply to determine whether those feelings are warranted, and to decide what to do about them.
Having read the post, we decided to apply deep analysis to it, and to prepare two reports. These took something on the order of 40 man-hours. Some people will no doubt consider this level of effort excessive. Yet had we not applied it, some people (some of the same people, no doubt) would have considered a brisk and dismissive reply insufficient. Nonetheless, James and I found a good deal of value in performing the exercise when we framed it in terms of preparation for testing of Large Language Models (LLMs) in real-life situations, and helping others to do it.
In this post, I’ll respond to some of Jason’s more intriguing statements and to general patterns in his conversation with ChatGPT. The detailed analysis of that conversation appears on James’ site. James and I have collaborated on both that document and on this post. Our focus here is on the rest of Jason’s article and its context.
Some background on Jason. He’s one of three authors of How Google Tests Software. Over the last several years, he’s been developing AI tools for testers. The sites for this work include test.ai, which appears to be moribund as of this writing; and his current project, testers.ai, of which we can learn little at the moment. There is a pricing page, but no real information on what the product actually does; not even on what it’s intended to do, other than a link to a Medium post.
Along the way, Jason has taken on some intriguing initiatives. A while back, he generated 1980s-video-game-style pixellated images of himself, me, James, and other testers, all covered by non-fungible tokens (NFTs). He didn’t ask us (or anyone else, so far as I know) for our feedback or consent on that little project. It’s not clear if he took this approach because he believes that he himself is a leader in testing, or if he is attempting to ride piggyback on the reputations of others.
In the Medium article linked from the testers.ai site, Jason is using some of those images to help promote his idea: “converting the brains of the best testers into AI”. (Is it really doing that? Can it really do that?) Here, Jason poses the question “If we could digitize the best testers and unleash them on all our apps, how much better might our software be? The odds of encountering basic bugs in apps would decrease dramatically, ultimately leading to better user experiences and a more efficient world.”
Well, that certainly might be nice. I’ve been in the software business for more than three decades now. All the way along I’ve been hearing claims about the ways in which various forms of software will reduce or eliminate the need for skilled testers. But progress so far has been limited, and the claims haven’t been met.
This was going on for a generation or two before I got into software. One of Jerry Weinberg’s favourite stories was about a project at IBM. The idea was to take a program (call it X), and feed it to another program (Y). X was to be the only input to Y; nothing else. Y would process X, and the output would be X’ — X, but with all the bugs taken out. Apart from all the other problems with this idea, the people working on the project had apparently never read about the Halting Problem.
Jerry was called in to consult on this project, which by his account had already burned through $10 million. Jerry’s response was practically instantaneous: “Kill it,” he said. The project team didn’t take Jerry’s advice immediately. First, they spent another $10 million. Then they killed it.
But I digress. Ignoring the point that testers find bugs in the product but don’t fix them (developers do that), what would actually be required to “digitize the best testers”? That would require a good model of how to digitize a human—which reminds me of another of Jerry’s aphorisms: never automate a process you don’t understand.
Jason’s article is an attempt to discuss and demonstrate ChatGPT’s potential helpfulness in testing. He wrote his post in response to a post of my own (estimated reading time: 4.5 minutes). Stirred in with a good deal of snark (mostly directed at James), Jason had some kind words to say about my post; and at other times he has said several complimentary things about me and our Critical Thinking for Testers class. Thanks to him for that.
Yet it’s hard—really hard—for us to respond to his article, first because of the volume of text, and then because of the volume of problems in it. As experienced testers know, a big product is hard enough to test, but when it also has lots of bugs, recording and reporting the bugs can take an enormous amount of time if we want to be thorough about it, and to provide sound reasons for why we believe they’re bugs.
One of the first and biggest problems is that ChatGPT generates text irresponsibly, with reckless disregard for the truth. That’s not surprising, because ChatGPT doesn’t have a concept of truth, nor of responsibility. That is, by Harry Frankfurt’s definition, ChatGPT generates bullshit. And bullshit is subject to Brandolini’s Law:
The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.
Not only does ChatGPT generate bullshit; it does so at superhuman rates, regardless of how simple and quick the prompts might be. Thus a thorough reply to ChatGPT’s accelerated bullshit requires more than an order of magnitude of the effort required to produce it.
That fact should be a real concern for all of us — especially for the future. ChatGPT gets huge amounts of its training data from the internet, which means that in the future, ChatGPT will be retrained on data that comes from… ChatGPT.
If you’ve read Jason’s article quickly, or haven’t read it, or don’t want to read it: as mentioned above, there’s an account of the ChatGPT-conversation part of it on James’ site. That account consists mostly of a table on which we spent that more-than-an-order-of-magnitude effort to analyze each of Jason’s prompts and ChatGPT’s responses critically.
As we went through the conversation, we did not simply evaluate each prompt and response. We also developed and refined a list of syndromes to describe the kinds of problems that ChatGPT exhibited. We designed and performed experiments to illuminate patterns in these syndromes and help us to understand them. We wrote code to allow us to repeat the experiments consistently. Jason doesn’t report that he did anything like that.
Here’s the summary of our analysis: ChatGPT’s performance is mostly poor, and often spectacularly poor, and often in very interesting ways. Yet Jason says “ChatGPT’s responses were better than what I would have added.” Considering that he’s co-author of a book called “How Google Tests Software”, I had to wince at that. Does he believe it? Or did he simply refrain from reading the responses carefully? Jason doesn’t really tell us.
He goes on: “Might [ChatGPT’s responses] be better than; your teammates’? yours’? someone you consider a testing expert?” Well, they might. One catch is that you would need analytical skills, critical thinking, and testing expertise to decide whether you’ve got a worthy response—and without that expertise, you might decide that ChatGPT’s more trivial or cockeyed responses are adequate. Nonetheless, we found in a few instances that parts of ChatGPT’s responses appeared reasonable.
Jason continues: “We often give humans the courtesy of explaining themself, elaborating, and iterating on a topic in conversation. Give the same opportunity to ChatGPT. The smarter the questions, the more useful it is. Just like when mentoring freshies, a few leading questions can go a long way, and working together you are better than two individuals in parallel.”
That last statement is a claim, framed without context. Claims of this nature can be true if you have sufficient expertise; if you have an appropriate, constructive relationship with the tool; if you know to be appropriately skeptical of the tool’s output; and if you’re willing to take responsibility for the outcome. Or, to put it in Jason’s terms, if you’re a non-expert mentoring a fresher, the responses from the fresher and your conclusions may be better. Or they may be substantially worse—and you won’t know it.
So the claim represents a contradiction in Jason’s discourse: at the same time as we’re being asked to consider ChatGPT as something that can beat the experts, we’re also obliged to ask it smarter questions and mentor it as though it were a novice.
And a novice it is. All through the conversation, ChatGPT replies with problematic answers. Sometimes Jason prompts it into other answers — some of which do represent a modest improvement — until he gets an answer that he seems to consider good enough, and moves on.
Why does Jason do this? It’s because he sensibly recognizes ChatGPT’s answers as being wanting in some way. That awareness is available to him because he has some degree of testing expertise (which, strangely, he seems to disclaim from time to time).
Whatever Jason’s level of expertise, ChatGPT doesn’t have it. If ChatGPT had any real testing expertise or conversational skill, it would have refrained from answering the original poll questions; it would have dismissed the simplistic notion of a right answer to the question; and it would have begun its replies by asking questions of its own. It wouldn’t need to be prompted or goaded into asking those questions, or providing better answers. To the degree that ChatGPT provides a “right” answer, Jason plays the role of Wilhelm von Osten to ChatGPT’s Clever Hans.
Towards the end of the post, Jason says that “You can access testing experts’ knowledge, and even its gaps, by simply asking.” Maybe you can. There’s a catch, though: ChatGPT’s training data comes not only from experts, but from everyone and everything else on the Web. Much of the time—and especially when it comes to testing—a large portion of what’s out there is simply junk.
It’s a very safe bet that almost none of ChatGPT’s training data on the topic of testing—on all kinds of other topics, for that matter—was assessed by experts. There’s simply too much training data to do that. Meanwhile, ChatGPT itself doesn’t have expertise to filter the bad training data from the good. It doesn’t have a means of evaluating its input or its output. It is not aware of context. It is not self-critical, even when its placating responses give the illusion of self-criticism — and at that, only after a prompt.
ChatGPT is not designed to produce answers that are right. It’s designed to produce answers that sound good. The danger here is that in a real-life situation, without expert supervision, ChatGPT is just another novice parroting folklore.
Jason says “If software testers avoided vanity and emotionally charged and opinionated chats with AI (and people) and focused on technical, productive chats with some humility, they would realize that AI is getting better every day. Of course, great testers shouldn’t blindly trust any information. Not from AI, not from Google results/ads, not from co-workers, not even their own intuitions. But, it is wise to check, re-check, and test and re-test software.”
Indeed; it would be wise to test software, and when someone claims that ChatGPT is a viable aid to testers, it would be wise to test that claim too. Alas, Jason didn’t do that; he left that task to us.
Jason’s conversation was an attempt to demonstrate that he was better with ChatGPT than without it. When we looked, we found he wasn’t demonstrating that. He didn’t compare and contrast ChatGPT’s output to his own answers to the questions he asked it, since he didn’t provide his own answers. He didn’t attempt to extend, improve, transform, or illuminate ChatGPT’s answers. He didn’t provide a prompt that expressed doubt in ChatGPT’s answers (which, via experimentation, we found can sometimes trigger productive revisions in its output).
More to the point, Jason’s conversation wasn’t a test. Testing involves configuring, operating, observing and evaluating a product. Testing involves critical examination of a product, probing for problems and risk. Jason’s report includes no indication that he did that. Other than copying and pasting verbatim text from the conversation, Jason didn’t identify a single problem in the entire conversation! It’s not clear that he looked for problems at all, and he certainly didn’t report them. Again, he left that to the reader and to us.
I don’t know how long it took Jason to perform the exercise; likely not more than 20 minutes or so. Late in the post, he says
“It is worth noting that I had this conversation with ChatGPT while casually sipping on a Hazelnut Latte in the parking lot of a Starbucks — a lot faster and cheaper (even more friendly!) than attending a seminar, reading a book, or even reading this post.”
The conversation about this single, fairly trivial multiple-choice question on boundary testing runs to almost 9000 words — around a tenth of the total length of How Google Tests Software, by a reasonable estimate. Generating that volume of output is indeed really fast and cheap. Skimming through it can be pretty quick, and when you skim, the output looks like plausible responses.
But what if you were to read the output carefully? What if you were to take it seriously? What if you — or your organization — were to depend on it? What if you were to treat it like a real seminar, a real book, or Jason’s blog post? You would probably be assessing it; weighing whether the presenter, author, or poster were making coherent points. You would be on the lookout not only for the valid points, but invalid ones too.
The flaws that we discovered in ChatGPT’s answers suggest that deep review is essential in order to find serious and subtle problems in its output. Using ChatGPT to produce 30 printed pages of output is fast and easy. Reviewing the output critically, in detail, takes a lot longer. With respect to the LinkedIn poll question that started all this, are these 9000 valuable words? How much of ChatGPT’s response is signal, and how much is noise? In other cases, when we’re dealing with a testing problem that matters, will this really be a time-saving thing?
There’s one more very odd claim towards the end of Jason’s article: “…you can get custom questions answered, and specialized to your specific testing problem. You can even ask ‘why’, without starting a pseudo-socratic discussion.” But the entire conversation between Jason and ChatGPT is precisely that! All the way through, Jason asks questions, ChatGPT replies, and Jason asks more questions to get answers of the nature that he is looking for. That is literally a (pseudo-)Socratic discussion! Perhaps by “without starting a pseudo-socratic discussion”, Jason means “without being challenged”; we don’t know.
In the process of analyzing the conversation, we developed some tooling by which we could interface with the OpenAI API and systematically perform experiments: repeating prompts; adjusting temperature levels and other settings; logging the output for analysis; and so forth. James and I argued over the validity and fairness of our assessment, both of ChatGPT’s output and of Jason’s prompting. We read academic papers and posts on ways that others assess LLMs (many, perhaps most, resort to mathematical and statistical models and benchmarks, it seems; too few consider the validity and applicability of those models to generated text where the reader is actively engaged in conversation with the tool).
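A minimal sketch of what such a harness can look like appears below. It assumes the openai Python package as it stood in mid-2023 (before the 1.0 release); the model name, prompt, temperatures, and log-file name are illustrative placeholders, not the ones we actually used.

```python
# Minimal sketch of a repeat-and-log harness (illustrative only; not our actual tooling).
# Assumes the openai Python package before its 1.0 release, and that the OPENAI_API_KEY
# environment variable is set. Model, prompt, temperatures, and file name are placeholders.
import json
import time

import openai  # pip install "openai<1.0"


def run_trials(prompt,
               model="gpt-3.5-turbo",
               temperatures=(0.0, 0.7, 1.0),
               repetitions=5,
               logfile="trials.jsonl"):
    """Send the same prompt repeatedly at each temperature; log every raw reply."""
    with open(logfile, "a", encoding="utf-8") as log:
        for temperature in temperatures:
            for trial in range(repetitions):
                response = openai.ChatCompletion.create(
                    model=model,
                    temperature=temperature,
                    messages=[{"role": "user", "content": prompt}],
                )
                record = {
                    "timestamp": time.time(),
                    "model": model,
                    "temperature": temperature,
                    "trial": trial,
                    "prompt": prompt,
                    "reply": response["choices"][0]["message"]["content"],
                }
                log.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    run_trials("What boundary values would you test for a field that accepts integers from 1 to 100?")
```

The point of logging the settings along with the raw output is that the analysis and qualitative coding can happen later, and a run can be repeated under nominally identical conditions.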
We also developed a preliminary set of guideword heuristics for “syndromes” — consistent patterns of problems that we have observed and can now watch for systematically:
| Syndrome | Description |
|---|---|
| Incuriosity | Avoids asking questions; does not seek clarification. |
| Placation | Immediately changes its answer whenever any concern is shown about that answer. |
| Hallucination | Invents facts; makes reckless assumptions. |
| Arrogance | Confident assertion of an untrue statement, especially in the face of user skepticism. |
| Incorrectness | Provides answers that are demonstrably wrong in some way (e.g., counter to known facts, math errors, reliance on obsolete training data). |
| Capriciousness | Cannot reliably give a consistent answer to a similar question in similar circumstances. |
| Forgetfulness | Appears not to remember its earlier output, and rarely refers to it; limited to data within its token window. |
| Redundancy | Needlessly repeats the same information within a response or across responses in the same conversation. |
| Incongruence | Does not apply its own stated processes and advice to its own actual process. For instance, it may declare that it made a mistake, state a different process for fixing the problem, then fail to perform that process and make the same mistake again, or commit a new one. |
| Negligence/Laziness | Gives answers that have important omissions; fails to warn about nuances and critical ambiguities. |
| Opacity | Gives little guidance about the reasoning behind its answers; unable to elaborate when challenged. |
| Unteachability | Cannot be improved through discussion or debate. |
| Non-responsiveness | Provides answers that do not address the question posed in the prompt. |
| Blindness | Cannot reason about diagrams and pictures, nor even accept them as input. |
| Vacuousness | Provides text that communicates no useful information. |
Note that these problems do not appear in every response to every prompt; they happen intermittently. The intermittence is a consequence of a legitimate feature of ChatGPT: it is designed to avoid giving the same answer every time, so that it appears more like a real conversationalist. (Some people actually prefer ChatGPT’s wrong answers to right answers from StackOverflow!) This makes testing ChatGPT particularly challenging: variation in the output is intentional and intrinsic. Formally scripted, simple, binary pass-or-fail output checking depends on repeated, consistent, reliable output that isn’t available in this context.
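As an illustration, here is another sketch (under the same assumptions as the one above, reading the hypothetical trials.jsonl log that it wrote): a crude count of how many distinct replies the identical prompt produced at each temperature setting.

```python
# Sketch: group logged replies by (prompt, temperature) and count how many distinct
# replies the identical prompt produced. Reads the trials.jsonl placeholder log
# written by the earlier sketch; field names are the same illustrative ones.
import json
from collections import defaultdict


def distinct_replies(logfile="trials.jsonl"):
    groups = defaultdict(set)
    with open(logfile, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            groups[(record["prompt"], record["temperature"])].add(record["reply"])
    for (prompt, temperature), replies in sorted(groups.items()):
        print(f"temperature={temperature}: {len(replies)} distinct replies "
              f"to the same prompt: {prompt[:60]}...")


if __name__ == "__main__":
    distinct_replies()
```

A count like this can flag inconsistency, but it cannot tell you whether any given reply is reasonable; that still requires a person reading the output critically.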
We believe that cataloging these pathologies will help in future risk analysis of software that incorporates LLMs. Awareness of these pathologies and risks might help to steer us away from danger, and towards safer and more productive use of LLMs in testing and in other applications.
One future step in our work is to examine more of the academic literature with an eye towards extending and refining our list for working testers. We’re anticipating further development of tools for analysing and coding (“coding” in a qualitative research sense, not programming) ChatGPT’s output in a systematic way. We’re also looking to perform experiments in which we do deep testing of code generated by LLMs — and to identify patterns that might render that code more or less risky.
More experience reports will follow.
Postscript, 2023-08-11: We attempted to converse with Jason before we published our reports. We started with a LinkedIn messaging thread on July 23, 2023 that continued the next day. From one time to the next he turned us down; turned down the opportunity to schedule a call (citing his own “hard rule against scheduled meetings”); and said “Regardless, I’m not interested in the topic.”
I asked him “What do we have to do to get collegial, constructive public responses from you when we present concerns about ChatGPT, large language models, and AI generally? If you think there are errors or omissions in our work, what do you believe we need to do to get constructive public responses from you instead of snarky ones?”
I got a series of non-constructive answers to that. Nonetheless, we had a conversation the next day (unscheduled, of course). That was reasonably cordial. We had a few courteous exchanges in LinkedIn messaging after that, but then the trolling and the snark started again. We are — I hope understandably — disinclined to reply to Jason until that stops.
In the ever-evolving landscape of technology, the concept of “Bolt-on AI” is a testament to our capacity to adapt and integrate cutting-edge solutions into our established frameworks. It underlines the need for thoughtful planning, thorough testing, and continuous improvement to fully reap the rewards of this innovative approach.
So, let’s see: a comment that looks like it has ignored absolutely every point in the post; a comment that looks like it could have been written by ChatGPT itself (except I’m pretty sure that ChatGPT could have done a better job); a mailing address that doesn’t match the fake name “Amy Jackson”; and a link to a site that offers “access to best-in-class quality assurance experts”.
Can’t your organization offer something better than spam-commenting?
One of the important points to be noted here is that ChatGPT’s training data is a mix of correct and incorrect, right and wrong, junk and good data, with no record kept of its origin. It’s like one writer randomly gathering content from various sources into one massive document about testing without keeping any track of the sources.
Once data gets into the training data set, it is impossible to take it out. There is no “testing” of this data to check which parts are correct and which are not. When a novice tester sees it, he or she will be totally misled.
There appears to be no way we can digitize expert testers’ minds, just as we could not digitize the minds of the likes of Feynman, Einstein, or da Vinci. Such an effort is not likely to have any success in the foreseeable future.
Well, sure; the claim is absurd on the face of it. That’s not going to stop people from trying to raise money to do it.
This is one of the most interesting pieces on AI & software testing I have come across so far. Thank you for the time and effort you guys have put in.