
A Reply to “Running a crowd-sourced experiment on using LLMs for testing”

This post and the ones that follow represent an expansion on a thread I started on LinkedIn.

On September 30, 2023, Vipul Kocher — a fellow with whom I have been on friendly terms since I visited his company and his family for lunch in Delhi about 15 years ago — posted a kind of testing challenge on LinkedIn.

I strongly encourage you to read the post. I’ll begin by quoting a few excerpts that I believe are relevant here.

Vipul says that he is producing a “tool or toolchain for AI governance, testing of AI models and generally improve the model quality using generative AI.” He says that he is “totally sold on AI/ML”. That makes him what I’ll call a Supporter, a term that I’ll explain below. Vipul believes that “AI will be able to do in near future what a very large percentage of testers do today, for test design and execution.”

There are at least two ways to take that. One is that AI is, simply, wonderful and magical, capable of things that require skilled human beings. Another is that irrespective of AI’s wonderfulness, the work that a large percentage of testers do today is something much less than wonderful. Due to the long-term trend towards commodification and dumbing-down of what testing could be, I fear Vipul might have a point, but not in a good way.

A detailed reply to posts like Vipul’s is incredibly difficult. It takes enormous amounts of time to respond to the sheer volume of stuff that LLMs produce. Before I respond to the experiment as such, I’ll explain some structural problems with the testing and critique of LLMs. And, because producing a report and an apologia (look it up; I had to!) is hard work, I get a little grumpy towards the end of this post.

Deep Analysis Is Work

The first premise: people’s attention and patience have always been limited, and thanks to electronic media, people are less attentive and patient than ever. Consequently, very few people will read all of any LLM’s output in detail. Similarly, very few people will read my analysis here. Practically no one will look more deeply into the support for my argument, nor will they read all of the references that I supply. And, certainly by their own lights but also objectively, why should they? I’m giving other people more work to do, and except for a few people, that work isn’t at all fun. Why should people do what I would like them to do?

Supporters and Critics

Next: some people look at testing software products with the premise that we’re here to confirm that things are good; to validate a product. Let’s call those people “Supporters”. When it comes to testing, their priority is confirmation and validation; showing that the product can work; demonstration, far more than experiment. The idea of a test is to get the right answer when we ask “is this output correct?”

Other people believe that testing is about studying the product and investigating risk. Let’s call those people “Critics”. Critics will be more inclined towards the idea that there might be problems in technology, and will be disposed to attempt to falsify the belief that things are good. Their priority is doubt and investigation; getting experience with the product; exploration; experimentation; learning. The motivation for a test is to ask the right questions to answer a dominating question: is there a problem here?

Different Models of Testing

The Supporter mindset and the Critic mindset lead immediately to different models of testing work. Supporters will tend to prefer a routine, factory-like approach, modeled on a manufacturing perspective on evaluating quality: checking instances of the product at the end of an assembly line to see whether each is consistent with the original instance.

Software development isn’t much like an assembly line, though. Duplicating a software product is fairly trivial; invoke the copy command and off we go. In software, the job is building and testing the master copy. Software development is far more like a design studio; an inventor’s lab; a rehearsal hall. Critics will therefore apply a different model of testing work: trying to learn deeply about the product, how people might get value from it, and in particular how value might be threatened.

There are other dimensions to the models, which are outlined here, in our table explaining how Rapid Software Testing differs from factory-style testing. But for the most part, the Supporters won’t read that!

And if they do, there’s a good chance that the Critic’s perspective will introduce so much cognitive dissonance in the Supporters that they’ll simply reject or ignore the whole idea.

Testing and Test Cases

Supporters tend to organize testing around formally declared, procedurally structured test cases. Critics might apply test cases of that kind for some purposes (automated checking, for instance), but won’t make test cases central to the work.
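
To make the contrast concrete, here’s a minimal sketch, in Python, of the kind of formally declared, procedurally structured check I’m describing. The product function and the values here are hypothetical, my own illustration rather than anything from Vipul’s experiment:

    # A confirmatory, procedurally structured check (pytest style).
    # Hypothetical example: a toy banking-domain validation rule.

    def is_valid_account_number(s: str) -> bool:
        """Toy product code: accept exactly ten ASCII digits."""
        return len(s) == 10 and s.isascii() and s.isdigit()

    def test_accepts_typical_account_number():
        # Confirms one anticipated fact about the product...
        assert is_valid_account_number("1234567890")

    def test_rejects_short_account_number():
        # ...and confirms one anticipated way it could fail.
        assert not is_valid_account_number("12345")

Such checks can run automatically and pass reliably, and still say nothing about whatever the tester didn’t anticipate: leading whitespace, Unicode digits, empty strings, or whether “exactly ten digits” is even the right rule. That gap between confirming the anticipated and investigating the unanticipated is where the two mindsets diverge.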

For Critics, test cases are not testing; testing is about analyzing a product, learning about it, and finding problems and risk. To support that perspective, Critics will point to articles like “Test Cases Are Not Testing: Toward a Culture of Test Performance” (James Bach and Aaron Hodder). But for the most part, the Supporters won’t read that!

Evaluating What We See

Supporters who believe that testing is about rote confirmation of facts expressed in test cases will look at output from LLMs and say “Looks okay to me” or even “This is great!”, then leave things there, because they don’t have test cases. No test cases means no failing test cases, which means no problem.

Critics who believe that testing is primarily about looking for problems that matter will realize that the testing of LLMs can’t be reduced to test cases, so Critics will apply, develop, and refine a set of models for finding trouble in the output, like the “syndromes” on the second page here.

Critics will consider the risk that LLMs are, at times, incurious, placating, hallucinating, incorrect, capricious, forgetful, redundant, incongruent, negligent, opaque, non-responsive, blind, vacuous, or biased. They’ll point to long-term and potentially unresolvable problems in LLMs, to critical assessments of AI generally, and to socio-psychological issues like how people might prefer answers from LLMs even when the answers are incorrect! Critics will cite works to that effect.

Critics may also point unhumbly to their own critical evaluations of LLMs, and of reports from the field.

But for the most part, whether they’re written by Critics from the testing community or Critics from the AI community, Supporters won’t read those works!

As Upton Sinclair is quoted as saying, “It is difficult to get a man to understand something when his salary depends upon his not understanding it.” It’s not just a matter of salary, though; if our world view is grounded in beliefs that we’ve established over years or decades, it’s typically going to be tough for anyone to shake them.

In the unlikely event that Supporters do read critical analyses of LLMs, there’s a very strong possibility that for some of the most enthusiastic Supporters, cognitive dissonance will kick in again. “I really like LLMs, and if there are deep problems in them, I’d be foolish to give them my unreserved support. And I’m not foolish, so there’s got to be another explanation. How about this: the Critics are just being Great Big Meanies.”

(Now you could say that cognitive dissonance afflicts Critics, too, and you’d be entirely right; everyone has a hard time letting go of longstanding preconceptions and biases. A key difference is that it’s pretty much okay if the Critics turn out to be wrong eventually. When the Bad Things that the Critics warned about don’t happen, nobody gets harmed by those Bad Things. The trouble is that if the Critics are right, lots of people can be harmed to varying degrees, including the Supporters.)

As I’ve said before, testing is socially challenging in two ways: We’re challenging the product in a social context, and it’s socially awkward for Critics to bring clouds over the sunny world view of the Supporters. Plus, in the world, Supporters vastly outnumber Critics.

The Role of Expertise

In Vipul’s post, the domain at which Bard is being targeted is banking. Vipul acknowledges that he doesn’t have much experience in that domain. To Vipul’s credit, he does at least mention this issue, and shows at least some awareness of its significance. Many Supporters who are testing a product will not worry themselves too much about their lack of expertise in a given domain, and won’t spend time learning about it. In the minds of Supporters, the problem of lack of knowledge can be wallpapered over by test cases written by people who do have the requisite information.

Critics will be looking for problems not only in the product, but also in their own knowledge of the domain. Critics will systematically doubt that a set of test cases is okay, worthy, or impressive until they have addressed substantial doubt about their own expertise, and have immersed themselves in the worlds of the people who are using the product. (For my part, I’ve done just enough work in banking and finance to know how shallow the output from Vipul’s experiment is.)

Critics acknowledge that they might not have contributory expertise (the capacity to perform the work in a particular domain, like banking or programming). But Critics will aspire to learn about a domain to the level of interactional expertise (the capacity to speak and understand the language of a particular domain, without the capacity to perform the work).

Rethinking Expertise (Collins, Evans)
Are We All Scientific Experts Now? (Collins)

But the Supporters won’t read those! Developing expertise takes too much time — and considering the nature of expertise takes even more time!

Reporting and Why It Doesn’t Happen

For Vipul’s experiment and other instances like it, it’s highly likely that neither the Supporters nor the Critics will produce a detailed report, but for different reasons.

Supporters won’t feel that they have anything in particular to report on. Things look good and impressive! Once again, there are no test cases, so no failing test cases, so everything’s fine, so no report.

Critics will skim the output, applying an analytic, critical stance, and see numerous instances of the syndromes right away. Critics will quickly be overwhelmed by the volume of problems (too many to list), so no report. Certainly not a detailed report. The Critics’ reluctance stems from the effort pointed to by Brandolini’s Law:

The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.

Wikipedia, Brandolini’s Law

But the Supporters won’t read that! Or if they do, cognitive dissonance will go to work yet again. “The Critics are insinuating that the LLM’s output is bullshit, and if I were to support bullshit, I’d be a Bullshitter. And I’m no Bullshitter, so the only explanation is that the Critics are Great Big Meanies!”

On Bullshit and Bullshitters

I say this as a friendly greeting from the Critics: take heart, dear Supporters. Take heart. You’re not a Bullshitter unless or until you decide to speak with reckless disregard for truth. You probably don’t do that. Do you?

Calling Bullshit (Bergstrom, West)

On Bullshit (Frankfurt)

To be a true Supporter means to worry about the possibility that you’re fooling yourself, and that in doing so, you might be helping others to be fooled. Those things might be okay if no harm will come to others because someone has been fooled. When a magician performs a magic trick, we might be fooled, but no one gets hurt. No problem there.

When we are developing tools and technologies that have the potential for bringing harm to people, we must acknowledge the possibility of fooling ourselves. Confronting that possibility and seeking out problems — and dealing with them — is the responsible and ethical thing to do. Whatever we might support (including, somewhat paradoxically, the Critic stance), if we wish to consider ourselves responsible, we must embrace thoughtful and reasoned responses from people with the opposite mindset.

It’s okay for Supporters and Critics alike to be naïve or mistaken at first. Real Supporters respond to concerns about problems and risk, because that’s the first step in confronting those problems and addressing them responsibly. Real Critics respond to concerns about errors in their work when Supporters point them out. Remember, to be a Bullshitter means to say things with reckless disregard for truth — to reject responsibility.

If you look at the replies to Vipul’s post and others on similar topics, you will see several thoughtful and reasonable replies from Supporters and Critics. We may disagree, but we’ll disagree on substance. I hope those conversations can continue vigorously and cordially.

Alas, you may also see some bullshit replies. Some might ask, “Are you saying that what I (or someone else) said is bullshit? Are you calling me (or someone else) a Bullshitter?” If you’re asking, you can figure out the answer for yourself, using what I will call The Bullshit Test.

  • Are the replies glib, dismissive, irrelevant, or non-sequiturs?
  • Do the replies avoid responding to what’s being said with detailed, reasoned arguments?
  • Are the replies insubstantial and trivial, favouring things like hashtags and emojis over well-formed sentences and paragraphs?
  • Do the replies put ad hominem focus on a particular person or group, rather than focusing ad rem on the work or specific statements?

If the answer to any of these questions is Yes, then I would call bullshit. And if someone dispenses bullshit consistently, although I’d be running the risk of Fundamental Attribution Error, I’d offer that that person is a Bullshitter.

(Of course, there’s every possibility that someone taking a Critic’s position might also be a Bullshitter. The good news is that the Bullshit Test applies to Supporters and Critics alike.)

Let’s be clear: I have no time for Bullshitters or their bullshit. They represent a denial-of-service attack on the time and intelligence of both Critics and Supporters. My approach is to ignore them, to refrain from responding, and to move on to discussion with responsible people. To reply to bullshit would simply prolong an already polluted discussion. That is: I enjoy discussions and even arguments with reasonable Supporters. Those discussions can help us all to learn. Replying to Bullshitters, however, is not a service that I offer.

Besides, in my observation and experience, Bullshitters won’t listen to Critics, and won’t engage them in a substantive way. There’s a chance that they might listen to Supporters. I note that bullshit coming from the Supporter side risks giving Supporters a bad name. So, dear Supporters, dealing with the Bullshitters might be up to you.

In my next post, I’ll do what I’ve predicted neither most Supporters nor most Critics will do: I’ll provide an analysis of Vipul’s experiment and report on my process.

2 replies to “A Reply to “Running a crowd-sourced experiment on using LLMs for testing””

  1. In my opinion, the whole discussion (this and many other threads) can be summarized in one question: why do you create (or want to create) test cases? That’s the basis for me. Going by the great text above, I believe that I’m one of the Critics (I really hope so), and to be honest, I also create test cases (in some cases more often, in some cases not at all). But I do this personally; it’s not done by ChatGPT/AI/LLM/whatever magic abbreviation. Why? Because I am just a human and can forget to test something in the next version of the software. For me, test cases are a kind of map or checklist of things which I have to test. Moreover, I believe such test cases can help my colleague to test the application, because maybe he just didn’t think about these things (paths), because we are just humans. So in my opinion, testing can be performed by using test cases, but… I have to state one small assumption here… testing should not, and cannot, be limited to just test case execution. We as testers have to explore software and test it using our experience and skills.
    People think that if they generate test cases using AI/ChatGPT (even if the generated content turns out to be great; I am not an expert in this domain), they will have everything they need to test the software. No, they won’t. In the best case, they’ll get some… road map? Points to check the most common user paths? Nothing more. Finally, that’s why I think that test cases are not bad; it’s just that people use them in a bad way and/or with a bad mindset and approach.

