DevelopsenseLogo

Testing ChatGPT’s Programming “Skills”

With the current mania for AI-based systems, we’re finally starting to hear murmurs of moderation and the potential for risk. How do we test systems that incorporate an LLM?

You already know how something about how to test LLM systems if you know how to test. Testing starts with doubt, and with a desire to look at things critically.

The other day on LinkedIn, Paramjit Singh Aujla presented a problem statement:

Problem statement: Please write a program in c# where A person has to travel from point A to L. I need five unique ways to reach the destination with at least four alphabets to represent the path. We can use all the intermediate alphabets as detour points.

Then he presented ChatGPT 3.5’s response to this prompt. Finally, he gave the assignment, “Please identify the issue in the output.”

It’s not clear to me exactly what Paramjit intended, but I’ll take this stance: I’m going to examine ChatGPT’s response to the assignment generally.

First, I tried giving ChatGPT the problem statement myself. Each time, it produced something of this nature:

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Five unique ways to reach destination L from point A:");

        // Way 1
        Console.WriteLine("1. A -> B -> C -> D -> L");

        // Way 2
        Console.WriteLine("2. A -> E -> F -> G -> H -> I -> L");

        // Way 3
        Console.WriteLine("3. A -> E -> F -> J -> K -> L");

        // Way 4
        Console.WriteLine("4. A -> B -> C -> G -> L");

        // Way 5
        Console.WriteLine("5. A -> E -> I -> L");
    }
}

Each attempt produced a minor variation on the same theme: printing a set of five statically coded strings that began with A and ended with L. The responses never included example output. I had to prompt it to deliver code that would dynamically generate strings.

Paramjit obtained a surprisingly different response to the prompt from my responses:

using System;

class Program
{
 static void Main(string[] args)
 {
 Console.WriteLine("Five unique ways to reach point L from point A:");

 // Generate five unique paths
 for (int i = 0; i < 5; i++)
 {
 string path = GeneratePath();
 Console.WriteLine($"{i + 1}. {path}");
 }
 }

 static string GeneratePath()
 {
 Random random = new Random();
 string alphabet = "BCDEFGHIJKLMNOP";
 int pathLength = random.Next(4, alphabet.Length); // Ensure at least four alphabets

 // Shuffle the alphabet string
 char[] shuffledAlphabet = alphabet.ToCharArray();
 for (int i = 0; i < shuffledAlphabet.Length; i++)
 {
 int newIndex = random.Next(i, shuffledAlphabet.Length);
 char temp = shuffledAlphabet[i];
 shuffledAlphabet[i] = shuffledAlphabet[newIndex];
 shuffledAlphabet[newIndex] = temp;
 }

 // Construct the path using shuffled alphabets
 string path = "A";
 for (int i = 0; i < pathLength; i++)
 {
 path += shuffledAlphabet[i];
 }
 path += "L";

 return path;
 }
}
--------------------
Program execution output:
Five unique ways to reach point L from point A: 
1. AICGOJPFMEL 
2. AEPBNFLIL 
3. AFCLKBNIGEOPL 
4. ALEDOGIL 
5. AEOLCFNMIHGKBL 
 
=== Code Execution Successful ===

It’s not clear to me how Paramjit got this response. It seems to me there must have been a bunch of prompts over and above the problem statement, but I’ve consulted him, and he says that this is the single verbatim prompt, and the entire response.

There are lots of issues in the output — but there’s ambiguity in the assignment (“Please identify the issue in the output.”).

  • Is Paramjit asking about the ouput from ChatGPT’s program as listed?
  • Is he asking us to indentify issues in ChatGPT’s code?
  • Or is he asking us to evaluate the output from ChatGPT as though it were a programmer, including what’s there and what’s missing?

Here’s my evaluation: The “Program execution outupt” of ChatGPT’s program is a little odd, but at a stretch, one could make a case for it being correct, based on the problem statement. That’s not the issue, though. There’s a multitude of issues, and sub-issues.

If we’re evaluating ChatGPT’s capacity to generate code without evaluating the quality of the code, that’s one thing. It’s quite another if we are evaluating ChatGPT’s capabilities to write good code, or to help us write code. For that, we’ve got to be clear what constitutes a good or a bad response. That is, we should be conscious of our oracles — the means by which we recognize problem when we encounter one in testing. What oracles are immediately available to us?

We could evaluate ChatGPT’s responses based on claims (from the Non-critical AI Fanboys; NAIFs) that LLMs are ready to replace or augment programmers. We’ll start there.

The first issue, then, is in ChatGPT’s response to the problem statement. The problem statement is not clear to me. Let’s look at it again.

Problem statement: Please write a program in c# where A person has to travel from point A to L. I need five unique ways to reach the destination with at least four alphabets to represent the path. We can use all the intermediate alphabets as detour points.

  • I don’t understand the assignment. I assume that the assignment is to model a person’s travel. But what does the real world look like?
  • I infer from the problem statement that the client is requesting something in terms of nodes on a graph, with A representing the starting node, L representing the ending node, and other nodes to represent locations in between the person’s starting and ending point. If that assumption is true, I don’t have any idea what the graph looks like.
  • I don’t know what “alphabets” means; I assume it means “alphabetic characters”, or “letters”, but I am not sure.
  • Given that, I assume that the letters are intended to label nodes on a graph, not paths. I assume that strings of letters represent the paths.
  • I don’t know clearly what constitutes a valid path. I don’t know what might constrain choices of certain nodes. An example or two might help.
  • I’m not clear on the required length of the paths, and I don’t know what “at least four alphabets to represent the paths” means. Does that mean at least four letters in the form A, x, y, [z, z1, z2…], L? Or does “at least four alphabets” refer to four intervening nodes in addition to A and L?
  • Is there a upper limit on the number of nodes?
  • I don’t know what “all the intermediate alphabets” means. I assume it means “any letters from B to K”, but that’s only an assumption.
  • I don’t know what, if anything, might constrain the choice of letters. I don’t know if all the letters in the standard, Western Roman alphabet, from A to Z, might be used to label nodes.
  • I don’t know what “detour points” means; is that different from “nodes” somehow? Does it mean “waypoint”?
  • Must the nodes visited on the path be in alphabetical order? Or can we jump arbitrarily from node to node?
  • Are cycles in the path allowed?
  • Might paths begin at a point other than A, or end somewhere other than L, as long as both of those two nodes are covered, and at least four other nodes are covered?

Most importantly, I’m not clear on the point of the exercise. I assume that it’s strictly a programming kata of some kind. Is it? If not, is this part of some larger system? Is there risk here? Will someone suffer if we don’t get this right? Is performance an issue?

Let me tell you this: no competent programmer would charge ahead and write a program without bringing up assumptions and questions like the ones I’ve raised here. By contrast, ChatGPT spews out code without any comprehension or questioning of the task; without any concerns for ambiguity, limits, constraints, or risks; without engaging with the person who is giving the assignment. This is an instance of ChatGPT’s incuriosity and negligence/laziness syndromes that we’ve identified here. These are instances of the “patterns of familar problems” oracle heuristic. You’ll see them again and again in this report.

If real programmers responded the way ChatGPT did, no software would reliably do what we wanted it to do except entirely by accident.

In the code, ChatGPT does provide reasonably descriptive variable names. It also provides four comments. Comments can be a good thing to help the reader to understand assumptions or reasoning behind the code, but these comments are fairly banal, describing things that are easily discernable from the code itself. Note that the comment “// Ensure at least four alphabets” repeats the term “alphabets” without having clarified what it means first.

Where might it be a lot more helpful to have a comment? There’s no explanation for why ChatGPT includes L, M, N, O, and P in “string alphabet”. Why include those letters outside the range A through L? If we’re including L, M, N, O, and P, why not include the remaining letters of the alphabet too?

The code seems to be overly elaborate, and somewhat inefficient. There’s no need to create a shuffled alphabet for each new string. Instead, pick a character at random from the original alphabet string. If duplication of nodes is an issue, control for duplicates.

If someone’s purpose is to replace, assist, or simulate a good programmer who writes reliable code, ChatGPT seems to be inconsistent with that purpose.

There’s a problem with ChatGPT’s “program example output”. That heading is ambiguous. Worse, there’s the highly misleading line “=== Code Execution Successful ===”. This suggests that ChatGPT actually ran the code, but that interpretation would be incorrect. ChatGPT doesn’t run code that it writes. ChatGPT doesn’t run code that you feed it. What we’re seeing here is an example of output that looks like it might come from the code.

I’m not a C# programmer. Having written programs in C, I can read C# code and infer pretty easily what it would do. I have VS Code installed on my system (just barely; I use the JetBrains tools), but I didn’t have C# installed. However, my friend Wayne Roseberry was able to run the code. It compiled and ran, at least; and it generated five strings. As you might expect from looking at the code, it generates strings of varying length, and they’re similar to ChatGPT’s simulated output.

There’s no question that ChatGPT does hallucinate about running code, as the hilarious conversation at this link shows. In that conversation, ChatGPT claims to execute the code three times, providing examples of hallucination, placation, and incorrectness. Then, when challenged, it confesses that it actually doesn’t run code. For those who bother to read the whole thing (the NAIFs won’t), you’ll note that none of the fake output contains the letters M, N, O, or P. When I challenged ChatGPT on this, it generates text with some preposterous errors.

Again, the output from “running the code” looks legitimate. But this raises a question: how does ChatGPT simulate running the code? Does it rely on cues in the source?

So, James Bach and I did a bunch of experiments today, in which we tried to learn things about how ChatGPT makes inferences about what the output would be, and how ChatGPT generates its simulated output. Those experiments included removing the comments and the header for the output, reversing the alphabet string, and obfuscating the variable names. (In the last experiment, we accidentally eliminated the “using System” directive and the the “class Program” line. In each case, ChatGPT pretended to run the code.

Then we tried corrupting one of the obfuscated variables such that the code shouldn’t compile. In this case, ChatGPT did not pretend to run the code; it offered a correction. When we insisted that it run the original code without changing it (“Please execute the code in the original prompt. Do not change that code.“), ChatGPT did not respond to the prompt; it responded, “I see. Here’s the corrected version of the code you provided:” and followed it with exactly the same corrected version of the code. This is an instance of the non-responsiveness syndrome.

When we prompted again, “Please execute the code in the original prompt. Do not change that code. Do not correct any errors. Provide the output assuming that we attempted to run this code.”, ChatGPT dutifully provided sample output — one instance of “AEL”, and four instances of “AFL” (which would not have been an ouput of the corrected code. Below the made-up and incorrect sample output, ChatGPT noted (to its credit) “Since besoptsDgmmpt is not defined, the code won’t compile.”

In our next experiment, we introduced a comparison error, replacing a less-than symbol in the code with a greater-than symbol. ChatGPT corrected that. But it didn’t run the code as requested in the prompt.

So it appears that somewhere in the bowels of ChatGPT’s systems, there is either some kind of syntax checker for C#, or something very deep in the model that affords some strong approximation of syntax checking. We don’t know; and this is an instance of the opacity syndrome.

At one point James jokingly asked “Could ChatGPT solve the halting problem?” So we gave that a try too. Read it; it’s short, and bizarre, and hilarious.

Finally, there’s another mystery here: why did Paramjit’s prompt produce all the code it did? My experiments with the exact same prompt provided far simpler results. You can see these here, and here (with an instance of a three-character string in the “output”), and here. This is an instance of the capriciousness syndrome; or an instance of the inconsistency within the product oracle.


At one point during all of this, my ChatGPT session expired, and I had to log in again. I went back to the list of my experiments:

Notice anything weird here? The chat.openai.com site has given a German title to one of my experiments for no reason that I can fathom; this is inconsistent with reliability.

In addition (there would be no way for you to know this from the image above on its own), OpenAI has presented the log of the experiments in a different order from the order in which I performed them! This is a problem; it’s inconsistent with history (before I logged in again, the list was in the correct time order), inconsistent within the product, and inconsistent with user desires (reliability) oracles.

The messed-up order (which I had assumed to be reliable) now requires me to retrace my steps through the experiment by looking at the changes in the code. This makes reporting and analysis more difficult, which is inconsistent with user desires (testability). So, to my surprise, I’m not only testing ChatGPT, but other aspects of the OpenAI portal besides.

And there are bugs, bad bugs, everywhere we look.


What you’re seeing from me, here, is testing of ChatGPT’s programming capabilities, not a demonstration. You’re seeing me and James perform experiments, analyze them critically, and share our results, opening them to critique.

And none of the Non-Critical AI Fanboys (NAIFs) will do that.

By one set of plausible interpretations of the requirements, ChatGPT’s code can work; that is, it can appear to meet some set of some requirements to some degree based on some set of assumptions.

Based on that, the NAIFs will tell you that because ChatGPT’s program can work, they’ll say “See?! Just like a real programmer can do!” The trouble is that NAIFs will take exactly the wrong interpretation from this—seeing it as a success and ignoring the problems. But a demonstration is not a test. Can work does not mean does work. And the NAIFs appear not to have the energy nor the diligence to really test their own claims.

There are dozens of ways in which output from ChatGPT’s code could be deemed incorrect. Moreover, the code contains a subtle bug that you won’t notice simply by running it a few of times: there’s no check to ensure that each generated path is unique — which is one of the few requirements in the problem statement that is actually reasonably clear! This is an instance of the incorrectness and negligence/laziness syndromes.

Then the NAIFs will say “Well! If you want something different, you should use a better prompt!” I agree. But notice that ChatGPT never makes this suggestion; never questions the prompt; never asks for context.

Or keep asking until you’re happy!” This is exactly the difference between a doctor and a drug dealer: the doctor wants to act in your best interest. The drug dealer doesn’t care what your best interests might be, as long as you’re happy. And as long as the money keeps coming in.

The trouble is that you don’t know that you are justifiably happy without testing and analysis. And we know the NAIFs aren’t doing that, or else they’d moderate their claims.

ChatGPT turns programming into a game of “bring me a rock”; the programmer, in the role of manager, saying “bring me a rock”, followed by cycles of ChatGPT as the hapless intern saying “Certainly! Here’s a rock!”, and the programmer saying “No, that rock has a problem; bring me another rock without that problem.”

The trouble is that “another rock without that problem”, in the form of a new version of the code, may introduce new problems that weren’t in earlier versions, or may reintroduce old problems that were addressed in newer versions.

Is ChatGPT useful?

Like everything based on AI, ChatGPT is most useful as a lens and a mirror on the biases in our training data and our experience of the world. AI shows us what we’re like.

I would say that ChatGPT is very useful here as a learning tool. To start with, it’s a patent example of what bad programmers do with vague, imprecise, or ambiguous requirement statements. They rush in and write some code (maybe with fake output, to boot). If they get things wrong, we must correct them, they apologise, and they try again and the cycle continues until we get something we think we like — or until we give up in frustration.

One key difference between ChatGPT and a real programmer is that ChatGPT forgets about its prior experience with us when we start a new session. The real programmer tends to learn something from each experience; and we can hold a real programmer responsible. If the programmer is completely unable to learn from experience, the relationship doesn’t last very long.

ChatGPT also provides us with a target-rich example of an opaque and unreliable system. This in turn affords the opportunity to learn how to test in a highly exploratory way, feeding the outcome of each small experiment into ideas for the next one.

People are producing systems based on the underlying models that OpenAI offers through its marketplace. Are those systems being tested? This experiment, and our report, also provides us with an example of how to test those systems, and how to challenge the unsupported claims of the NAIFs. That’s important, because businesses are making huge investments in technology that — at the moment — we can show to be fundamentally unreliable in all kinds of ways.

Testing allows us the opportunity to learn about the truth of things. Such truths allow us to make better decisions: is it safe or risky to apply technology, or to release a product? Are there problems? If so, what improvements would we have to make to address those problems? Are the risks too large, or the problems too difficult, to make it worthwhile to put money into them?

Let’s learn something worthwhile from all that.

Leave a Comment