
Language Models

“Language models” is typically interpreted as a compound noun, something that models language. A model is an idea in your mind, an activity, or an object (such as a diagram, a list of words, a spreadsheet, a role play, a person, a toy, an equation, a demonstration, or a program…) that represents (literally! re-presents!) another idea, activity, or object (such as something complex that you need to work with or study).

What happens if we consider “models” as a verb, though? We get a simple declarative sentence, with an implied object: language models something. Language models our thinking, or language models our world. Language models!

As with any model, replacing or modifying one of its elements can suggest something interesting. Doing that can help us to refine our answers to two big questions that we ask when we’re testing. One is a question that we ask ourselves, as testers, continuously, expansively, and tacitly about the product: “Is there a problem here?” The other is a question that we ask our clients, periodically, specifically, and explicitly about the problems and the product: “Are you okay with all this?”

So, here’s a heuristic: search through any statement or claim about the product. When you see “AI” (or “artificial intelligence”, “AI Model”, “Large Language Models”, “generative AI”,…) try replacing it with “software” and see if it makes a meaningful difference to your perception of value, or of risk.

Try it! Here are some examples.

  • “AI in self-driving cars is used for sensing, decision-making, predictive modeling, and natural language processing.”

    “Software in self-driving cars is used for sensing, decision-making, predictive modeling, and natural language processing.”

  • “A McKinsey study claims that software developers can complete coding tasks up to twice as fast with generative AI.”

    “A McKinsey study claims that software developers can complete coding tasks up to twice as fast with software.”

  • “UnitedHealth uses AI model with 90% error rate to deny care, lawsuit alleges”.

    “UnitedHealth uses software with 90% error rate to deny care, lawsuit alleges”.

  • “Large Language Models can handle all sorts of jobs such as recognizing and generating text with amazing accuracy.”

    “Software can handle all sorts of jobs such as recognizing and generating text with amazing accuracy.”

The replacement feels pretty familiar, doesn’t it? People have been saying the same kinds of stuff about software practically forever.

Now: using the same sentences, replace “AI” (“AI Model”, “Large Language Models”, “generative AI”) with “algorithms whose behaviour is obscure, not written by a conscious or conscientious person, and for which we don’t have source code”.

  • “AI in self-driving cars is used for sensing, decision-making, predictive modeling, and natural language processing.”

    “Algorithms whose behaviour is obscure, not written by a conscious or conscientious person, and for which we don’t have source code in self-driving cars is used for sensing, decision-making, predictive modeling, and natural language processing.”

  • “A McKinsey study claims that software developers can complete coding tasks up to twice as fast with generative AI.”

    “A McKinsey study claims that software developers can complete coding tasks up to twice as fast with algorithms whose behaviour is obscure, not written by a conscious or conscientious person, and for which we don’t have source code.”

  • “UnitedHealth uses AI model with 90% error rate to deny care, lawsuit alleges”.

    “UnitedHealth uses algorithms whose behaviour is obscure, not written by a conscious or conscientious person, and for which we don’t have source code with 90% error rate to deny care, lawsuit alleges”.

  • “Large Language Models can handle all sorts of jobs such as recognizing and generating text with amazing accuracy.”

    “Algorithms whose behaviour is obscure, not written by a conscious or conscientious person, and for which we don’t have source code can handle all sorts of jobs such as recognizing and generating text with amazing accuracy.”

Notice how changes in the words change your perception of risk. Language models!
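
If you want to try this substitution heuristic on a larger pile of claims, headlines, or marketing copy, here’s a minimal sketch in Python. The term list and the sample claim are just lifted from the examples above; adapt them to whatever you’re reading.

```python
import re

# Terms taken from the examples above; extend the list for your own context.
AI_TERMS = ["artificial intelligence", "Large Language Models",
            "generative AI", "AI model", "AI"]

def substitute(claim: str, replacement: str) -> str:
    """Swap AI-ish terms in a claim for something else, longest term first."""
    pattern = r"\b(?:" + "|".join(
        re.escape(t) for t in sorted(AI_TERMS, key=len, reverse=True)) + r")\b"
    return re.sub(pattern, replacement, claim, flags=re.IGNORECASE)

claim = ("A McKinsey study claims that software developers can complete "
         "coding tasks up to twice as fast with generative AI.")

print(substitute(claim, "software"))
print(substitute(claim, "algorithms whose behaviour is obscure, not written by a "
                        "conscious or conscientious person, and for which we don't "
                        "have source code"))
```

It’s only a toy, of course; the point is the reframing, not the tooling.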

What’s our strategy for testing systems that involve AI? Easy, in one sense. Based on the first search-and-replace exercise above, apply a long-standing, solid, overarching strategy for testing software; any software.

Approach the product to learn about it any way you can: read about it; talk about it with the development team; learn the domain. Then learn about the actual product through experiencing, exploring, and experimenting with it. As you do so, give it a set of challenges. Start with some easy ones. As you do that, examine and evaluate what you see or hear. Ask yourself continuously, expansively, and tacitly, “Is there a problem here?”

With that stance, you are more likely to find problems. The product may fall over at the first hurdle. If you don’t notice any problems, look more deeply, try harder challenges, look in other places, or look from other angles. Make sure that you have access to and apply sufficient expertise to notice problems. Make sure you have good heuristics for covering the product with testing, for recognizing problems, and for identifying quality criteria that might be threatened.

Now apply the second search-and-replace above. The results should remind you that you’re dealing with a black box somewhere inside the process that the software is modeling. The developers may know something about the processes around the model — the ones that might filter or modify the input, or that might modify or massage the output. The workings of the model itself will be a black box, even to the developers, and its output will likely be inconsistent and unreliable.

The inconsistency is by design; human-like variability in the output is a feature of Large Language Models. The key here is to avoid treating one happy outcome as evidence of consistently happy outcomes.

Notice how even the same prompt, issued multiple times, can return subtly different responses. Though subtle, those differences can represent serious problems. We cannot see the scope of the problems without lots of increasingly challenging and varying trials and careful scrutiny of the output. As we do that, we’ll need to apply LLM-specific risk models in addition to all the usual threats to quality criteria.
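
Here’s one way to start those trials, as a minimal sketch. The ask_model() function is a hypothetical placeholder for however you call the model under test; swap in your own client code.

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: call the LLM you're testing and return its reply."""
    raise NotImplementedError("wire this up to your own model or API client")

def repeated_trial(prompt: str, runs: int = 20) -> None:
    """Issue the same prompt many times and summarize how much the answers vary."""
    responses = [ask_model(prompt) for _ in range(runs)]
    tally = Counter(responses)
    print(f"{len(tally)} distinct responses in {runs} runs")
    for text, count in tally.most_common():
        print(f"{count:3d} x  {text[:80]!r}")

# repeated_trial("List the key variables in this program and say how to test it.")
```

Counting distinct strings is the crudest possible check: semantically equivalent answers show up as “different”, and subtly wrong ones as “same”. Careful human scrutiny of the output stays essential.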

In a test I performed the other day, the chatbot was “analyzing” a program and how to test it, providing vague and not very astute output. It did mention four key variables, and it did so consistently for a while. Then it suddenly “forgot” the name of one variable and changed it to something else. That sort of behaviour can be disastrous for anything that requires accuracy and consistency — especially generated code that matters.
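
A crude guard against that kind of “forgetting”, assuming the same hypothetical ask_model() placeholder as in the sketch above and some invented variable names: re-issue the prompt and flag any response that drops a name you expect to keep seeing.

```python
def ask_model(prompt: str) -> str:  # same hypothetical placeholder as above
    raise NotImplementedError("wire this up to your own model or API client")

# These variable names are invented for illustration.
EXPECTED_TERMS = ["order_total", "tax_rate", "discount", "customer_id"]

def find_dropped_terms(prompt: str, expected: list[str], runs: int = 20) -> None:
    """Flag any run in which an expected term has quietly gone missing."""
    for run in range(runs):
        response = ask_model(prompt)
        missing = [t for t in expected if t.lower() not in response.lower()]
        if missing:
            print(f"run {run}: response no longer mentions {missing}")

# find_dropped_terms("Describe how to test this program.", EXPECTED_TERMS)
```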

In another test, the chatbot cheerfully gave me source code that included a class method that didn’t exist. When I prompted the chatbot for a correction, it “recognised” the error and offered a different class with the same method — one that also didn’t exist. To evaluate the LLM’s output, you’ll need to review it AND to test the resulting code. As you do so, notice how little of the generated code contains error handling at critical points. (Wow! Just like a real programmer! Okay, a really bad programmer.)
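
One cheap smoke test for that particular failure, sketched under the assumption that the generated code is Python and that you know which class it’s supposed to be calling into: parse the code and check that the methods it calls actually exist on that class. It’s a heuristic at best (it naively checks every attribute-style call against one class), and it’s no substitute for reviewing and actually running the code.

```python
import ast

def undefined_methods(generated_source: str, target_class: type) -> set[str]:
    """Return method names that the generated code calls but the class doesn't define.
    Crude: every attribute-style call is checked against one class, so expect
    false alarms in code that works with several different objects."""
    tree = ast.parse(generated_source)
    called = {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }
    return {name for name in called if not hasattr(target_class, name)}

# str has upper() but no shout(), so only the invented method gets flagged.
snippet = 'greeting = "hello".upper().shout()'
print(undefined_methods(snippet, str))   # {'shout'}
```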

As I write, there’s a story about an LLM “hallucinating” a piece on a widely-used news site, generating a story about a murder that never happened. A number of people, including the local police, were, unsurprisingly, not happy about this.

The second substitution above and the intrinsic unreliability issues should remind us that, for important stuff, even after we’ve tested, the output from LLMs must be examined every time. LLMs turn everyone — including our customers, and often their customers — into testers. Or, if not testers, potential victims. Are your testing client and your business okay with that risk?

So: periodically check in with your client. The client might be interested in test cases, but probably not. More likely, your client will be interested in risk, and in problems that threaten the value of the product. Therefore: describe the product and the problems in it, and how those problems affect the business. If there are obstacles to your testing (things that slow it down, make it harder, or make it less valuable, relevant, or timely), note those too.

When it comes to large language models or any other form of AI, note the elevated risk associated with algorithms whose behaviour is obscure, not written by a conscious or conscientious person, and for which we don’t have source code.

Then ask the client, specifically and explicitly, “Are you okay with all this?”

This post was revised 2024-06-08. It may get revised again.
