
Evaluating the Chatbots

Is ChatGPT getting dumber? This paper raises the question; this blog post questions the conclusions; and this article has more to say.

That’s not a very useful question, because “dumber” is not exactly a property of ChatGPT (or anything else). It’s a set of relationships between ChatGPT’s behaviour; people’s notion(s) of dumb and smart; and the context. Evaluating that requires a complex set of perspectives, values, and social judgements.

For instance: if a robot were to attain the capability to light a paper match, we might say that it’s smart. If the robot lit a match within a couple of feet of a pump at the gas station, we’d probably say it was dumb.

If a robot did exactly that to incinerate a school bus full of suspected terrorists, some people might say that was smart, and other people (who believe in due process) might say it was dumb. And that evaluation would depend on things we might find out later. Maybe our suspicions were wrong, and it turns out that the school bus wasn’t full of terrorists, but of high-school students; the terrorists were in the other bus.

“Smartness” and “dumbness” are subject to the Relative Rule and to the Unsettling Rule. Maybe changes to ChatGPT that make it “dumber” make it safer, which is, in the long run, smarter. Safety, too, is a set of emergent relationships between product, people, and context. As Ted Williams once said, “If you don’t think too good, don’t think too much.”

Moreover, with ChatGPT, the temperature setting (that is, roughly, the degree of predictability of the output) influences the nature of the replies it offers. At higher settings, the increased randomness affords more surprising replies that we might find strikingly insightful. Higher still, and ChatGPT starts to sound like a college intern on LSD, or Lucky’s monologue from Waiting for Godot.
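For readers who like to see the mechanism, here is a minimal sketch of how a temperature parameter typically reshapes the distribution a language model samples its next token from. This is not OpenAI’s implementation; the function name, the toy logits, and the tiny vocabulary are all illustrative assumptions, but the temperature-scaled softmax is the standard idea.

```python
# Illustrative sketch only — not ChatGPT's actual code. Shows how temperature
# typically rescales a model's raw scores (logits) before sampling a token.
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Sample one token from raw scores, with temperature-scaled softmax."""
    # Lower temperature sharpens the distribution (more predictable output);
    # higher temperature flattens it (more surprising, eventually incoherent).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Toy example: at temperature 0.2 the top-scoring token almost always wins;
# at 1.5 the unlikely tokens get real probability mass.
logits = {"the": 4.0, "a": 3.2, "banana": 0.5}
print(sample_next_token(logits, temperature=0.2))
print(sample_next_token(logits, temperature=1.5))
```

The point of the sketch is simply that “smart” and “dumb” replies can come out of the very same model, depending on how that one dial is set.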

Or maybe it’s because ChatGPT is really good at providing answers that sound good, rather than answers that are actually good. Over time, experience reveals more problems than people noticed when the product debuted. As we learn more about a system, our bullshit detectors become more finely tuned.

“When GPT-4 came out, we were still all in a place where anything LLMs could do felt miraculous,” Willison told Ars. “That’s worn off now and people are trying to do actual work with them, so their flaws become more obvious, which makes them seem less capable than they appeared at first.”

Simon Willison, https://arstechnica.com/information-technology/2023/07/is-chatgpt-getting-worse-over-time-study-claims-yes-but-others-arent-sure/

That’s why (dammit) it would be nice to see people actually testing these things, looking at them deeply and critically, before breathlessly talking about how, and how much, they make work faster and easier.

There can be real risk here. Don’t be a Black Swan farmer, and don’t drive the school bus blindfolded.
