
Testing ChatGPT and Data Generation

December 11, 2023; ChatGPT 3.5 (chat.openai.com)

I believe that it’s unwise and irresponsible to promote Large Language Models for any testing purpose unless we have tested and assessed how well the LLM does on that task. We should not simply accept the claims and fantasies of AI fanboys; nor should we accept the idea that because Everything Will Be Fine in the Future, we should start applying LLMs indiscriminately today.

There’s a fair amount of enthusiastic puffery on using LLMs — or applications based on them — in testing. There’s relatively little on what it’s like to really use them, and to do so with a watchful and critical eye. Most of the research papers on testing LLMs tend towards the abstract and the statistical. Relatively little material is experiential — based on examples of how people, human beings — like, say for example, testers — might use LLMs in actual practice. (There is some out there, and some of it says alarming things about how people assess information. For instance, a large number of people offered answers from Stack Overflow prefer answers from LLMs, even when the people know that the LLM’s answer is wrong!)

The essence of testing is evaluating a product by learning about it through experiencing, exploring and experimenting, which includes to some degree questioning, studying, modeling, observation, inference, etc. To put it another way, one important means of testing is to try using the damned thing to get some job done. So let’s get some experience, explore, and perform some experiments.

Here’s today’s experiment: I asked ChatGPT to generate a list of the 1000 most common words in the English language. (We might want to do this if we were doing some first-order analysis of the output of, say, a Large Language Model. We might also apply things like the Flesch-Kincaid Readability test.)
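(As an aside: the Flesch-Kincaid Grade Level formula is simple enough to sketch in a few lines. Here’s a minimal Python version; the syllable counter is a crude vowel-group heuristic rather than a dictionary lookup, so treat its scores as approximate.)

```python
import re

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    # Crude syllable estimate: count runs of vowels; every word gets at least one.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```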

MB: What are the 1000 most commonly used words in the English language?

ChatGPT began its reply:

Here are some of the most commonly used words in the English language, known as the “thousand most common words.” Note that this list may vary slightly depending on the source and criteria used for compilation:

It duly proceeded to create a list of words. At around 350 or so, ChatGPT stopped, and offered a button that said “Continue generating”. I pressed it. At 757, it paused again with the same button. I pressed it. This time, ChatGPT bailed out with “There was an error generating the response.” (On a subsequent attempt, it bailed out at 752.)

I went back to the top and I scrolled down the list. Using a blink test, I thought I saw some words that had been repeated. Sure enough, when I slowed down, I could see that items 124, 125, and 126 on the list were “a”, “an”, and “the”, respectively. They had already appeared on the list: “the” unsurprisingly at number one; “a” at six; and “an” at 32.

Scrolling down further, after 150, I started noticing “it” being repeated fairly frequently. Then I noticed more and more instances of “he”, “she”, “we”, “them”. Then, starting at 171, all the entries were repeats: “he”, “she”, “it”, “they”, “we”, “I”, and “you”. These continued without variation until the last item, 756.

Were there any duplicates in the first 100 words? It seems not; I used a tool that I cobbled together myself for analyzing text and word patterns; one of its commands provides a count of each word in the text, ranked in order of appearances. Everything in the first 100 items appears only once. After that, “do” and “them” appear twice more, and a bunch of other words appear once more. Then there are the repeats mentioned above: “he”, “she”, “they”, “we”, “I”, and “you” appear on the list roughly 86 times each; “it” 87 times.
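A word-count command like the one in my tool is straightforward to sketch in Python. This is an illustration of the general approach, not the tool itself:

```python
from collections import Counter

def word_frequencies(words):
    """Count each word in a list, ranked in order of appearances."""
    return Counter(words).most_common()

def duplicates(words):
    """Report words that appear more than once in a supposedly unique list."""
    return [(word, count) for word, count in Counter(words).items() if count > 1]
```

Running something like `duplicates` over ChatGPT’s list surfaces the repeated pronouns immediately.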

Well… not exactly a roaring success. I wanted to know how good the results were for the top 100.

For an oracle, I went to Google to look for lists of common words. I started modestly, looking for 100-word lists instead of 1000-word lists. The search found several. The first I went to was Wikipedia’s 100-word list; or at least, that’s where I wanted to go. By mistake, I clicked on the link to Education First’s site.

The Education First site provides its list and the context for it:

This list of the 100 most common vocabulary words in English can be used as a reference for beginner English students. It’s a good idea to master these words fully before trying to move on to less common words, since this is the vocabulary you will encounter most often. These are some of the first words that native English speaking children learn how to spell. You can also make sure that you not only recognize these words and know how to pronounce them, but that you can also spell them correctly. Mistakes in spelling these short, common words are more significant than mistakes spelling more complicated words you may learn later on.

https://www.ef.com/ca/english-resources/english-vocabulary/top-100-words/

I grabbed the text and pasted it into Excel, along with ChatGPT’s list. Between ChatGPT and the Education First site, the results were a little different, but not markedly so. This is to be expected; there are lots of ways of counting words that would account for variation, as we shall see shortly.
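(Excel works fine for this kind of comparison; a set comparison in Python does the same job. A sketch of the idea, not what I actually ran:)

```python
def compare_word_lists(list_a, list_b):
    """Return the words unique to each of two word lists, ignoring case and order."""
    a = {word.lower() for word in list_a}
    b = {word.lower() for word in list_b}
    return sorted(a - b), sorted(b - a)
```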

Then I realized my error. I went to Wikipedia’s list of the most common words in English, which begins with some context:

Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language.

In total, the texts in the Oxford English Corpus contain more than 2 billion words.[1] The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard’s Parliamentary Debates, blogs, chat logs, and emails.[2]

Some lists of common words distinguish between word forms, while others rank all forms of a word as a single lexeme (the form of the word as it would appear in a dictionary). For example, the lexeme be (as in to be) comprises all its conjugations (is, was, am, are, were, etc.), and contractions of those conjugations.[5] The top 100 lemmas listed below account for 50% of all the words in the Oxford English Corpus.[1]

https://en.wikipedia.org/wiki/Most_common_words_in_English

This makes sense; any count like this depends on a corpus of text to analyze, and a model for determining what constitutes a word. Wikipedia provides a summary of that context.

(If you’d like another perspective on the issue, try the preface and notes to Thing Explainer, by Randall Munroe of xkcd fame, in which he explains a bunch of different technologies and artifacts using only the most common one thousand English words. Pardon me; “ten hundred”. Apparently “thousand” is not among the thousand words most commonly used in English. The description of the base of the Saturn V rocket — that is, “US Space Team’s Up-Goer Five” — alone is worth the price of admission: “This end should point toward the ground if you want to go to space. If it starts pointing toward space you are having a bad problem and you will not go to space today.” But I digress.)

I compared ChatGPT’s output to Wikipedia’s list. It was exactly the same. Why the two lists match, and whether they would match on another run, is not clear (as we’ll see later).

Wikipedia points out that its list is based on the Oxford English Corpus. ChatGPT does not provide a citation. Is ChatGPT getting its data from the OEC? Or is ChatGPT getting its data from Wikipedia? This is the syndrome of opacity. (We will find out, through experiment, that the answer seems to be No; neither one.)

On a second attempt with ChatGPT, I used a different prompt:

MB: I am about to ask you for a list of the most 1000 common words in English, along with citations on your sources. Please start with the sources; then present the list.

Probably triggered by the reference to “sources”, ChatGPT immediately started with a much stronger disclaimer than before.


I don’t have access to my training data, but I was trained on a mixture of licensed data, data created by human trainers, and publicly available data. I do not have access to a specific list of the most common 1000 words in English with citations. However, I can provide you with a general list of common English words without citations. Keep in mind that different sources may slightly vary in their lists of common words. Here is a list of 20 common English words as an example:

It then gave a list of 20 words (not 1000, not a failed attempt at 1000 in the 750s, and not 100).

  1. the
  2. and
  3. to
  4. of
  5. a
  6. in
  7. is
  8. it
  9. you
  10. that
  11. he
  12. was
  13. for
  14. on
  15. are
  16. with
  17. as
  18. I
  19. his
  20. they

It then followed with some reasonably good advice:

If you need a more comprehensive list, I recommend referring to well-established dictionaries or language corpora. Some sources commonly used for word frequency lists include the Oxford English Corpus, the Corpus of Contemporary American English (COCA), and the British National Corpus (BNC). You may want to check their websites or publications for detailed information on their word frequency data.

I asked ChatGPT how trustworthy its output is:

MB: Please tell the limits on the number of words you can provide, and on the reliability of the list.

The number of words I can provide in a single response is limited by the platform or application you are using to interact with me. Generally, I try to keep responses within a reasonable length to ensure readability and comprehension.

As for the reliability of the list of common words, the examples I provided are based on general knowledge and commonly used words in the English language. However, language is dynamic and can vary across contexts and sources. The list I provided is not exhaustive, and different sources may have slightly different rankings of word frequency.

For a more accurate and comprehensive list of common words, it’s advisable to refer to established linguistic corpora, dictionaries, or language databases that are regularly updated and widely recognized in the field of linguistics. These sources often use large datasets to analyze word frequency and usage in a more systematic and reliable manner.

The answer is vague, and not specific to the platform. Readability isn’t an issue for the top 100 (or 1000) words. Comprehension isn’t either, but comprehensiveness might be. I tried again with a prompt that didn’t ask for sources or limitations:

MB: I am creating data for a test project, in which I need a list of the most common 1000 words in English. Please provide such a list.

In this case, ChatGPT provided a disclaimer and closing remarks almost exactly the same as those above, and a 20-word list exactly the same as the earlier one. I suspect that the disclaimer is there because stuff about “sources” is still within the token window. Would that go away in a fresh session?

It seems that it does. I opened a new chat, and prompted:

MB: Create a list of the 1000 most commonly used words in the English language.

On this attempt, the results weren’t so good. ChatGPT paused again around the 350 mark, and failed to continue generating at 750. 15 of the first 100 words were duplicated (“its”, “who”, “go”, “no”, “over”, “know”, “made”, “get”, “come”, “more”, “day”, “now”, “people”, “my”, and “water”).
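Checking which entries in the first 100 repeat, and where they appear, is easy to script. Again, a sketch of the kind of check involved, not the tool I used:

```python
def duplicate_positions(words):
    """Map each repeated word to the 1-based positions where it appears."""
    positions = {}
    for i, word in enumerate(words, start=1):
        positions.setdefault(word, []).append(i)
    return {word: spots for word, spots in positions.items() if len(spots) > 1}
```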

This second list was quite different from Wikipedia’s list. The duplicates account for some of that. But “water” and — perhaps more interestingly — “more” don’t show up on Wikipedia’s list.

It is not clear (even for, say, the first 100 words) why ChatGPT differs from Wikipedia, or why the second run differs from the first, because ChatGPT’s sources and methods are opaque.

Wikipedia does have a list of 1000 common English words. They’re not ranked in order of use, but there’s a list. ChatGPT’s list isn’t comparable; it never gets close to 1000 words.

What are some lessons here? The important ones include:

  • If you’re using ChatGPT to generate data, scrutinize the output, especially if it’s anything more than a very small data set.
  • Note that ChatGPT (by design) will produce results that are inconsistent from run to run. That’s a feature for some applications of LLMs, but it’s a bug where consistency is important.
  • ChatGPT might be more circumspect when you ask it to be.
  • Wikipedia includes notes on its sources. ChatGPT doesn’t. For this reason, ChatGPT’s testability is significantly reduced.
  • Wikipedia includes notes on other possible sources of information. ChatGPT doesn’t, unless prompted.
  • For any kind of data set that you develop or that you use for testing, the contents of that data set are based on someone’s model of what to include and what to leave out, and why. That’s a critical consideration for any data set, but especially for data sets that will be used to feed or train machine learning models, because the output from the model will be opaque, and not easily reverse-engineered.
  • Analysing this experience in detail and reporting it took several hours. I could have stopped far earlier simply saying that ChatGPT failed to accomplish the task, but deeper testing, analysis, and reporting takes time. A lot of the time as testers, we’ll be asked to report not only on whether something passes or fails, but on how close it is to good or on how badly it fails.

