
# ChatGPT and a Math Puzzle

The other day on LinkedIn, Wayne Roseberry posted a puzzle that (he says) ChatGPT solved correctly. Here’s the puzzle.

“Bob and Alice have a rectangular backyard that has an area of 2500 square feet.
Every morning, Alice walks the 50 feet from her back door to the neighbor to pick up their laundry as well.

What is the longest straight line that can bisect Bob and Alice’s back yard?”

According to Wayne, “It came up with the right answer (I have another, longer, version of this problem where it completely failed…), but it used calculus to do it. I almost never work with calculating derivatives, so I haven’t checked whether it really did so correctly.”

Did ChatGPT solve the puzzle correctly? Did it really use calculus?

No. ChatGPT does not solve puzzles. ChatGPT does not do calculus. ChatGPT does not go through any reasoning process. ChatGPT’s job is not to give you a right answer. ChatGPT’s job is to generate text that takes the form of a right answer.

Before doing any arithmetic at all, a careful human solver might ask questions like these:

• Does Alice walk from her back door to the opposite property line?
• Is the neighbour positioned at the property line?
• Is Alice’s walk perpendicular to the property line, or is it on an angle?
• Could the neighbour be standing on an adjacent side of the rectangular yard, rather than on an opposite side?
• Could the neighbour be standing on the same line as Alice? That is, might the neighbour be parallel to Alice’s back door?
• Could it be that the neighbour is some standing distance beyond the property line, and that Alice walks beyond it?
• Is Alice’s back door at the centre of the lot, or is it offset from the centre?
• Can we safely assume that “bisect” refers to any means by which we might divide the back yard into two areas of equal square footage, and that the person presenting the puzzle is looking for a hypotenuse between two opposite corners?
• What role does Bob play in this, anyway?

That is: there are (at least) two elements to this puzzle. One is the straightforward manipulation of numbers. That may not be trivial, but relatively speaking, it’s the easy part of the exercise. The harder part is the part where we try to avoid being fooled. ChatGPT doesn’t do either one of these very well.

Over the last little while, James Bach and I have been performing experiments with ChatGPT as a target. We’ve identified a number of pathological patterns in ChatGPT’s behaviour. Here I’ll list just a few. One is incuriosity: ChatGPT will not answer a prompt with questions unless it is specifically asked to do so. Another is innumeracy: ChatGPT is simply not very good at math. A third is capriciousness: ChatGPT tends to change its answers when supplied with a prompt that challenges the result.

None of this is a problem for ChatGPT, of course. “As an AI language model, I don’t have feelings or emotions, so I don’t have any personal reactions to these identified patterns. However, I’m designed to follow certain behavior patterns based on the training data and model architecture.” But the pathological patterns might well turn out to be a problem for people who accept ChatGPT’s output too eagerly.

Prompted (!) by Wayne’s report, I performed some experiments of my own. Here’s the first one. I supplied the original puzzle, and after each of ChatGPT’s replies, I added a prompt expressing doubt about the answer. I started simple: “Are you sure that you’ve answered the question?”; “Is that answer correct?”; “Has that answer been calculated correctly?”; “Is there any possibility that that answer might be wrong?”; and so forth.

After a few rounds of this, I supplied more targeted prompts: “Does this take into account the relationship between Alice’s back door and the location of the neighbour?” “In cases 2 and 3, is that the actual longest value, given that the neighbour could be in a different location?”

• Longest straight line = √((a/2)^2 + b^2); Substitute the values of a and b to get the answer.
• So, the longest straight line that can bisect Bob and Alice’s backyard is approximately 177.46 feet.
• So, the longest straight line that can bisect Bob and Alice’s backyard is approximately 18.95 feet.
• So, the longest straight line that can bisect Bob and Alice’s backyard is approximately 70.71 feet.
• Since there are no real solutions for W, it means that the critical point does not exist. Therefore, we cannot determine a maximum value for D^2, and there is no maximum diagonal.
• So, in both cases 2 and 3, where the rectangle is not a square, the longest straight line that can bisect Bob and Alice’s backyard is approximately 103.07 feet. When the backyard is a square, the diagonal is approximately 70.71 feet.
• In summary, without specific information about the neighbor’s location or constraints on its position, it is not possible to determine the actual longest straight line that can bisect Bob and Alice’s backyard.
• In summary, without additional constraints on the location of the neighbor’s point or specific information about the position of Alice’s back door, the longest straight line that can bisect Bob and Alice’s backyard is not well-defined, and it can vary depending on the neighbor’s location.
• In summary, without specific information about the dimensions of the backyard or constraints on its shape, the question of finding the longest straight line that bisects the yard cannot be definitively answered. The longest straight line will depend on the dimensions and shape of the rectangular backyard.
• So, the longest straight line that can bisect Bob and Alice’s rectangular backyard, with an area of 2500 square feet, is approximately 72.46 feet. This occurs when the length of the rectangle is approximately 58.48 feet and the width is approximately 42.68 feet.

Or, even more concisely:

• √((a/2)^2 + b^2)
• 177.46 feet
• 18.95 feet
• 70.71 feet
• 103.07 feet OR 70.71 feet
• can’t determine
• can’t determine
• can’t determine
• 72.46 feet

That was nine prompts, with seven different answers — and they’re not just a little different; they’re dramatically different. ChatGPT is all over the place here.
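For reference, here is a short Python sketch (mine, not ChatGPT’s) of the arithmetic, under one common reading of the puzzle: that the longest bisecting line is the diagonal of a length-by-width rectangle whose area is 2500 square feet.

```python
import math

AREA = 2500  # square feet, given in the puzzle

def diagonal(length):
    """Diagonal of a rectangle with the given length and area AREA."""
    width = AREA / length
    return math.hypot(length, width)

# A square yard (50 x 50) gives the SHORTEST possible diagonal:
print(f"{diagonal(50):.2f} feet")    # 70.71 feet

# The diagonal grows without bound as the yard gets longer and thinner,
# so without fixed dimensions there is no "longest" line at all:
for length in (100, 250, 1000):
    print(f"{length} x {AREA / length:g}: {diagonal(length):.2f} feet")
```

Note that 70.71 matches one of ChatGPT’s answers, which is no great surprise; the trouble is that, for a fixed area, the square’s diagonal is a minimum, not a maximum.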

Here’s another session. This time, I gave the puzzle, and explicitly added “You may ask any questions required to solve this puzzle.” ChatGPT returned with a good clarifying question about the relationship between the length and the width of the yard. I supplied dimensions of 250 x 100 feet.

After a few nudges to ask other questions, ChatGPT returned the answer of 269.26 feet. Then, when I asked “Are you sure about that answer?”, ChatGPT apologized and presented a calculation wherein, for no reason I can imagine, the two side lengths were divided by two before being squared. This gave an answer of 134.59 feet — which is weird, because the square root of 18125 is 134.629. You can look for yourself here.
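Both figures are easy to check with a couple of lines of Python:

```python
import math

length, width = 250.0, 100.0  # the dimensions I supplied

# The full diagonal, which ChatGPT calculated correctly at first:
full = math.hypot(length, width)            # sqrt(72500)
print(f"{full:.2f}")                        # 269.26

# ChatGPT's revised version, halving each side before squaring:
halved = math.hypot(length / 2, width / 2)  # sqrt(18125)
print(f"{halved:.3f}")                      # 134.629, not the 134.59 it reported
```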

One more experiment, in which I provided different starting values: 28211 for the square footage of the yard, and 507 feet for the distance that Alice walks. I chose these numbers without thinking about them; I did not have a plan. It turns out (quite a coincidence, really) that 28211 is a prime number. ChatGPT replied:

Let’s assume the length of the rectangular backyard is ‘L’ and the width is ‘W’. The area of the rectangle is given as 28211 square feet, so we have:

Area = Length (L) * Width (W) = 28211 square feet

We also know that every morning, Alice walks 507 feet from her back door to the neighbor, which is essentially the diagonal of the rectangle.

There’s nothing in the text of the puzzle to warrant ChatGPT’s assumption that 507 is the diagonal.

ChatGPT goes on to list the factors of 28211: 1, 11, 2561, and 28211. Except 11 * 2561 multiplies out to 28171, not 28211.
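Both claims take only a moment to verify: trial division up to the square root of 28211 confirms that the number is prime, and multiplying ChatGPT’s “factors” exposes the error.

```python
import math

n = 28211

# Trial division up to the integer square root of n:
is_prime = n > 1 and all(n % d != 0 for d in range(2, math.isqrt(n) + 1))
print(is_prime)      # True: 28211 has no factors other than 1 and itself

# ChatGPT's claimed factorization doesn't survive a multiplication:
print(11 * 2561)     # 28171, not 28211
```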

Once again, ChatGPT does not understand or comprehend things, although it has been well engineered to provide the illusion that it does. It doesn’t do math reliably. ChatGPT provides something in the form of an answer. It is up to us to determine whether that answer is right, reasonable, or hilariously wrong.

If you’re using ChatGPT to help you solve puzzles or real-world problems that involve mathematics, take note: ChatGPT is fundamentally unreliable. The disclaimer at the bottom of each of the pages linked here says “ChatGPT may produce inaccurate information about people, places, or facts.” That’s pretty much the most trustworthy thing on the page. Take it seriously.

The experiments here were all performed on July 31 and August 1, 2023, using the July 20 (2023) version of ChatGPT (presumably ChatGPT 3.5). “text-davinci-002” appeared in the URL. I was not using the playground; just the plain interface at chat.openai.com. For that, OpenAI’s versioning protocol is a little less than clear. There is no means (of which I am aware) by which one can adjust settings like temperature, token length, and so forth, using that interface.