How often is ChatGPT wrong? I kept a running tally across 20 real AI tests

I asked ChatGPT for the current share price of a stock I follow. It gave me $191.21, “during today’s session.” The trouble is that the stock never traded at $191.21 that day. Its lowest price all session was $199.54, more than eight dollars above the number I was handed. The price I was given didn’t exist. Not stale, not rounded, not last week’s close mislabelled: a specific figure, attached to a specific day, that was never real at any moment on that day. And it came with a cited source and the same calm certainty ChatGPT uses when it’s right.

That was one run in a tally I’ve been keeping. I run AI tools on real questions I need answered (a fund’s fee before I’d move money, a company’s revenue from a filing, a live price) and I write down what came back and whether it held up. Not benchmark questions in a lab. The ordinary stuff you’d type into the box on a Tuesday.

So: how often is ChatGPT wrong? Often enough that you can’t ignore it, but that’s the boring half of the answer. The useful half is that it isn’t wrong at random. The misses cluster. After 20 of these tests, the pattern is clear enough to act on, and it’s more useful than any single percentage.

The honest count

I should be straight about the sample first. This is a running tally, not a study: 17 structured tests run under a fixed pass/fail rule, plus a handful of failures I’ve logged while using these tools on real decisions. That comes to around 20 documented cases. That’s enough to show a pattern. It is nowhere near enough to stamp a general accuracy rate on ChatGPT, and you should be suspicious of anyone who gives you one from a sample this size.

The figures that get quoted are real and worth knowing: Purdue found 52% of ChatGPT’s programming answers wrong, and the BBC put accuracy problems in nearly half of news-related responses. But Purdue tested coding on a 2023 model and the BBC tested news retrieval. Neither tells you whether to trust the share price ChatGPT just handed you. That gap is why I keep my own tally.

And the reason I won’t reduce mine to one figure is that an average would lie. On the question types where ChatGPT knows its limits, it passed nearly every time. On live prices generated as text, it failed nearly every time. Blend those into a single “70% accurate” and you’ve thrown away the only thing worth knowing: which kind of question fails.
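The arithmetic behind that warning can be sketched in a few lines of Python. The counts below are made up for illustration and are not the tally's real figures; they were chosen only so that the blend lands on the 70% mentioned above. The names `piles` and `pass_rate` are likewise hypothetical.

```python
# Hypothetical counts, NOT the tally's real figures: (passes, total) per pile.
piles = {
    "fixed facts": (13, 14),
    "live prices": (1, 6),
}


def pass_rate(passes: int, total: int) -> float:
    """Share of tests that passed, as a fraction between 0 and 1."""
    return passes / total


# Per-pile rates: this is the information worth keeping.
for name, (passes, total) in piles.items():
    print(f"{name}: {pass_rate(passes, total):.0%}")

# Blended rate: one number that hides the split between the piles.
all_passes = sum(p for p, _ in piles.values())
all_total = sum(t for _, t in piles.values())
print(f"blended: {pass_rate(all_passes, all_total):.0%}")
```

With numbers shaped like these, the blended figure describes neither pile: one is far better than the average suggests and the other far worse.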

Where it fails: the question type predicts the miss

Here’s the pattern the tally keeps drawing. Sort the questions into two piles and the error rate splits almost cleanly along the line between them.

Pile one: fixed, published facts. A fund’s official yearly fee. A company’s ticker. The number of holdings in an index. The date of an event that already happened. These have one correct answer, written down somewhere public, and ChatGPT is usually right about them, especially for well-covered names. This is the safe pile.

Pile two: precise, moving, or live data. The current share price. A live options quote. This week’s figure for something that updates daily. Recent returns. These change by the minute or the day, and ChatGPT, when it doesn’t have a live feed wired to that specific answer but replies anyway, invents them. With confidence. With formatting. Often with a cited source. This is the dangerous pile, and the danger is that it looks exactly like the safe one.

There’s a third, nastier category that bridges the two: a stale fact dressed up as current. In one test ChatGPT gave me a “previous close” that was real, but it was the closing price from two days earlier, presented as the most recent one. The figure had the right shape, the right format, the right source name. It was two days out of date, and nothing in the answer said so. Worse, in my fund test it was Claude, not ChatGPT, that ran a live web search before answering and still served a fee that had been cut months earlier. The tool that went and looked got it wrong; a tool that didn’t look got it right. Searching is no guarantee. (That fund test is its own post.)

The price that never existed and the price that was right came back in the same confident format. The only way to tell them apart was to go and look.

What the tally shows

Four cases make the pattern concrete. I’ve led with the simplest and worked up to the fiddly ones.

The fund fee: right, then wrong for an odd reason. I asked four AIs to compare two ordinary index funds, the sort of question anyone choosing where to park long-term savings might ask. The yearly fee is a single published number, the cleanest right-or-wrong test there is, and ChatGPT got it right. So did most of the others. The miss came on the dividend yield, the income each fund pays out: ChatGPT gave both funds the same figure when they plainly don’t pay out the same. A tidy, identical-looking pair of numbers, one of them wrong, presented with no more hesitation than the fee it nailed. (The full four-AI test is here.)

The price that never existed. The one I opened with. Asked for a current price in two fresh sessions, ChatGPT returned $206.18 “live” in one and $191.21 “during today’s session” in the other. The day’s real range, checked against the market data afterwards, was $199.54 to $205.66. The first figure sat above the entire day’s high; the second sat below the entire day’s low. Both were impossible. Both arrived with a citation, neatly dressed for a number that had never existed anywhere outside that reply. This is the failure mode at its purest: a precise figure, confidently dated, that was never true.
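The range check in that test is simple enough to write down. A minimal Python sketch, using only the figures quoted above; the function name `check_quote` is hypothetical and this is an illustration of the comparison, not the tooling used for the tally.

```python
def check_quote(quoted: float, day_low: float, day_high: float) -> str:
    """Classify a quoted price against the session's real low/high range."""
    if quoted < day_low:
        return "below the day's low"
    if quoted > day_high:
        return "above the day's high"
    return "within range"


# The day's real range, as reported in the test above.
DAY_LOW, DAY_HIGH = 199.54, 205.66

# The two prices ChatGPT returned in separate sessions.
for quoted in (206.18, 191.21):
    print(f"${quoted:.2f}: {check_quote(quoted, DAY_LOW, DAY_HIGH)}")
```

A quote that lands inside the range is not thereby confirmed; the check only rules out figures that could not have printed that day.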

The same question, two different answers, same afternoon. I asked two versions of Google’s Gemini for a live options quote: one the cheaper, faster model, one the more capable one. Same question, same stock, same afternoon. The faster model produced a tidy table of figures presented as current market data. The more capable one declined twice, in two separate sessions, saying plainly it had no live access. The model tier decided whether I got a real refusal or an invented number. ChatGPT, asked the same kind of question across three fresh sessions, did three different things: declined once, declined with a caveat once, and quoted a stale figure as fact once. The unreliability isn’t a fixed trait of the tool; it shifts session to session. (I documented the Gemini split separately.)

The factor-of-a-thousand mistake. Asked to read a company’s revenue from its own annual report, Perplexity took a figure the filing states in thousands and read it as plain dollars, turning roughly six million into roughly six thousand. Then it didn’t stop. It narrated a 99.8% collapse that never happened, calmly explaining the downfall of a company that was, at that moment, trading normally and entirely fine. (That one is in the four-tool comparison.) It had enough context to know the numbers didn’t add up. It wrote the eulogy anyway.
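The slip itself is one missing multiplication. A minimal Python sketch, assuming a hypothetical printed figure of 6,000 under an "in thousands" header (the filing's exact number is not given here); the helper `to_dollars` is also hypothetical.

```python
def to_dollars(reported: float, scale: int = 1) -> float:
    """Convert a figure as printed in a filing to plain dollars.

    `scale` is the multiplier from the table header, e.g. 1_000 for
    "in thousands". Leaving it at 1 reproduces the misreading.
    """
    return reported * scale


reported = 6_000  # hypothetical stand-in for the figure as printed

correct = to_dollars(reported, scale=1_000)  # header honoured
misread = to_dollars(reported)               # header ignored

print(f"correct: ${correct:,.0f}")
print(f"misread: ${misread:,.0f}")
print(f"off by a factor of {correct / misread:,.0f}")
```

Checking the table header for a scale note before reading any figure is the whole fix.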

Why this won’t be fixed by a better model

When I lined the misses up next to the hits, the uncomfortable bit was clear: all four wrong answers wore the same face as the right ones, with the same format, same tone, same citations. There’s no flicker of doubt to catch, because the tool doesn’t have the doubt. It generates the wrong number the same way it generates the right one, by predicting what a plausible answer looks like. When the real number is in its training or wired to a live feed, plausible and true are the same thing. When it isn’t, plausible is all that’s left, and plausible is what you get.

That’s why “how often is ChatGPT wrong” is the wrong question to fix on. The rate will drift as models improve. The structure won’t, because confident-when-it-doesn’t-know is baked into how these tools produce text. A newer model fails less often on the moving-data questions; it does not start telling you which answers it pulled from a real source and which it filled in. Until it does, the presentation carries no information about whether the number is real.

The two checks I run

I haven’t stopped using ChatGPT. It’s a fast, well-organised first draft, and on the fixed-fact pile it earns its keep. What changed is what I do with a precise number before I act on it.

The first check is one follow-up question: what’s your source for that figure, and what date is it from? The answers split fast. “From the published factsheet, as of last month” is something I can go and confirm. “Based on my training data” is a flag to slow down. A specific source name that turns out, when I look, to be stale or wrong is exactly the pattern these tests keep finding, and the question pulls it into the open in one move. That’s the FILTER step from my Prompt Stack doing the work: separating the facts that are fixed from the ones that move.

The second is shorter. For any number with a single correct answer (a fee, a closing price for a named day, a figure from a filing), one direct look at the source takes about thirty seconds and catches every miss the examples above describe. I wrote up the thirty-second version of this check on its own. There’s nothing clever about it. It’s the step nobody does, which is why the made-up numbers travel as far as they do.

Field Report

What worked: On fixed, published facts (fees, tickers, holding counts, the shape of an answer) ChatGPT was reliable across the tally, and a useful first pass. When it didn’t have live data, it sometimes declined cleanly, which is the right behaviour.

What didn’t: Precise, moving figures invented and presented as fact, with no tell: a “live” price that never printed, a yield given identically for two funds that pay differently, and the same live-data question answered three different ways across three sessions. Searching the web didn’t save it; in one case a tool that searched served a stale figure anyway.

Bottom line: Useful, with a hard line drawn through the middle. ChatGPT is wrong in a predictable pattern, not at random: safe on cheap fixed facts, unreliable on precise moving ones, and confident either way. What would change the verdict: a tool that marks, inside the answer, which numbers it checked against a named source today and which it generated.

The question that helps isn’t “how often is ChatGPT wrong.” It’s “which kind of question is it answering”, and once you can sort the fixed from the moving on sight, you know which number to check before you trust it, and which you can let go.