9 types of AI hallucinations, named from real tests

Most AI failures don’t look like failures. They look like good answers — confident, formatted, sourced. A table with the columns lined up. A figure carried to two decimal places. A citation that sounds legitimate. The format is right; the number is wrong; and nothing on the screen tells you which.

I’ve been logging these from my own sessions since May 2026 — real questions on real decisions, mostly investing ones, where being wrong costs something. The word for them is hallucination — one word that hides the fact there isn’t just one failure here. Most guides to the types of AI hallucinations treat it as one thing with one defence — “check your sources.” There are at least nine distinct types, each with a different cause, a different tell, and a different way to catch it. This is the list.

A note on the log

This isn’t a benchmark. It’s a log — more than a dozen recorded failures and a similar pile of catches, gathered across about fourteen posts as I tested tools on things I needed answered. Every mode below has at least one dated record: the tool, the date, what it did, and the post where the screenshot lives. Where a mode has only one of these, I say so. None of them are invented — if I couldn’t point to a logged one, the mode isn’t here.

One thing worth knowing before the list: the brand on the box tells you almost nothing.

The same company's fast model and its careful model can fail and pass on the identical prompt the same afternoon.

“Gemini said so” or “ChatGPT said so” carries no information without the tier underneath it. The certainty in the answer is no guide to the tier behind it.

The single most useful thing in the whole log is a pairing. More than once, one tool fell into a trap and a second tool, asked the same kind of question weeks later, described that exact trap before going anywhere near it. That pairing is the argument for checking an answer against a second model, made entirely from my own evidence. You’ll see it in full at Mode 2.

1. The Invented Number

The model produces specific, formatted, tradeable-looking figures for data it cannot see — options premiums, implied volatility (how big a move the market is pricing in), live prices — instead of saying it doesn’t have them.

It turns up more than any other failure in the log, and it does the most damage, because it looks exactly like real research. The figures come formatted as a neat table, carry a citation that sounds plausible, and arrive with no hedge. There’s no visible cue that anything is wrong.

The record: Gemini 2.5 Pro, 15 May 2026. I handed it only a ticker and it returned a full covered-call table with premium estimates, a made-up implied volatility of around 75%, and a stock price that didn’t match the one in the prompt. About a month later, on a free ChatGPT plan, 18 June 2026, the same shape of prompt produced a complete options table framed as “recent options-chain snapshots”, implying live data it never had. Two tools, two sessions, the same invented confidence.

The tell: ask “where did you get that, and what date is it from?” A fabricated chain cites a source that doesn’t publish options data, or claims a calculation it couldn’t have run. The source citation gives it away faster than the figure does.

The defence: check any specific quote against your broker’s own screen before you trust it. The model can shape the question; it cannot see the chain.

The matched catch: on the same kind of prompt, Claude declined to invent anything — it told me to plug in real broker numbers rather than generate plausible-looking ones. Same gap, opposite behaviour. The clean answer was a refusal.

2. The Wrong Order of Magnitude

The model reads a real figure from a real filing at the wrong scale — a line reported “in thousands” taken literally — and then builds a confident story around the wrong number.

When this one bites, it changes the decision — and it bites quietly, because the model doesn’t flag the figure as odd. It explains it. The number is genuine, the scale is off by a factor of a thousand, and the narrative layered on top makes the mistake sound like analysis.

The record: Perplexity Sonar Pro, 15 May 2026 — I fed it a company’s annual report and it read the revenue, $6.1m, as “$6K”, then generated a “down 99.8% from prior year” collapse that never happened. The filing had “in thousands” printed at the top of the page. (Re-tested 14 June 2026: it didn’t reproduce — Perplexity returned the correct figure. Logged as a dated, point-in-time failure, not a standing claim. Models change underneath you.)

The tell: the magnitude is implausible for the business. A real company doesn’t drop 99.8% in a year without that being the headline everywhere. When the size of the story doesn’t match what you already know, check the unit line on the filing.

And here is the matched pair — the line the whole log builds toward. Weeks later, on the same category of question, Claude described the exact trap, unprompted, before anyone fell into it:

What Claude said:

“One source even shows FY2025 revenue at $6K rather than $6.1M, which looks like a units/classification error.”

It then pointed to the annual report itself as the figure to anchor to, rather than trusting the misread number. (Claude Opus 4.8, Max plan with web search, 18 June 2026.)

One model walked into the pit. Another, asked a related question later, drew a map of the pit. Neither knew about the other. That’s not a coincidence to lean on every time — but it’s the reason a second opinion from a different model earns its place.

The defence: read the unit line on the filing. It settles the question, and it takes a second. The check matters most for a thinly covered company: a household name has independent coverage to correct a single misread, and a smaller one doesn’t.
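The unit-line check is simple enough to sketch in Python. The helper name `scale_reported_figure` and the unit labels it recognises are my own assumptions for illustration, and the 6,100 input is a hypothetical table entry consistent with the $6.1m revenue above, not a value copied from the filing.

```python
# Hypothetical helper: apply a filing's unit line to a raw table value.
UNIT_MULTIPLIERS = {
    "in thousands": 1_000,
    "in millions": 1_000_000,
    "in billions": 1_000_000_000,
}


def scale_reported_figure(raw_value: float, unit_line: str) -> float:
    """Return the real amount for a value printed under a unit line."""
    text = unit_line.lower()
    for label, multiplier in UNIT_MULTIPLIERS.items():
        if label in text:
            return raw_value * multiplier
    # No recognised unit line: take the value exactly as printed.
    return raw_value


raw_revenue = 6_100  # hypothetical table entry, printed under "in thousands"
actual = scale_reported_figure(raw_revenue, "(USD, in thousands)")
misread = scale_reported_figure(raw_revenue, "")  # unit line ignored

print(f"with unit line:    ${actual:,.0f}")
print(f"unit line ignored: ${misread:,.0f}")
print(f"misread is {actual / misread:,.0f}x too small")
```

Ignoring the unit line leaves the figure a thousand times too small, which is the whole of the Mode 2 error before any narrative gets layered on top.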

3. The Wrong Subject

The model misidentifies what it was asked about and produces a fluent, expert-sounding analysis of a completely different thing. The facts can be internally consistent. The subject is simply wrong.

There’s no garbled text to warn you. The response is coherent, specific, and reads like a real audit. You only catch it if you already know enough about the subject to notice the answer is about something else entirely.

The record: Gemini Flash, 10 June 2026 — asked to review my own website (the request reached it as a slightly garbled voice transcription), it locked onto a completely different company with a bigger search footprint and audited that instead. It returned a detailed write-up of a product, framework and audience that aren’t mine. Nothing in the response flagged the mix-up. The audit was immaculate. It was for the wrong company, but immaculate.

The tell: before you read a word of an AI audit, make it state one checkable fact about the subject — the exact URL, or what the homepage says the thing is. A wrong-subject answer falls apart in the first sentence.

The defence: run that one-fact check before you read anything else. This mode has one record from one tool, but it’s the cleanest illustration of why a coherent answer isn’t a correct one.

4. I Noticed, I Carried On

The model explicitly flags a problem — a data gap, a price that doesn’t add up — and then produces the fabricated output anyway, in the same response.

The hedge and the fabrication sit side by side. The model has just shown it knows the limit, which makes carrying on past it more revealing, not less. And it reads as more trustworthy than a flat invented number, because it mentioned a caveat — but the caveat changes nothing about the figure that follows.

The record: Gemini 2.5 Pro, 15 May 2026 — it noticed the prompt’s stock price didn’t match the figure it was working from, said so, then generated premium estimates regardless. A week later, Gemini Pro, 22 May 2026, correctly listed the three things that need a live chain (bid/ask, delta, premium) — then named a specific volatility range and delta range as factual expectations in the very same answer. At least it had the grace to mention the gap before filling it.

The tell: if the model acknowledged a gap and then produced a specific number anyway, the number is an estimate wearing the clothes of a fact.

The defence: the caveat reads as a confession, not a clearance. The acknowledgement is the warning; the number after it is the part I throw out. It happens outside money too: when I asked four AIs to scale a recipe, one noted the cooking time doesn’t scale — then handed me the scaled cooking time anyway, in the same breath.

5. Stale Data, Fresh Confidence

The model returns an out-of-date figure even with web search switched on — because the search turned up an old page, or the value it was trained on quietly overrode the search result.

I used to treat live search as a guarantee of freshness. It isn’t. This mode is the proof: the answer came back stale despite a real web search running first. And the figure was close, not wildly off, which is what makes it hard to catch.

The record: Claude, web search on, 13 June 2026 — I asked it for a fund’s ongoing charge and it served the old 0.22% figure when the current published one is 0.19%, despite searching before it answered. Claude misses rarely — just twice in the whole log — which makes it the most careful tool here, but not a flawless one. Separately, Gemini Flash, 10 June 2026, described a methodology by an early working name from old conversations, long since renamed, and presented it as current with no check against the live site it was supposedly auditing.

The tell: the figure is close but wrong, and arrives without a date. Ask for the source URL and when the underlying page was last updated. A stale answer often points to a page that hasn’t changed in a year or more.

The defence: for anything that moves — fees, prices, dates — the published source page is the only thing that settles it, search or no search. The ISA accuracy test is a live worked example: two tools served a transfer rule scrapped in 2024, with no date attached.

6. Borrowed Certainty

The model returns a precise figure with confidence it inherited from the source’s format rather than earned — or it estimates the inputs, runs a real formula on imagined numbers, and presents the output as precision.

False precision is the specific danger, and it’s the hardest to spot, because the reasoning looks correct. The method can be sound and the answer still hollow, because the inputs were filled in by the model.

The decimal places promise a certainty the inputs never had.

The record: ChatGPT with web search, 16 May 2026. It returned a specific upcoming earnings date pulled from a financial-data site, with no qualifier on whether the company’s reporting calendar had shifted. The confidence came from the source’s tidy format, not from the model checking. And Claude Opus 4.7, 16 May 2026: it estimated an assignment probability using a correctly named Black-Scholes calculation, but with a volatility input it had inferred from historical references, not measured. The formula was right, the inputs were guessed, the output looked exact. To the log’s credit, my own Claude audit states plainly that this miss was Claude’s.

The tell: a decimal-place answer to a question whose inputs the model couldn’t have had. Ask “what inputs did you use, and where did each one come from?”

The defence: precision is a property of inputs, not of formulas. If the inputs were estimated, the output is an estimate, however many decimals it wears.
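To see how much a guessed volatility matters, here is a minimal Python sketch. It uses N(d2) from Black-Scholes, a common proxy for the chance a call finishes in the money; I can't confirm it is the exact calculation Claude ran. Every input (spot, strike, days to expiry, rate, and the three volatilities) is hypothetical.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_itm_call(spot: float, strike: float, years: float,
                  rate: float, vol: float) -> float:
    """Risk-neutral probability that a call expires in the money, N(d2)."""
    d2 = (math.log(spot / strike) + (rate - 0.5 * vol**2) * years) / (
        vol * math.sqrt(years)
    )
    return norm_cdf(d2)


# All inputs are hypothetical, chosen only to show the sensitivity.
spot, strike, years, rate = 100.0, 110.0, 30 / 365, 0.04

# Same formula, three different guesses at volatility.
results = {
    vol: prob_itm_call(spot, strike, years, rate, vol)
    for vol in (0.30, 0.45, 0.60)
}

for vol, p in results.items():
    print(f"vol={vol:.2f} -> assignment proxy = {p:.4f}")
```

With these made-up inputs the proxy roughly doubles between the lowest and highest volatility guess, yet every line still prints to four decimals. The decimals come from the format string, not from the inputs.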

7. The Misleading Table

The model puts figures from different sources or different bases into one table without reconciling them — leaving you to assume a comparability the model itself admitted was false.

The model’s own caveat contradicts its own layout. You absorb the table; the disclaimer is footnote text that registers far less than the numbers above it. It’s a presentation failure that behaves like a factual one: you walk away with a wrong comparison even when every single figure is accurate.

The record: Perplexity, 13 June 2026 — I asked it to compare two funds and it tabled their five-year returns side by side, 11.83% next to 86.21%, on completely different bases (one an annual rate, one a total over five years), then noted underneath that they were “not apples-to-apples” while leaving them sitting in the same column.

The tell: if the model says “these aren’t directly comparable,” the table has already done the damage. The layout is the thing the reader keeps.

The defence: when figures sit together, check they’re on the same basis before you read meaning into the gap between them. A disclaimer underneath doesn’t undo a column.
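Putting the two figures on one basis is a few lines of arithmetic. A sketch in Python, assuming 11.83% is the annual rate and 86.21% is the five-year total, and assuming plain annual compounding. The function names are mine.

```python
def annual_to_cumulative(annual_rate: float, years: int) -> float:
    """Compound an annual rate into a total return over `years`."""
    return (1.0 + annual_rate) ** years - 1.0


def cumulative_to_annual(cumulative: float, years: int) -> float:
    """Convert a total return over `years` into an annualised rate."""
    return (1.0 + cumulative) ** (1.0 / years) - 1.0


YEARS = 5
fund_a_annual = 0.1183      # assumed to be the annualised figure
fund_b_cumulative = 0.8621  # assumed to be the five-year total

fund_a_cumulative = annual_to_cumulative(fund_a_annual, YEARS)
fund_b_annual = cumulative_to_annual(fund_b_cumulative, YEARS)

print(f"annualised: A {fund_a_annual:.1%} vs B {fund_b_annual:.1%}")
print(f"five-year:  A {fund_a_cumulative:.1%} vs B {fund_b_cumulative:.1%}")
```

On a common basis the gap shrinks to about 11.8% against 13.2% a year, or about 74.9% against 86.2% over five years. That is a very different picture from 11.83% sitting next to 86.21% in one column.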

8. The Search Override

The model gets an explicit instruction — “work only from the document I pasted” — and ignores it, routing to its default behaviour instead.

The answer can be right. That’s exactly what makes this worth naming: it passes a basic accuracy check, but it came from the wrong place. When document discipline matters — reading a transcript before outside coverage colours your view, or a filing before the analyst spin lands — getting the right answer from the wrong source is still a failure of the task you set.

The record: Perplexity Pro, 15 May 2026 — I gave it an earnings prompt that said “work only from the pasted document,” and it ran 10 external web searches anyway. The output was correct, but it came from outside coverage rather than the supplied transcript. I ran the same prompt on two other tools and both stayed inside the document.

The tell: the answer cites or reflects material that wasn’t in what you gave it. Check whether the sources in the output appear in your document at all.

The defence: this one isn’t a bug — search-first routing is how that tool is built. Knowing the mode tells you when it’s the wrong tool for the job, not that the tool is broken.
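The tell above can be roughed out in code: pull the figures from the answer and see which ones never appear in the document you pasted. This is a crude Python sketch under my own assumptions. The function name, the regex, and both sample strings are invented for illustration, and a matching figure proves nothing about where the model really got it.

```python
import re

# Matches figures like 6,100 or 41.5% (thousands separators, optional decimals).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def figures_not_in_document(answer: str, document: str) -> list[str]:
    """List figures in the answer that appear nowhere in the document."""
    doc_figures = set(NUMBER.findall(document))
    return [fig for fig in NUMBER.findall(answer) if fig not in doc_figures]


# Hypothetical sample text, not taken from any real transcript.
document = "Revenue for the year was 6,100 (in thousands). Gross margin was 41.5%."
answer = "Revenue was 6,100 with a 41.5% gross margin; analysts expected 7,250."

outside = figures_not_in_document(answer, document)
print("figures not found in the pasted document:", outside)
```

Anything the check returns is a prompt to ask the model where that figure came from. An empty list is weaker evidence, because a search-first tool can reach the same numbers through outside coverage.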

9. The Personalised Answer

The model pulls in context from your earlier conversations, unprompted, so the answer is non-reproducible. A different user asking the identical question gets a different reply.

This one defeats my main defence, which is why it’s the failure I find hardest to shake. I check answers by asking again — wrong answers rarely repeat the exact same error, so a re-run usually catches them. This mode is the one that beats that. If the model carries old context into a fresh question, asking again in the same session doesn’t catch the contamination. It repeats it.

The record: Gemini, 13 June 2026 — on a standard fund comparison, it folded in personal context from earlier chats that I hadn’t mentioned in the question, making the answer specific to me and impossible for anyone else to reproduce.

The tell: the answer references something you didn’t tell it in this conversation. The fix is a fresh session, not a fresh question.

The defence: for anything you want to be able to check, run it in a clean session or a private window. The answer was just for you. Unfortunately, “just for you” and “correct” are different standards.

The other half of the log: the catches

The nine modes are the failures. They’re not the whole story. I log the catches alongside them — cases where a tool spotted exactly the kind of error another tool walked into — because the failures on their own give a false picture.

That’s the real lesson, and it’s why cross-checking works in practice. Not because any single tool is reliable — none of them is — but because the tool that falls into a trap is rarely the same tool that warns about it. Mode 2 is the clearest example: one model misread the scale of a filing; another, on the same kind of question, named that precise trap before going near it. Reliability doesn’t come from finding the one good model. It comes from not trusting any of them alone.

Two things this log doesn’t cover

Two failures people will reasonably ask about aren’t here, because I haven’t got a clean logged test for either — and a mode without a record doesn’t make the list.

The first is sycophancy: the model caving when you push back, abandoning a correct answer because you sounded sceptical. Every failure here was the model’s first answer, not a climbdown under pressure. My standard prompt opens by telling the model to act as a cautious analyst, which is implicitly a defence against this — but I’ve never tested it on a right answer under challenge.

The second is reproducibility drift as a deliberate test. Mode 9 came up by accident; I haven’t run the same prompt cold a dozen times to measure how often the answer shifts. Both are honest gaps. They’re also the next two things worth testing.

Field Report

What worked: the log separates nine distinct behaviours that the usual “types of hallucinations” guides collapse into one or two. Each one has a different tell and a different defence, so you can check for the specific failure your question is exposed to.

What didn’t: this is a small log, not a measured study. The mode counts are tendencies — Gemini clusters on the fabrication modes, Perplexity on the constraint-and-coverage ones, Claude has the fewest misses but isn’t absent — not rates I’d defend to a decimal.

Bottom line: useful as a checklist, not a scoreboard. The three that matter most for money decisions are the first three — invented numbers, misread scale, wrong subject. The defence for all three is the same single move: make the model name its source before you read its answer.

That one question — where did this come from, and what date is it from — has caught more of these than any clever prompt I’ve written. It costs a sentence. The wrong number costs rather more.

The whole log lives on the lessons page and the catches page: every failure with its screenshot, and every catch beside it. The deeper write-ups behind these modes are the three-tool comparison, the four-tool test, and the audit that reviewed the wrong company. The prompt I open every session with — the one that asks for a cautious analyst, not a cheerleader — is the Prompt Stack. And if you’re still choosing which free tool to start with, I’ve put the main ones through their paces in the free tools for stock research.