Skip to content
// Evidence / Lessons

What AI got wrong.

A running index of every AI fabrication, unit error, and confident-wrong answer caught in dixon.ai tests. Tool, output, what was actually true, the screenshot. The Prompt Stack is the antidote.

The counterweight log, what AI caught, lives at /catches. Same evidence bar, opposite framing.

// By impact
// By error
// By tool
// Most recent first
ChatGPT · misattributed-source
Decision-affecting

Asked for the UK ISA partial-transfer rule with a source, ChatGPT (free, web search on) cited gov.uk/individual-savings-accounts/if-you-move-abroad-or-die, a real, live gov.uk page about what happens to an ISA when you move abroad or die. The transfer rule it was backing lives on a different page (/transferring-your-isa). The URL resolved; it just didn't hold the claim.

Screenshot — ChatGPT misattributed-source
Perplexity · low-authority-source-led
Best (auto-routing, underlying model unknown) Cosmetic, but revealing

Asked how long cooked chicken keeps in the fridge by a stated UK (Newcastle) user, Perplexity (web search on) led with US food blogs, Martha Stewart, Springer Mountain Farms, and gave the US figure of 3-4 days. The UK FSA guidance (2 days for cooked leftovers, per food.gov.uk) appeared as a secondary note, not the primary answer. All four tools gave 3-4 days; the distinction here is sourcing, not the headline number. Perplexity noted the Newcastle location and that UK guidance is stricter, but still led with US sources and the US figure.

Perplexity · outdated-rule-stated-as-current
Best (auto-routing, underlying model unknown) Decision-affecting

Asked whether this year's ISA contributions can be partially transferred, Perplexity said they must be transferred in full, the rule abolished on 6 April 2024. Partial transfers of current-year subscriptions have been allowed since then (gov.uk). Stated with no date and no hedge. ChatGPT (Free) gave the same outdated answer.

Perplexity · linear-scaling-of-non-linear-quantity
Best (auto-routing, underlying model unknown) Decision-affecting

Asked to scale a pancake recipe from 4 to 9 servings, Perplexity's basic answer stated 'Total cook time: 45 minutes' with no caveat, a straight 20 × 2.25. Cooking time per pancake doesn't scale; batch count does. A user following it would expect to finish in 45 minutes and be wrong. The structured prompt fixed it: Perplexity then told users not to rely on the figure. Gemini's basic answer made the same error in softer form ('~2.25× as long, about 45 minutes').

ChatGPT · fabricated-premium-table
ChatGPT Free plan, rate-limited fallback (likely GPT-4o mini), web search on Decision-affecting

Asked for AAPL covered-call strikes and premiums with no chain data supplied, ChatGPT generated a full premium table with specific dollar ranges and yields, an assumed 25% implied volatility, and a Barchart citation, framing it with language like 'recent options-chain snapshots' that implies live data. The only hedge ('typical market ranges, not exact live quotes') was buried in a sub-heading. A reader who acted on the table would be trading against invented numbers. Ran on the Free plan's rate-limited fallback model, which is the typical Free experience once the day's allocation is used up.

Screenshot — ChatGPT fabricated-premium-table
Gemini · wrong-entity-audit
Gemini Flash Decision-affecting

Asked to review 'Dixon Dixon AI' (a voice-input transcription of dixon.ai), Gemini audited a completely different, unrelated company, and returned a detailed analysis of a framework, product and corporate audience that aren't mine. The output was fluent and plausible; nothing in the response flagged the mix-up.

Screenshot — Gemini wrong-entity-audit
Gemini · stale-memory-as-current
Gemini Flash Cosmetic, but revealing

In a second session naming dixon.ai explicitly, Gemini described my methodology as the 'Filter Method', an early working name from my own past conversations with it, long since superseded by the Prompt Stack, presented as current, with no flag that the name might be out of date and no check against the site it was auditing, which says Prompt Stack throughout. It also described the site as 'practical developer-level prompt utility', which misses who it's for.

Screenshot — Gemini stale-memory-as-current
Gemini · Partial fabrication
Gemini Pro Decision-affecting

Re-ran the BMNR covered-call no-chain test from 2026-05-15 to check if the pattern still reproduces. It does, in a softer form. Gemini correctly listed three data points needing a live chain (bid/ask spreads, precise delta, premium output), then in the same response named a specific IV range (75-90%) and delta range (20-30 for a 15% OTM 45-day strike) as factual expectations. No chain, no source. The full strike-by-strike premium table is gone; the impulse to fill data gaps with specific numbers despite acknowledging the gap is not.

Claude · Stale prompt framing
Claude Opus 4.7 Cosmetic, but revealing

Re-ran two prompts on Claude Opus 4.7 with live search on. Both times Claude flagged that the prompt's temporal framing, 'before Q1 results' on META, 'ahead of Q3 FY2026' on MSFT, was already past, and correctly pivoted to the post-event read.

Screenshot — Claude stale prompt framing
ChatGPT · Web confabulation
ChatGPT (web search, May 2026) Decision-affecting

Returned a specific earnings date for an upcoming W4 release, sourced from MarketBeat via web search, with no uncertainty qualifier on whether the fiscal calendar had shifted. The confidence was inherited from the source's format, not earned by the model.

Screenshot — ChatGPT web confabulation
Claude · Inferred input
Claude Opus 4.7 (Max) Decision-affecting

Estimated BMNR $23 call assignment probability via Black-Scholes N(d2) with a sigma of 90–110% it had inferred from historical references found via web search. The formula was correctly named, the inputs were imagined, and the output was presented with false precision.

Screenshot — Claude inferred input
Perplexity · Ignored constraint
Perplexity Pro (default) Cosmetic, but revealing

On a Meta Q1 2026 earnings prompt that explicitly instructed 'work only from the pasted document', Perplexity ran 10 external web searches. The output was technically correct but came from external coverage of the release rather than reasoning over the supplied transcript. Not a bug, Perplexity routes to search as its default behaviour, but a constraint-following failure that matters when the test is designed to measure document discipline. Same prompt run on ChatGPT and Claude stayed inside the document.

Perplexity · Unit error
Perplexity Sonar Pro Decision-affecting

Read BMNR revenue as $6K instead of $6.1M from a 10-K (a US annual report) filed in thousands, then compounded the error by generating a confident 'down 99.8% from prior year' decline narrative around the wrong figure. A retail investor acting on this would have a materially false picture of the business. (Re-tested 14 June 2026: did not reproduce. Perplexity returned the correct ~$6.1M figure. Logged as a dated, point-in-time failure.)

Screenshot — Perplexity unit error
Gemini · Fabrication
Gemini 2.5 Pro (deep thinking) Decision-affecting

Returned a formatted covered-call comparison table with specific premium estimates ($3.50–$4.00 for the $26 strike, etc.), made up an implied volatility figure of ~75%, used the wrong stock price ($28.60 vs $21.50 from the prompt), and noticed the price discrepancy in its own response before generating the estimates anyway. (Re-tested 14 June 2026 on Gemini's default Pro model: did not reproduce; the original ran on deep-thinking mode, untested in the re-run. Logged as a dated, point-in-time failure.)

Screenshot — Gemini fabrication
Perplexity · Unit error
Decision-affecting

On BMNR (a thinly-covered name) Perplexity read a 10-K (a US annual report) reported 'in thousands' literally, turning $6,095 thousand ($6.1m) into '$6K', then narrated a confident 'down 99.8% from prior year' decline that never happened. Re-tested 14 June 2026: did not reproduce. Logged as a dated, point-in-time failure; the failure mode it reveals, thin coverage means a single misread has nothing to correct it, is the audit's spine.

Screenshot — Perplexity unit error
// The antidote

Four prompts that stop AI inventing the answer.

Read the method →

Every entry on this page is a real failure on a real prompt — no simulated examples, no curated highlights. The list grows as new tests turn up new failures.

The counterweight log lives at /catches: the moments where the model spotted what was missed. Same evidence bar, opposite framing. Both pages sit under /evidence, the matched-pair view.

Subscribe to the failure feed: /lessons/rss.xml. Combined evidence feed: /evidence/rss.xml.

Citing this log? It's machine-readable at /lessons.json — stable entry IDs, every entry linked to its source post, updated on every build. Quote with attribution and link the entry.