// Evidence / Catches

What AI caught.

The counterweight to /lessons. A running index of moments where the model spotted what was missed: a language tell, an asymmetry, a sharper reframe of a thesis. Same evidence bar as the failure log: specific, observable, falsifiable. "AI was helpful" does not qualify; "Claude was the only tool to flag the word 'underestimate' as one-sided phrasing in the META Q1 call" does.

// Most recent first

Perplexity · correct-source-attribution 20 JUN 2026

Decision-relevant

Given the same ISA transfer question, Perplexity cited the correct gov.uk page (/transferring-your-isa) and quoted the line that actually contains the rule: 'You can transfer all or part of the savings in your ISA.' Same question, same day: the right page.

Claude · flagged-uncertainty-and-verified 19 JUN 2026

Claude Opus 4.8 Decision-relevant

Before answering the ISA edge cases, Claude explicitly flagged 'ISA rules have seen recent changes' and ran four web searches to verify, the only model to say so unprompted, then gave the correct post-April-2024 partial-transfer answer and volunteered the April-2027 cash-ISA change unasked. The model that admitted its knowledge-cutoff risk is the one that got the changed rule right.

Claude · non-linear-constraint-flag 19 JUN 2026

Claude Opus 4.8 Decision-relevant

In the basic condition (no structured instructions) Claude spontaneously caught the cooking-time non-linearity: 'expect roughly double the total active time for 9 servings, but no individual pancake cooks any longer.' No other model got this right unprompted. ChatGPT hedged, Gemini and Perplexity both stated 45 minutes. Under the method prompt Claude gave per-ingredient confidence levels including 'low as a single figure, high as cook to doneness' for cooking time.

Claude · searched-before-answering-changed-rule 19 JUN 2026

Claude Opus 4.8 Decision-relevant

On the ISA partial-transfer question, Claude flagged that ISA rules had changed recently and ran web searches before answering, then gave the correct post-April-2024 rule. The two tools that answered from training alone gave the rule abolished in April 2024. Searching the authoritative source is the mechanism that got the changed rule right, documented in full in the ISA test.

Claude · Language tell 18 JUN 2026

Claude Opus 4.8 High (Max plan) Decision-relevant

Asked only 'what did the CFO commit to on capital expenditure?' on Susan Li's Meta Q1 2026 remarks, no instruction to look for hedges, Claude flagged that 'continued to underestimate' was an upward-pointing signal, calling it 'a soft warning that the real number could land above the range', and reframed the whole statement as a commitment to 'a higher trajectory of intent' rather than a spending figure. ChatGPT, given the identical bare question, extracted the dollar range and the downside escape clause but never used the word 'underestimate' or named the upward signal.

Claude · unit-error-flag 18 JUN 2026

Claude Opus 4.8 (Max plan, web search) Decision-relevant

On the 14 June re-test of Dimension 1, Claude proactively flagged the exact unit-denomination trap that produced Perplexity's original $6K-vs-$6.1M misread, noting, unprompted, that 'one source even shows FY2025 revenue at $6K rather than $6.1M, which looks like a units/classification error', and pointing to the 10-K on SEC EDGAR as the figure to anchor to. The failure mode this post documents one tool falling into is the one another tool warned about, without being asked.

Perplexity · honest-substitution 18 JUN 2026

Validation

Asked for a UK AIM company's revenue and adjusted EBITDA, Perplexity returned sourced figures that checked out against the company's actual full-year results (revenue £569.7m, adjusted operating profit £107.4m), and, finding no published adjusted EBITDA line, said so plainly and substituted adjusted operating profit rather than inventing a number: 'I couldn't find a clear company-published adjusted EBITDA headline in the retrieved sources for FY25, so I used the company's reported adjusted operating profit figure.' Knowing what it doesn't know is the behaviour the BMNR failure lacked.

Claude · non-recurring-strip 11 JUN 2026

Claude (Fable 5) Insight-relevant

On META's Q1 2026 earnings release, Claude identified the lone non-recurring item, an $8.03bn one-time tax benefit, and returned an adjusted net income of $18.7bn, flagging the 30% gap against the stated 10% threshold unprompted. Run next on the cash-to-profit ratio, it used the adjusted $18.7bn rather than the headline $26.8bn and noted the unadjusted 1.20x against the adjusted 1.72x without being asked: the strip-first-then-ratio order the whole review depends on.

Gemini · entity-overlap-risk 10 JUN 2026

Gemini Flash Decision-relevant

In the session that named dixon.ai explicitly, Gemini correctly identified the brand-collision risk with a similarly-named, established company at a near-identical domain, named the competing entity accurately, and flagged that 'dixon ai' searches face crowded competition from an established corporate site. Search Console data for dixon.ai confirms it: 'dixon ai' searches largely route to that other site. The catch was genuine; it just arrived alongside an out-of-date name for my method and a wrong audience description.

Claude · Sharper reframe 22 MAY 2026

Claude Opus 4.7 Decision-relevant

On a META sell-some-vs-hold question, same position, same capex-raise context as the 1 May thesis-audit run, Claude reframed the bounded-capex break sharper than the original Q2 paraphrase: 'the floor of 2026 guidance now sits above the ceiling you assumed.' Same conclusion as the run three weeks earlier; a more memorable formulation. Run on Claude Opus 4.7 with live web search.

Claude · Asymmetry tell 22 MAY 2026

Claude Opus 4.7 Insight-relevant

On the META Q1 2026 capex prepared remarks, Claude flagged a language asymmetry I'd missed on first read: 'more than 1 GW' was the specific number attached to the Broadcom partnership, but the AMD clause two lines earlier said 'significant amount' with no number. Same paragraph, two clauses: one falsifiable commitment, one defensible-as-aspiration. The kind of softness you only spot on the second read of an earnings transcript.

Claude · Stale-data flag 20 MAY 2026

Claude Opus 4.7 Insight-relevant

On a generic MSFT company-snapshot prompt, Claude returned the segment split as FY2024 figures (roughly two years behind current reporting) and self-flagged the staleness in its Verdict section: 'Microsoft restructured its segment composition effective Q1 FY2025; verify against the live 10-K before quoting these percentages.' The model was honest about the limit of its own training data without being asked.

Claude · Language tell 15 MAY 2026

Claude Opus 4.7 Decision-relevant

Same Susan Li META Q1 2026 prepared remarks passage as the earlier catch, framed around the prompt that catches it. Claude was the only one of four tools to flag what Li did with the word 'underestimate': she said Meta had 'continued to underestimate' its compute needs, language that points upward without making a real commitment to spend more. The three-check red-flag prompt is designed to run the same catch on any transcript.

Claude · Language tell 15 MAY 2026

Decision-relevant

On Susan Li's META Q1 2026 prepared remarks, Claude was the only one of four tools tested to pick up what the CFO did with the word 'underestimate'. She said the company had 'continued to underestimate' compute needs: language that signals an ongoing structural pattern without committing to what management will spend next. ChatGPT, Gemini and Perplexity read the same passage and missed it.

Claude · stayed-in-lane 14 MAY 2026

Claude Opus, Max plan, web search Decision-relevant

Given a covered-call setup with no live options chain, Claude declined to invent premiums, implied volatility or Greeks, telling the user to plug in real numbers from the broker rather than generating plausible-looking ones. The same prompt shape produced fabricated tables from Gemini and ChatGPT. The clean answer was a refusal to fill the gap, which on a live-data question is the right answer.

// The matched pair

Read this alongside /lessons, the failure log. Both pages document the same kind of evidence pointing in different directions.

Read the method →

Every entry is a real moment from a real prompt — no curated highlights, no marketing language. The list grows as new tests turn up new catches. "AI was helpful" does not qualify; the catch has to be specific enough that a sceptical reader could re-run the prompt and check.

The failure log lives at /lessons: same evidence bar, opposite framing. Both pages sit under /evidence, the matched-pair view.

Subscribe to the catch feed: /catches/rss.xml. Combined evidence feed: /evidence/rss.xml.

Citing this log? It's machine-readable at /catches.json — stable entry IDs, every entry linked to its source post, updated on every build. Quote with attribution and link the entry.