// Evidence / Scoreboard

The AI Reliability Scoreboard.

Every AI error this site has caught, tallied by model. Built from the same documented cases as the failure log and the catch log: real tests on real decisions, each number linked to the evidence behind it.

20 errors documented
15 catches documented
4 models scored
// Ranked by documented errors
Model | Errors | Decision-affecting | Catches | What it got wrong
Perplexity | 7 | 4 | 2
  • Unit error 2
  • outdated-rule-stated-as-current 1
  • Ignored constraint 1
  • linear-scaling-of-non-linear-quantity 1
  • Inconsistent returns 1
  • low-authority-source-led 1
Gemini | 6 | 4 | 1
  • Fabrication 2
  • Partial fabrication 1
  • Wrong-entity audit 1
  • Stale memory as current 1
  • Cross-chat memory leak 1
ChatGPT | 4 | 4 | 0
  • outdated-rule-stated-as-current 1
  • Web confabulation 1
  • fabricated-premium-table 1
  • misattributed-source 1
Claude | 3 | 1 | 12
  • Inferred input 1
  • Stale prompt framing 1
  • Stale figure (with web search) 1

The Errors and Catches numbers link through to the filtered /lessons and /catches views: the actual prompts, outputs and screenshots behind every count.

// The other column: what AI caught
Claude 12
  • Language tell 3
  • flagged-uncertainty-and-verified 1
  • Sharper reframe 1
  • non-recurring-strip 1
  • stayed-in-lane 1
  • unit-error-flag 1
  • Asymmetry tell 1
  • Stale-data flag 1
  • non-linear-constraint-flag 1
  • searched-before-answering-changed-rule 1
Perplexity 2
  • correct-source-attribution 1
  • honest-substitution 1
Gemini 1
  • Entity-overlap risk 1

A catch is the model spotting something I missed, or flagging its own stale data before I acted on it. Same bar as a finding: specific, observable, falsifiable. The full log is at /catches.
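
Because every count here is a raw tally, the arithmetic can be checked mechanically. A minimal sketch, with the figures transcribed by hand from the two tables above; the dictionary layout and the function names are illustrative assumptions, not the site's own code:

```python
# Each model maps to (stated total, per-category breakdown), copied from the page.
errors = {
    "Perplexity": (7, {
        "Unit error": 2,
        "outdated-rule-stated-as-current": 1,
        "Ignored constraint": 1,
        "linear-scaling-of-non-linear-quantity": 1,
        "Inconsistent returns": 1,
        "low-authority-source-led": 1,
    }),
    "Gemini": (6, {
        "Fabrication": 2,
        "Partial fabrication": 1,
        "Wrong-entity audit": 1,
        "Stale memory as current": 1,
        "Cross-chat memory leak": 1,
    }),
    "ChatGPT": (4, {
        "outdated-rule-stated-as-current": 1,
        "Web confabulation": 1,
        "fabricated-premium-table": 1,
        "misattributed-source": 1,
    }),
    "Claude": (3, {
        "Inferred input": 1,
        "Stale prompt framing": 1,
        "Stale figure (with web search)": 1,
    }),
}

catches = {
    "Claude": (12, {
        "Language tell": 3,
        "flagged-uncertainty-and-verified": 1,
        "Sharper reframe": 1,
        "non-recurring-strip": 1,
        "stayed-in-lane": 1,
        "unit-error-flag": 1,
        "Asymmetry tell": 1,
        "Stale-data flag": 1,
        "non-linear-constraint-flag": 1,
        "searched-before-answering-changed-rule": 1,
    }),
    "Perplexity": (2, {"correct-source-attribution": 1, "honest-substitution": 1}),
    "Gemini": (1, {"Entity-overlap risk": 1}),
    "ChatGPT": (0, {}),
}


def mismatches(table):
    """Return the models whose breakdown does not sum to the stated total."""
    return [m for m, (total, parts) in table.items() if sum(parts.values()) != total]


def grand_total(table):
    """Sum the stated per-model totals."""
    return sum(total for total, _ in table.values())


print(mismatches(errors), grand_total(errors))    # [] 20
print(mismatches(catches), grand_total(catches))  # [] 15
```

Both breakdowns add up to their row totals, and the row totals add up to the headline figures of 20 errors and 15 catches across 4 models.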

// How to read this

Documented cases, not a failure rate.

This is a record of what happened in real tests (May 2026 to June 2026), not a benchmark. I did not run the same prompt an equal number of times against every model, the cases were not sampled at random, and the table only counts what got written up. So a higher number means more documented evidence, not proven unreliability. A model near the top may simply be one I lean on more, or one that fails in more screenshot-worthy ways.

The sample is small: low tens of cases. That is too small for percentages, so there are deliberately none here: no rates, no "X% wrong", just raw counts you can click through and check. If a number looks high or low, read the entries behind it and judge for yourself. That is the whole point of linking every count to its source.

Every entry was verified by hand before it went up, and every model named here is one I use and rate. The aim is an honest tally of where AI got things wrong on decisions that mattered and, in the next column, where it earned its place.

// The antidote

Four prompts that stop AI inventing the answer in the first place.

Read the method →

The scoreboard is built from the /lessons and /catches logs and rebuilds whenever they do. Both sit under /evidence: the matched-pair view.
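
One way a tally like this could be rebuilt from two logs is sketched below. This is an illustration under stated assumptions, not the site's actual build: the entry fields (`model`, `kind`, `decision_affecting`), the function name `build_scoreboard` and the placeholder model names are all hypothetical.

```python
from collections import Counter

# Hypothetical log entries. The field names are assumptions, not the site's schema,
# and "Model A" / "Model B" are placeholders rather than real results.
entries = [
    {"model": "Model A", "kind": "error", "decision_affecting": True},
    {"model": "Model A", "kind": "error", "decision_affecting": False},
    {"model": "Model B", "kind": "error", "decision_affecting": True},
    {"model": "Model B", "kind": "catch"},
    {"model": "Model B", "kind": "catch"},
]


def build_scoreboard(entries):
    """Tally errors, decision-affecting errors and catches per model."""
    errors, affecting, catches = Counter(), Counter(), Counter()
    for entry in entries:
        model = entry["model"]
        if entry["kind"] == "error":
            errors[model] += 1
            if entry.get("decision_affecting"):
                affecting[model] += 1
        elif entry["kind"] == "catch":
            catches[model] += 1

    rows = [
        {
            "model": model,
            "errors": errors[model],
            "decision_affecting": affecting[model],
            "catches": catches[model],
        }
        for model in set(errors) | set(catches)
    ]
    # Rank by documented errors, highest first; break ties by model name.
    rows.sort(key=lambda row: (-row["errors"], row["model"]))
    return rows


for row in build_scoreboard(entries):
    print(row)
```

Deriving the table from the entries on every build, rather than storing counts separately, means a count cannot drift away from the cases it links to.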

Citing this? It's machine-readable at /scoreboard.json: the same counts, each model linked to its source entries, updated on every build. Quote with attribution and link the page. Licence: CC BY 4.0.