Field Notes

I let three rival AIs audit my own AI method. They agreed on the weak step.

I asked three frontier models from three different labs to tear apart the method I use to keep AI honest. All three flagged the same step, and they were right.

// TL;DR
Problem: My method's first step told the AI to be sceptical, and I quietly suspected it did nothing.
Fix: Instead of trusting my own hunch, I ran the method's own move on itself: three rival AIs from three labs, each asked to tear it apart.
Payoff: All three independently killed the same step, so I replaced it. The method got better because I stopped trusting one opinion, including my own.

For the better part of a year I have run the same four-step method before I trust anything an AI tells me on a decision that matters. Step one was called ROLE: tell the model to take a sceptical stance, to hunt for reasons the answer is wrong rather than reasons it is right. It always felt like the softest step in the set. I kept it anyway, because it was tidy and it sounded sensible.

This week I stopped trusting that hunch and did the one thing the method itself is built on: get a second opinion, then a third. I handed the whole method to three frontier models from three different labs, GLM, DeepSeek and Qwen, and asked each of them, separately, to pull it apart. I gave no hint about which step I doubted.

Using three labs matters here. The models do not share a training run, so they are less likely to share the same blind spots. When three models that have never met land in the same place, that is unlikely to be a coincidence. It is a signal. It is the same reason the AI Reliability Scoreboard grades five assistants and not one.

All three went for the same step

They did not hedge, and they did not spread their fire. All three went straight for ROLE, and independently made the same point: telling an AI to “be sceptical” changes its tone, not its accuracy. It writes you a more cautious-sounding answer. It does not write you a more correct one. A made-up figure delivered in a careful, sceptical register is still a made-up figure.

GLM, the model from Z.ai, put it most plainly:

Telling an LLM to “be sceptical” mostly changes tone, not epistemics. It’ll sound more cautious while still hallucinating.

I had half-known that. The honest version is that I had known it for months and kept the step because pulling it meant admitting the method had a soft spot. Hearing it from three machines with no stake in my feelings was the push I had been avoiding.

So ROLE is gone. Meet SCOPE.

SCOPE does a different job. Instead of setting a mood, it sets a boundary. I now tell the model exactly what it is allowed to use, and have it say plainly when a question is past what it can verify, rather than papering over the gap with something fluent. “Be sceptical” asks for an attitude. SCOPE narrows the room to invent. Only one of those moves the error rate.
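As a rough illustration of the difference, here is a minimal sketch of how a SCOPE-style boundary could be wrapped around a question. It is not the wording used in the Prompt Stack: the function name `build_scope_prompt`, the `OUT_OF_SCOPE` phrase and the prompt text are all assumptions made up for this example, and the sample source in the usage lines is invented.

```python
# Hypothetical phrase the model is asked to use when it cannot verify something.
OUT_OF_SCOPE = "OUT OF SCOPE: cannot verify from the sources given"


def build_scope_prompt(question: str, sources: dict) -> str:
    """Wrap a question in a SCOPE-style boundary (illustrative wording only).

    `sources` maps a short source name to the text the model may use.
    """
    if not sources:
        # With nothing to draw on there is no boundary to set.
        raise ValueError("SCOPE needs at least one allowed source")

    # Name each source so the answer can point back to it.
    source_block = "\n\n".join(
        f"[{name}]\n{text.strip()}" for name, text in sources.items()
    )

    return (
        "Use ONLY the sources below. Do not use anything else you may know.\n"
        "If the sources do not let you verify part of the answer, reply "
        f"'{OUT_OF_SCOPE}' for that part instead of estimating.\n\n"
        f"SOURCES\n{source_block}\n\n"
        f"QUESTION\n{question.strip()}"
    )


# Usage with an invented source, purely to show the shape of the prompt.
prompt = build_scope_prompt(
    "What was revenue for the quarter?",
    {"Example filing": "Revenue for the quarter was 100 units."},
)
print(prompt)
```

The point of the sketch is that the instruction names what is allowed and gives the model an explicit way out, where "be sceptical" only asks for a tone.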

The second step, FILTER, got sharper in the same pass. It used to ask the model to separate fact from guesswork. Now it has to label every claim as sourced, inferred or guessed, so you can see at a glance which parts to check first. The guessed ones, usually.
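To make the labelling concrete, here is a small sketch of what checking a FILTER-style answer might look like. It assumes the model was asked to start each claim with `[sourced]`, `[inferred]` or `[guessed]` on its own line. That line format, and the names `parse_claims` and `check_order`, are assumptions for this example, not the method's actual output format.

```python
import re

# Riskiest label first, so guessed claims surface at the top of the list.
CHECK_PRIORITY = ("guessed", "inferred", "sourced")

# Assumed line format: "[label] claim text", label case-insensitive.
CLAIM_RE = re.compile(r"^\[(sourced|inferred|guessed)\]\s*(.+)$", re.IGNORECASE)


def parse_claims(answer: str):
    """Split an answer into (label, claim) pairs plus any unlabelled lines."""
    claims, unlabelled = [], []
    for raw in answer.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = CLAIM_RE.match(line)
        if match:
            claims.append((match.group(1).lower(), match.group(2).strip()))
        else:
            # An unlabelled line means the model skipped the labelling rule.
            unlabelled.append(line)
    return claims, unlabelled


def check_order(claims):
    """Order claims so the ones to verify first come first (stable sort)."""
    return sorted(claims, key=lambda claim: CHECK_PRIORITY.index(claim[0]))
```

Unlabelled lines are returned separately because, under this assumption, a claim with no label is itself something to check.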

The method got better the moment I stopped exempting it from its own rule

There is a neatness to this I cannot pretend I planned. This whole site exists to make one argument: do not trust a single confident AI answer, check it against something else. The method is that argument turned into steps. And the method only improved once I ran the rule on the rule, and checked my own work against a different intelligence. The different intelligence was right.

Field Report

What worked: Three models from three labs, asked cold, converged on the same weak step. The disagreement I went looking for never came. The agreement was the answer.

What didn’t: My own judgment, left alone, sat on the doubt for months. A hunch you will not act on is just a worry with better manners.

Bottom line: Cross-checking against a model from a different lab is now a step in the method, not just something I once did to the method. Useful. The bar for the next change is simple: it has to survive the same panel.
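The panel test can be sketched in a few lines. This assumes each reviewer's critique has already been read and reduced, by hand, to the single step it flagged as weakest. The function name `converged_step` and the `quorum` parameter are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from typing import Optional


def converged_step(flags: dict, quorum: int) -> Optional[str]:
    """Return the step that at least `quorum` reviewers flagged, else None.

    `flags` maps a reviewer name to the step it called weakest.
    Pick a quorum above half the panel so ties cannot produce a winner.
    """
    if not flags:
        return None
    # Normalise spelling so "role " and "ROLE" count as the same vote.
    counts = Counter(step.strip().upper() for step in flags.values())
    step, votes = counts.most_common(1)[0]
    return step if votes >= quorum else None


# The result described in this post: all three reviewers flagged ROLE.
panel = {"GLM": "ROLE", "DeepSeek": "ROLE", "Qwen": "ROLE"}
print(converged_step(panel, quorum=3))  # prints ROLE
```

A `None` result is not a failure. It means the panel disagreed, and the disagreement is the thing to read next.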


The revised method is live and open. The four steps are at the Prompt Stack, and the whole thing is on GitHub under a licence that lets you take it, fork it or teach it. If you run something similar, the cheapest upgrade you can make is the one I resisted for too long: before you trust the answer, put the same question to a model from a different lab, and go looking for where the two of them disagree. That gap is where the work is.
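Finding the gap between two answers can be as simple as the sketch below. It assumes each model's answer has already been boiled down to the same set of named fields, such as a figure, a date and a source. That extraction step, the field names and the function `disagreements` are all assumptions for illustration. Comparing free text directly is harder and is not attempted here.

```python
def _norm(value):
    """Ignore case and surrounding whitespace when comparing values."""
    return None if value is None else value.strip().casefold()


def disagreements(first: dict, second: dict) -> dict:
    """Return the fields where two answers differ or where one is silent.

    Each result maps a field name to (value from first, value from second).
    A value of None means that model gave nothing for the field.
    """
    gaps = {}
    for key in sorted(set(first) | set(second)):
        left, right = first.get(key), second.get(key)
        if _norm(left) != _norm(right):
            gaps[key] = (left, right)
    return gaps
```

An empty result only shows the two models agree with each other, not that either is right, so the sourced, inferred and guessed labels still decide what gets checked by hand.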

// Written by Ben Dixon

Ben tests ways of getting reliable answers from AI on his own investing: documenting what each model got wrong, what each one caught, and the prompts that survived the cuts.
