I let three rival AIs audit my own AI method. They agreed on the weak step.

For the better part of a year I have run the same four-step method before I trust anything an AI tells me on a decision that matters. Step one was called ROLE: tell the model to take a sceptical stance, to hunt for reasons the answer is wrong rather than reasons it is right. It always felt like the softest step in the set. I kept it anyway, because it was tidy and it sounded sensible.

This week I stopped trusting that hunch and did the one thing the method itself is built on: get a second opinion, then a third. I handed the whole method to three frontier models from three different labs, GLM, DeepSeek and Qwen, and asked each of them, separately, to pull it apart. I gave no hint about which step I doubted.

The three labs matter here. They do not share a training run, so they are much less likely to share the same blind spots. When three models that have never met land in the same place, that is not a coincidence; it is a signal. It is the same reason the AI Reliability Scoreboard grades five assistants and not one.

All three went for the same step

They did not hedge, and they did not spread their fire. All three went straight for ROLE, and independently made the same point: telling an AI to “be sceptical” changes its tone, not its accuracy. It writes you a more cautious-sounding answer. It does not write you a more correct one. A made-up figure delivered in a careful, sceptical register is still a made-up figure.

GLM, the model from Z.ai, put it most plainly:

Telling an LLM to “be sceptical” mostly changes tone, not epistemics. It’ll sound more cautious while still hallucinating.

I had half-known that. The honest version is that I had known it for months and kept the step because pulling it meant admitting the method had a soft spot. Hearing it from three machines with no stake in my feelings was the push I had been avoiding.

So ROLE is gone. Meet SCOPE.

SCOPE sets a boundary instead of a mood. I now tell the model exactly which material it is allowed to use, and have it say plainly when a question is past what it can verify from that material, rather than papering over the gap with something fluent. “Be sceptical” asks for an attitude. SCOPE removes the room to invent. Only one of those moves the error rate.
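As a rough illustration of the difference, here is a minimal sketch of a SCOPE-style prompt builder in Python. The function name, the marker text and the CANNOT VERIFY phrase are all assumptions made for this example, not the wording used in the Prompt Stack.

```python
# Hypothetical refusal phrase; any fixed, easy-to-spot string would do.
CANNOT_VERIFY = "CANNOT VERIFY"


def build_scope_prompt(question: str, sources: dict[str, str]) -> str:
    """Build a prompt that limits the model to the supplied material."""
    # Give each source a short id so the answer can point back to it.
    material = "\n\n".join(
        f"[{source_id}]\n{text.strip()}" for source_id, text in sources.items()
    )
    return (
        "Use only the material between the markers below. "
        "Cite the source id for every claim.\n"
        "If the material does not settle the question, reply "
        f"'{CANNOT_VERIFY}' and say what is missing. Do not fill the gap.\n\n"
        f"--- MATERIAL ---\n{material}\n--- END MATERIAL ---\n\n"
        f"Question: {question}"
    )


# Usage with a made-up source, purely for illustration.
prompt = build_scope_prompt(
    "When does the trial period end?",
    {"S1": "The trial period lasts 30 days from sign-up."},
)
```

The point of the fixed refusal phrase is practical: it is something you can search for in the answer, which “be sceptical” never gave you.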

The second step, FILTER, got sharper in the same pass. It used to ask the model to separate fact from guesswork. Now it has to label every claim as sourced, inferred or guessed, so you can see at a glance which parts to check first. The guessed ones, usually.
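The labels also make the answer easy to sort by machine. The sketch below assumes the model has been asked to start each claim on its own line with a tag such as [sourced], [inferred] or [guessed]; that line format is an assumption for this example, and the function names are hypothetical.

```python
import re
from typing import NamedTuple

# Check order: least grounded first.
CHECK_ORDER = ("guessed", "inferred", "sourced")

# Matches lines such as "[guessed] The fee is waived in year one."
_LABELLED_LINE = re.compile(r"^\s*\[(sourced|inferred|guessed)\]\s*(.+)$", re.IGNORECASE)


class Claim(NamedTuple):
    label: str
    text: str


def parse_claims(answer: str) -> list[Claim]:
    """Collect every labelled line; unlabelled lines are skipped."""
    claims = []
    for line in answer.splitlines():
        match = _LABELLED_LINE.match(line)
        if match:
            claims.append(Claim(match.group(1).lower(), match.group(2).strip()))
    return claims


def in_check_order(claims: list[Claim]) -> list[Claim]:
    """Sort guessed claims first, then inferred, then sourced."""
    # sorted() is stable, so claims with the same label keep their order.
    return sorted(claims, key=lambda claim: CHECK_ORDER.index(claim.label))
```

A line with no label is dropped here rather than guessed at; in practice you might prefer to flag it, since an unlabelled claim means the model ignored the instruction.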

The method got better the moment I stopped exempting it from its own rule

There is a neatness to this I cannot pretend I planned. This whole site exists to make one argument: do not trust a single confident AI answer, check it against something else. The method is that argument turned into steps. And the method only improved once I ran the rule on the rule, and checked my own work against a different intelligence. The different intelligence was right.

Field Report

What worked: Three models from three labs, asked cold, converged on the same weak step. The disagreement I went looking for never came. The agreement was the answer.

What didn’t: My own judgment, left alone, sat on the doubt for months. A hunch you will not act on is just a worry with better manners.

Bottom line: Cross-checking against a model from a different lab is now a step in the method, not just something I once did to the method. Useful. The bar for the next change is simple: it has to survive the same panel.

The revised method is live and open. The four steps are at the Prompt Stack, and the whole thing is on GitHub under a licence that lets you take it, fork it or teach it. If you run something similar, the cheapest upgrade you can make is the one I resisted for too long: before you trust the answer, put the same question to a model from a different lab, and go looking for where the two of them disagree. That gap is where the work is.
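If you want a crude first pass at finding that gap, something like the following would do. It is only a sketch under a strong assumption: that two answers agree on a point when they contain the same sentence after lower-casing and stripping punctuation. Paraphrases will show up as disagreements, so treat the output as a list of places to look, not a verdict. The function names are hypothetical, and fetching the two answers from the two models is left out.

```python
import re


def normalise(sentence: str) -> str:
    """Lower-case, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", sentence.lower())
    return " ".join(cleaned.split())


def split_claims(answer: str) -> set[str]:
    """Split an answer into normalised sentences."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return {claim for claim in (normalise(part) for part in parts) if claim}


def disagreements(answer_a: str, answer_b: str) -> dict[str, list[str]]:
    """Return the sentences that appear in one answer but not the other."""
    claims_a, claims_b = split_claims(answer_a), split_claims(answer_b)
    return {
        "only_a": sorted(claims_a - claims_b),
        "only_b": sorted(claims_b - claims_a),
    }


# Usage with two made-up answers to the same question.
gap = disagreements(
    "The limit is 10. It resets daily.",
    "The limit is 10. It resets weekly.",
)
# gap == {"only_a": ["it resets daily"], "only_b": ["it resets weekly"]}
```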