What every AI stock research comparison gets wrong

Every AI stock research comparison I read produces the same verdict: it depends on your needs, or use all four. That is an admission that the comparison measured the wrong thing and so found no signal.

The genre tests AI as an information retrieval system, then issues verdicts about AI as a research and reasoning tool. Those are not the same capability. A model can retrieve NVIDIA’s revenue correctly and still validate every thesis you bring it, still miss the hedge in an earnings transcript, still invent a covered-call premium table when given a setup with no chain data. All three failures can happen in the same session, on the same stock, inside the same cheerful paragraph of output. The comparison articles never test the last two, so the tools that fail them score the same as the tools that don’t.

What comparison articles typically do

The most common format is the one-prompt face-off. The author asks one question, usually about a well-known US name like NVIDIA, and posts the four responses side by side. The inevitable conclusion is “use all four”, because on a single question about a well-covered company, the tools are roughly equivalent. That observation is true, but it doesn’t get you to a verdict.

The more sophisticated end of the genre uses a benchmark. One March 2026 piece scored six tools on earnings data extraction across NVIDIA, Apple and Rocket Lab: Gemini at 87.5 percent; ChatGPT, Claude and NotebookLM tied at 81.3 percent; Grok at 79.2 percent; Perplexity at 77.1 percent. It is the most rigorous benchmark I have read in this space, and the numbers are real. But the test measured whether the AI could retrieve a revenue figure. That is the easiest task in research, and it doesn’t tell you which tool to open for the next decision you face.

The genre has a structural blind spot. None of the comparisons test the tasks on which the tools diverge.

Failure mode 1: Testing retrieval, issuing reasoning verdicts

The benchmark task in almost every comparison is “what was the most recent quarter’s revenue”. All four tools pass. The verdict written off the back of that result is some version of “they’re all equivalent for stock research”. That’s true for retrieval and nothing else.

The investor’s real job is interpreting what those numbers mean, challenging the thesis, reading what management committed to versus what they merely implied, and knowing when an AI is making something up. The benchmark tests none of those four. Retrieving numbers isn’t the real job.

When I ran my own comparison on the same four tools, the sharpest verdict came from reading Meta CFO Susan Li’s Q1 2026 earnings transcript. Only Claude flagged the word “underestimate” (Li said the company had continued to underestimate compute needs) as one-sided phrasing: it lets the listener hear an upward bias without management committing to spend more. That is the kind of finding a careful equity analyst notices on the third read of a transcript. Claude noticed it on the first. No existing comparison article runs a test that would catch this. They all test retrieval, and none of them tests language analysis.

Failure mode 2: The “use all four” non-verdict

Several comparison articles conclude that the answer is to run every query across every tool. Perplexity’s Model Council does exactly this: it runs your question across several frontier models at once and synthesises the answers. Some of the more sophisticated comparison pieces recommend the same move. The instinct is reasonable: if the tools are different, average them. But an average isn’t a verdict at all.

A useful comparison tells you where the tools diverge. Model Council tells you to ignore the divergence. Running a question across four tools is only worth the time when the answers split in a decision-relevant way, and the comparison articles can’t tell you when that is. They didn’t run the prompts on which it happens.

The fifth dimension in my own comparison was a covered-call setup with one explicit instruction: assume no live options chain, the real-time grid of option prices the AI would need to do the maths properly. Claude stayed clean. ChatGPT did the same, with a soft offer at the end to “approximate” the premiums if asked. Gemini invented a formatted table with specific premium ranges and an implied volatility figure (the market’s expectation of how much a stock will move) of around 75 percent, despite being told it had no chain data. Same prompt. Same day. (That fabrication is logged on the lessons page with the screenshot and verbatim text, alongside a sister Gemini fabrication on a separate options test.)

Told there was no options chain, did it stay in its lane?

Claude: yes.

ChatGPT: yes, with a soft offer to “approximate” the premiums if asked.

Gemini: no. It returned a formatted premium table and an implied volatility figure of around 75 percent.

Same covered-call prompt, same day, no chain data. Run 15 May 2026, Gemini Pro / ChatGPT / Claude.

A Model Council reader of that question would have received two reliable answers and one fabricated one, with nothing in the formatted output to tell them apart.

Failure mode 3: Backtesting AI stock picks

The sophisticated-looking methodology is the worst of the three. Hand an AI three years of data, ask it to pick stocks, run the resulting portfolio against the period that just ended, and calculate the Sharpe ratio, a single number for how much return you got per unit of risk taken. Multiple 2026 pieces use this method, including some from credentialled analysts. The methodology looks impressive because it produces numbers. The numbers measure whether AI can predict markets that already happened.
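For readers who have not met the Sharpe ratio, here is a minimal sketch of the standard calculation: mean excess return divided by the standard deviation of excess returns, then annualised. The function name, the monthly figures and the zero risk-free rate are illustrative assumptions, not data from any of the pieces discussed. The point is how little it takes to produce the number, which is part of why it looks like proof.

```python
import math
import statistics


def sharpe_ratio(period_returns, risk_free_per_period=0.0, periods_per_year=12):
    """Annualised Sharpe ratio from periodic returns (monthly by default)."""
    # Excess return is what the portfolio earned above the risk-free rate.
    excess = [r - risk_free_per_period for r in period_returns]
    mean_excess = statistics.mean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("zero volatility: the Sharpe ratio is undefined")
    # Scale the per-period ratio to a yearly figure.
    return (mean_excess / volatility) * math.sqrt(periods_per_year)


# Made-up monthly returns, for illustration only.
illustrative_monthly_returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
print(round(sharpe_ratio(illustrative_monthly_returns), 2))
```

Every input here is a return from a period that has already closed. Nothing in the calculation touches a decision that is still open.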

The retail use case is the inverse of the backtest. You’re asking the AI to help you act on a decision you face right now. Should I sell some into this earnings release? Should I buy more on the dip? Does my thesis still hold given the new disclosure? None of those questions can be tested by feeding the AI a closed dataset and seeing what it does. The closest honest test, the one I work with, is to run the prompt at the time of the decision and document the outcome, with the trigger written down before the call.

Backtested Sharpe ratios on AI-picked portfolios look like proof. They are a category error.

Failure mode 4: Testing only well-covered names

Every comparison piece I have read tests the same kind of company: NVIDIA, Apple, Microsoft, Meta. Big US names with deep analyst coverage, multiple transcript sources, decades of filings in training data. The verdicts generated on these names do not transfer to anything else.

The clearest counter-example on this site is the Perplexity test I ran on a smaller US-listed name. The AI read the company’s annual report, a filing that, like most US filings, states values “in thousands”. Revenue for the most recent year was reported as 6,095, meaning $6,095,000, or about $6.1 million. Perplexity returned “$6K” and then generated a confident narrative about revenue being “down 99.8% from the prior year”. A retail investor acting on that would have a materially false picture of the business. ChatGPT, Claude and Gemini all returned the correct figure on the same prompt. (Documented on the lessons page.)
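The arithmetic behind that error is a single multiplication. As a rough sketch, assuming a hypothetical lookup table of unit labels (the names `UNIT_MULTIPLIERS`, `to_dollars` and `format_dollars` are mine, not from any filing tool), the correct reading and the misreading differ only in which multiplier is applied:

```python
# Hypothetical mapping from a filing's stated units to a dollar multiplier.
UNIT_MULTIPLIERS = {"units": 1, "thousands": 1_000, "millions": 1_000_000}


def to_dollars(reported_value, stated_units):
    """Scale a figure as printed in a filing into whole dollars."""
    return reported_value * UNIT_MULTIPLIERS[stated_units]


def format_dollars(amount):
    """Short human-readable form, e.g. 6_095_000 -> '$6.1M'."""
    if amount >= 1_000_000:
        return f"${amount / 1_000_000:.1f}M"
    if amount >= 1_000:
        return f"${amount / 1_000:.1f}K"
    return f"${amount}"


correct = to_dollars(6_095, "thousands")  # the filing states "in thousands"
misread = to_dollars(6_095, "units")      # the unit line is ignored

print(format_dollars(correct))  # $6.1M
print(format_dollars(misread))  # $6.1K
print(correct // misread)       # 1000
```

A quick check to run on any AI-reported figure for a small company: find the “in thousands” or “in millions” line in the filing and redo this multiplication by hand.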

The same tool that handles AAPL or MSFT or NVDA without trouble can be off by a factor of a thousand on a thinly covered name. The 2026 benchmark cited above tested NVIDIA, Apple and Rocket Lab, three names with extensive analyst coverage. The benchmarks run on exactly the cases where the tools all look equivalent.

Failure mode 5: Ignoring the constraint-following task

This is the failure mode with the highest stakes for the kinds of decisions a retail investor makes with AI. None of the comparison pieces tests it.

The setup: a retail investor asks an AI tool about an options trade. They can’t give it a live chain because no consumer AI has access to one. The right answer is to reason with the data provided, name what is missing, and tell the user what to verify from their broker. A tool that invents premiums and implied volatility figures when explicitly told there is no chain data is dangerous in a way the output doesn’t advertise. The invented table has units. It has ranges. It looks like research that was retrieved from somewhere.

I re-ran this test on 22 May 2026: same “no live chain” instruction, fresh BMNR covered-call setup. The pattern persists in a softer form. Gemini correctly listed three data points needing a live chain (bid/ask spreads, precise delta, premium output), and then, in the same response, named a specific IV range (“75% to 90%”) and a specific delta range (“20-30 Delta for a 15% OTM 45-day strike”) as factual expectations. No chain, no source: just training-data approximations presented as analysis. The full strike-by-strike table fabrication is gone. The impulse to fill data gaps with specific numbers despite acknowledging the gap remains.
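One crude way to spot this pattern is to list every number in the response that was not in the prompt. The sketch below is an assumption-heavy illustration. The prompt text is hypothetical, the response string is stitched together from the two ranges quoted above, and the matching is plain text, so a figure written as “6,095” would be split in two. It flags candidates for a human to check against the broker. It does not prove fabrication.

```python
import re

# Matches integers and simple decimals, e.g. "75", "0.25".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(prompt, response):
    """Return numbers that appear in the response but not in the prompt."""
    supplied = set(NUMBER_PATTERN.findall(prompt))
    flagged = []
    for token in NUMBER_PATTERN.findall(response):
        # Keep first-seen order and skip repeats.
        if token not in supplied and token not in flagged:
            flagged.append(token)
    return flagged


# Hypothetical prompt wording; only the "no live chain" instruction is from the test.
prompt = (
    "Assume no live options chain. Covered call on 100 shares, "
    "45 days to expiry, strike 15% OTM."
)
# Stitched together from the two ranges quoted in the paragraph above.
response = "Expect IV of 75% to 90% and a 20-30 Delta for a 15% OTM 45-day strike."

print(ungrounded_numbers(prompt, response))  # ['75', '90', '20', '30']
```

The numbers that survive the filter are the ones the tool brought in itself. Whether they are reasonable is a separate question. The test is whether they were presented as data after the tool was told it had none.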

The same instinct reaches beyond options. When I asked Gemini to audit this website, it audited a different company’s site entirely: the same failure shape, with no chain data involved. Both fit the taxonomy set out in “nine types of AI hallucinations”: Mode 1, the invented premium, and Mode 3, the wrong-subject audit.

The point of naming this failure isn’t that Gemini is a bad tool. Gemini was the strongest performer on the calendar-awareness test in my own comparison: it flagged Apple’s WWDC as a near-term catalyst without being asked. It is useful for research on well-covered names, dangerous on anything that touches options data. No comparison article can tell you that, because none of them are written by someone who places covered-call trades.

What a better AI stock research comparison looks like

The fix is running the same prompts against each tool on the tasks that separate them (language analysis, constraint-following, thinly covered names, structured-prompt response) and naming a verdict per task. A more rigorous retrieval benchmark does not do that.

That is what my own comparison post does. Same prompts, same day, fresh conversations, verbatim outputs saved. The conclusion names a tool per task: Claude for the analytical work, Gemini for structured research on well-covered names (never for options), ChatGPT as the reliable middle, Perplexity for retrieval on well-covered names only. Per task, with the evidence cited. A reader who disagrees can re-run the prompt and check. The earnings-tool comparison applies the same pattern to a different question (which tool won on reading between the lines of a transcript) and lands a named winner on a task the retrieval benchmarks don’t run.
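The bookkeeping for a per-task verdict can be very small. The sketch below is one possible shape, not the format the comparison post uses. `RunRecord` and `tools_passing_by_task` are hypothetical names, and the records restate only two results already described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task: str      # the task type being tested
    tool: str      # which AI tool produced the output
    passed: bool   # did the tool do what the task required?
    evidence: str  # short pointer to the saved verbatim output


def tools_passing_by_task(records):
    """Group records by task and list the tools that passed each one."""
    result = {}
    for record in records:
        passing = result.setdefault(record.task, [])
        if record.passed:
            passing.append(record.tool)
    return result


records = [
    RunRecord("constraint-following", "Claude", True, "no invented premiums"),
    RunRecord("constraint-following", "ChatGPT", True, "soft offer to approximate"),
    RunRecord("constraint-following", "Gemini", False, "invented premium table"),
    RunRecord("thinly covered name", "Perplexity", False, "returned $6K"),
    RunRecord("thinly covered name", "ChatGPT", True, "correct figure"),
    RunRecord("thinly covered name", "Claude", True, "correct figure"),
    RunRecord("thinly covered name", "Gemini", True, "correct figure"),
]

for task, tools in tools_passing_by_task(records).items():
    print(f"{task}: {', '.join(tools)}")
```

Keeping the evidence next to each pass or fail is what lets a reader who disagrees re-run the prompt and check.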

What this means in practice

Three checks I run before I trust any comparison verdict in this space.

First: was the test run on the kind of stock you trade? A benchmark on NVIDIA and Apple is silent on what the tools do with smaller names or anything outside the well-covered universe.

Second: did the comparison include a test where the AI was explicitly told it didn’t have the data? If not, you are reading a benchmark of retrieval. Judgement never got tested. The most important thing to know about a research tool is whether it stays in its lane when it can’t see the answer.

Third: if the verdict is “use all four” or “it depends”, the comparison didn’t find a signal. That’s an admission, not a recommendation.

The short version

What worked: Testing by task type (language analysis, constraint-following, thinly covered names) produces verdicts you can act on. Naming winners per task tells you which tool to open first. Documenting failures alongside successes lets the reader check the evidence.

What didn’t: Every comparison that benchmarks retrieval on well-covered US names and issues a verdict about reasoning capability. Every backtested-portfolio test of AI stock picking. Every “use all four” conclusion. The methodology produces numbers; the verdict doesn’t fit the question.

Bottom line: The one comparison worth trusting is the one done on a task you’d run, on a stock you’d trade. If a benchmark extended its methodology to cover language analysis, constraint-following and options tasks, and still produced “use all four”, the argument here would need revising.

The comparison articles are accurately answering a question almost no retail investor is asking. The research isn’t the problem. Which tool retrieves the headline number fastest is not the decision you face. Which tool stays in its lane when it can’t see the answer, which one challenges your thesis, which one reads between the lines of a transcript: those are the decisions. The comparison worth trusting is the one that tests for them.