What every AI stock research comparison gets wrong

Most AI stock research comparison pieces test retrieval and issue verdicts about reasoning. Five failure modes — and what a comparison should measure instead.

// TL;DR
Problem: every AI stock research comparison ends with 'it depends' or 'use all four'.
Fix: stop testing retrieval and start testing the tasks that separate the tools (language analysis, constraint-following, thinly covered names).
Payoff: a comparison that names a winner per task and tells you which tool to open for which job.

Every AI stock research comparison I read produces the same verdict: it depends on your needs, or use all four. That isn’t a verdict. It’s an admission that the comparison didn’t find a signal, and the reason is that it measured the wrong thing.

The genre tests AI as an information retrieval system, then issues verdicts about AI as a research and reasoning tool. Those are not the same capability. A model can retrieve NVIDIA’s revenue correctly and still validate every thesis you bring it, still miss the hedge in an earnings transcript, still invent a covered-call premium table when given a setup with no chain data. All three failures can happen in the same session, on the same stock. The comparison articles never test the last two, so the tools that fail them score the same as the tools that don’t.


What comparison articles typically do

The most common format is the one-prompt face-off. The author asks one question — usually about a well-known US name like NVIDIA — and posts the four responses side by side. The inevitable conclusion is “use all four”, because on a single question about a well-covered company, the tools are roughly equivalent. True observation, useless verdict.

The more sophisticated end of the genre uses a benchmark. One April 2026 piece scored six tools on earnings data extraction across NVIDIA, Apple and Rocket Lab: Gemini at 87.5 percent; ChatGPT, Claude and NotebookLM tied at 81.3 percent; Grok at 79.2 percent; Perplexity at 77.1 percent. It is the most rigorous benchmark I have read in this space, and the numbers are real. But the test measured whether the AI could retrieve a revenue figure. That is the easiest task in research, not the task that decides which tool to open for the next decision you face.

The genre has a structural blind spot. None of the comparisons test the tasks on which the tools diverge.


Failure mode 1 — Testing retrieval, issuing reasoning verdicts

The benchmark task in almost every comparison is “what was the most recent quarter’s revenue”. All four tools pass. The verdict written off the back of that result is some version of “they’re all equivalent for stock research”. The first half of that sentence is true. The second half doesn’t follow.

The investor’s real job isn’t retrieving numbers. It’s interpreting what those numbers mean, challenging the thesis, reading what management committed to versus what they merely implied, and knowing when an AI is making something up. The AInvestingLab benchmark tests one of those four.

When I ran my own comparison on the same four tools, the sharpest verdict came from reading Meta CFO Susan Li’s remarks in the Q1 2026 transcript. Only Claude flagged the word “underestimate” (Li said the company had continued to underestimate compute needs) as one-sided phrasing. It lets the listener hear an upward bias without management committing to spend more. That is the kind of finding a careful equity analyst notices on the third read of a transcript; Claude noticed it on the first. No existing comparison article runs a test that would catch this, because none of them tests language analysis. They test retrieval.


Failure mode 2 — The “use all four” non-verdict

Several comparison articles conclude that the answer is to run every query across every tool. Perplexity calls this council mode; some of the more sophisticated comparison pieces recommend it. The instinct is reasonable: if the tools are different, average them. But averaging isn’t a verdict. It’s the absence of one.

A useful comparison tells you where the tools diverge. Council mode tells you to ignore the divergence. Running a question across four tools is only worth the time when the answers split in a decision-relevant way, and the comparison articles can’t tell you when that is — they didn’t run the prompts on which it happens.

The fifth dimension in my own comparison was a covered-call setup with one explicit instruction: assume no live options chain. Claude stayed clean. ChatGPT did the same, with a soft offer at the end to “approximate” the premiums if asked. Gemini invented a formatted table with specific premium ranges and an implied volatility figure of around 75 percent, despite being told it had no chain data. Same prompt. Same day. (That fabrication is logged on the lessons page with the screenshot and verbatim text, alongside a sister Gemini fabrication on a separate options test.)

A council-mode reader of that question would have received Claude’s and ChatGPT’s clean answers next to Gemini’s fabricated table, with nothing in the formatted output to tell them apart.


Failure mode 3 — Backtesting AI stock picks

Of the three formats in the genre (the one-prompt face-off, the retrieval benchmark and the backtest), the backtest looks the most sophisticated and is the worst. Hand an AI three years of data, ask it to pick stocks, run the resulting portfolio against the period that just ended, calculate the Sharpe ratio. Multiple 2026 pieces use this method, including from credentialled analysts. The methodology looks impressive because it produces numbers. The numbers measure whether AI can predict markets that already happened.
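
For reference, the ratio those pieces report is conventionally defined as below. This is the standard textbook form, not a formula taken from any of the articles, and the symbols are the usual ones rather than anything the authors specify.

```latex
% Standard Sharpe ratio (textbook definition, assumed here for illustration)
S = \frac{\mathbb{E}[R_p - R_f]}{\sigma_p}
% R_p      : portfolio return over the backtest window
% R_f      : risk-free rate over the same window
% \sigma_p : standard deviation of the excess return R_p - R_f
```

Every input is measured over a window that has already closed, so a high value describes the past period and nothing else.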

The retail use case is the inverse of the backtest. You aren’t asking the AI to pick names from history. You’re asking it to help you act on a decision you face right now. Should I sell some into this earnings release? Should I buy more on the dip? Does my thesis still hold given the new disclosure? None of those questions can be tested by feeding the AI a closed dataset and seeing what it does. The closest honest test, and the one I work with, is to run the prompt at the time of the decision and document the outcome, with the trigger written down before the call.
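
One way to keep that kind of record honest is sketched below. It is a minimal illustration, not the log used on this site: `DecisionRecord`, its fields and the placeholder values are all hypothetical. The only rules it enforces are the two named above: the trigger exists before the prompt is run, and the outcome is added later without overwriting anything.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DecisionRecord:
    """Hypothetical log entry for one AI-assisted decision."""
    ticker: str
    prompt: str
    trigger: str  # what would make you act, written before the call
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    outcome: Optional[str] = None
    outcome_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Refuse to create a record without a trigger written down first.
        if not self.trigger.strip():
            raise ValueError("write the trigger down before running the prompt")

    def record_outcome(self, outcome: str) -> None:
        # The outcome is appended once; it never replaces an earlier entry.
        if self.outcome is not None:
            raise ValueError("outcome already recorded")
        self.outcome = outcome
        self.outcome_at = datetime.now(timezone.utc)


# Placeholder values, for illustration only.
record = DecisionRecord(
    ticker="XYZ",
    prompt="Does my thesis still hold given the new disclosure?",
    trigger="placeholder: the condition that would make me sell",
)
record.record_outcome("placeholder: what actually happened")
```

The timestamps matter more than the fields: a trigger dated after the outcome is a rationalisation, not a test.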

Backtested Sharpe ratios on AI-picked portfolios look like proof. They are a category error.


Failure mode 4 — The well-covered-name blind spot

Every comparison piece I have read tests the same kind of company: NVIDIA, Apple, Microsoft, Meta. Big US names with deep analyst coverage, multiple transcript sources, decades of filings in training data. The verdicts generated on these names do not transfer to anything else.

The clearest counter-example on this site is the Perplexity test I ran on a smaller US-listed name. The AI read the company’s annual report, a filing that, like most US filings, states values “in thousands”. Revenue for the most recent year was printed as 6,095, meaning roughly $6.1 million. Perplexity returned “$6K” and then generated a confident narrative about revenue being “down 99.8% from the prior year”. A retail investor acting on that would have a materially false picture of the business. ChatGPT, Claude and Gemini all returned the correct figure on the same prompt. (Documented on the lessons page.)
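
The arithmetic behind that error is small enough to sketch. The snippet is an illustration of the unit scaling only, not a description of how any of the tools work internally; `scale_reported_value` and `UNIT_MULTIPLIERS` are hypothetical names, and the only figure taken from the test is the printed 6,095 stated in thousands.

```python
# Hypothetical helper: convert a figure as printed in a filing into dollars.
UNIT_MULTIPLIERS = {"units": 1, "thousands": 1_000, "millions": 1_000_000}


def scale_reported_value(reported: float, stated_in: str) -> float:
    """Return the dollar value of a figure printed in the filing's stated unit."""
    return reported * UNIT_MULTIPLIERS[stated_in]


reported_revenue = 6_095  # as printed in the annual report, "in thousands"

correct = scale_reported_value(reported_revenue, "thousands")  # unit header applied
misread = scale_reported_value(reported_revenue, "units")      # unit header ignored

print(f"correct: ${correct / 1e6:.1f}M")          # correct: $6.1M
print(f"misread: ${misread / 1e3:.0f}K")          # misread: $6K
print(f"error factor: {correct / misread:.0f}x")  # error factor: 1000x
```

Dropping the unit header is a one-line mistake, and everything computed downstream of it, including a year-on-year change, inherits the error.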

The same tool that handles AAPL or MSFT or NVDA without trouble can be off by a factor of a thousand on a thinly covered name. The 2026 benchmark cited above tested NVIDIA, Apple and Rocket Lab — three names with extensive analyst coverage. The benchmarks run on exactly the cases where the tools all look equivalent.


Failure mode 5 — Ignoring the constraint-following task

This is the failure mode with the highest stakes for the kinds of decisions a retail investor actually makes with AI. None of the comparison pieces tests it.

The setup: a retail investor asks an AI tool about an options trade. They can’t give it a live chain because no consumer AI has access to one. The right answer is to reason with the data provided, name what is missing, and tell the user what to verify from their broker. A tool that invents premiums and implied volatility figures when explicitly told there is no chain data is not just unhelpful — it is dangerous in a way the output doesn’t advertise. The invented table has units. It has ranges. It looks like research that was retrieved from somewhere.

I re-ran this test on 22 May 2026 — same “no live chain” instruction, fresh BMNR covered-call setup. The pattern persists in a softer form. Gemini correctly listed three data points needing a live chain (bid/ask spreads, precise delta, premium output) — and then, in the same response, named a specific IV range (“75% to 90%”) and a specific delta range (“20-30 Delta for a 15% OTM 45-day strike”) as factual expectations. No chain, no source — just training-data approximations presented as analysis. The full strike-by-strike table fabrication is gone. The impulse to fill data gaps with specific numbers despite acknowledging the gap is not.
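
A crude version of this check can be automated. The sketch below is a heuristic under stated assumptions, not the method used for the test above: `flag_unsourced_figures` and the term list are hypothetical, and the sample text is a made-up response that reuses the two ranges quoted above, not Gemini's verbatim output. It flags any sentence that pairs a chain-only term with a specific number.

```python
import re

# Assumed list of terms whose values can only come from a live options chain.
TERM_PATTERN = re.compile(
    r"\b(?:implied volatility|iv|delta|premiums?)\b", re.IGNORECASE
)


def flag_unsourced_figures(response: str) -> list[str]:
    """Return sentences that pair a chain-only term with a specific number."""
    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        if TERM_PATTERN.search(sentence) and re.search(r"\d", sentence):
            flagged.append(sentence)
    return flagged


# Made-up sample for illustration; not a verbatim model response.
sample = (
    "I cannot see a live chain, so check bid/ask spreads with your broker. "
    "Expect IV in the 75% to 90% range. "
    "A 15% OTM 45-day strike usually sits around 20-30 Delta."
)

for sentence in flag_unsourced_figures(sample):
    print("CHECK:", sentence)
```

On the sample it flags the second and third sentences and leaves the first alone. It will also flag numbers the user supplied in the prompt, so treat a hit as a reason to read the sentence, not as proof of fabrication.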

The point of naming this failure isn’t that Gemini is a bad tool. Gemini was the strongest performer on the calendar-awareness test in my own comparison — it flagged Apple’s WWDC as a near-term catalyst without being asked. It is useful for research on well-covered names, dangerous on anything that touches options data. No comparison article can tell you that, because none of them are written by someone who places covered-call trades.


What a better AI stock research comparison looks like

The fix isn’t a more rigorous benchmark. It’s running the same prompts against each tool on the tasks that separate them — language analysis, constraint-following, thinly covered names, structured-prompt response — and naming a verdict per task, not averaged across them.

That is what my own comparison post does. Same prompts, same day, fresh conversations, verbatim outputs saved. The conclusion isn’t “use all four”. It is: Claude for the analytical work, Gemini for structured research on well-covered names (never for options), ChatGPT as the reliable middle, Perplexity for retrieval on well-covered names only. Per task, with the evidence cited. A reader who disagrees can re-run the prompt and check. The earnings-tool comparison applies the same pattern to a different question (which tool won on reading between the lines of a transcript) and lands on a named winner on a task the retrieval benchmarks don’t run.
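
Written down as data, a per-task verdict is a lookup rather than an average. The sketch below encodes the four verdicts from the paragraph above; the `Verdict` class, its field names and `excluded_tools` are hypothetical, chosen only to show the shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """Hypothetical record: one tool, its role, and any task it is ruled out for."""
    tool: str
    role: str
    never_for: Optional[str] = None


# Roles and the one exclusion are taken from the verdicts stated above.
VERDICTS = (
    Verdict("Claude", "analytical work"),
    Verdict("Gemini", "structured research on well-covered names", never_for="options"),
    Verdict("ChatGPT", "the reliable middle"),
    Verdict("Perplexity", "retrieval on well-covered names only"),
)


def excluded_tools(task: str) -> list[str]:
    """Tools whose verdict rules them out for the given task description."""
    task = task.lower()
    return [v.tool for v in VERDICTS if v.never_for and v.never_for in task]


print(excluded_tools("covered-call options setup"))  # ['Gemini']
```

An averaged score has nowhere to store the `never_for` field, which is the part of the verdict that prevents a bad trade.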


What this means in practice

Three checks I run before I trust any comparison verdict in this space.

First: was the test run on the kind of stock you actually trade? A benchmark on NVIDIA and Apple is silent on what the tools do with smaller names or anything outside the well-covered universe.

Second: did the comparison include a test where the AI was explicitly told it didn’t have the data? If not, you are reading a benchmark of retrieval, not of judgement. The most important thing to know about a research tool is whether it stays in its lane when it can’t see the answer.

Third: if the verdict is "use all four" or "it depends", the comparison didn't find a signal. That's an admission, not a recommendation.


// Field Report

What worked

Testing by task type — language analysis, constraint-following, thinly covered names — produces verdicts you can act on. Naming winners per task tells you which tool to open first. Documenting failures alongside successes lets the reader check the evidence.

What didn't

Every comparison that benchmarks retrieval on well-covered US names and issues a verdict about reasoning capability. Every backtested-portfolio test of AI stock picking. Every "use all four" conclusion. The methodology produces numbers; the verdict doesn't fit the question.

Bottom line

The one comparison worth trusting is the one done on a task you'd run, on a stock you'd trade. If a benchmark extended its methodology to cover language analysis, constraint-following and options tasks, and still produced "use all four", the argument here would need revising.


The comparison articles aren’t badly researched. They are accurately answering a question almost no retail investor is asking. Which tool retrieves the headline number fastest is not the decision you face. Which tool stays in its lane when you can’t see the answer, which one challenges your thesis, which one reads between the lines of a transcript — those are the decisions. The comparison worth trusting is the one that tests for them.

// Written by Ben Dixon

Ben runs AI on real investing decisions and documents what each model got wrong, what each one caught, and the prompts that survived the cuts.
