Lab Report

I gave the same investment question to four AI tools. The results were instructive.

Not a ranking. A controlled test on a real question, with notes on what each tool actually produced and why most of it was useless.


The setup was simple. I had a real question about a real position — a UK small-cap I’d been watching, at a point where I was genuinely undecided about adding to it. Not a toy example. Not a company I’d chosen because it would make the AI look clever. An actual live decision.

I put the same question to four different tools: ChatGPT (GPT-4o), Claude (Sonnet), Perplexity Finance, and Gemini Advanced. The question was framed the same way each time:

“I’m reviewing a position in [company]. The shares have fallen 18% from my entry price over six months despite no material change in the underlying business. I’m considering adding to the position. What are the most important factors I should be weighing, and what would you need to see to be confident this is a good use of capital rather than throwing good money after bad?”

I want to be clear about what I was testing. Not which tool knew the most facts about the company. Not which one produced the longest output. I wanted to see which tool asked useful questions back, surfaced things I might have missed, and — critically — pushed back on my framing rather than just validating the premise that adding was worth considering.


What I got

ChatGPT produced a well-structured response with five numbered considerations, written in the kind of confident, measured tone that reads like a Bloomberg terminal explainer. It covered valuation, management track record, catalysts, and position sizing. It was not wrong. It was also almost entirely things I already knew. The framing — “here are factors to consider” — answered a question I hadn’t asked. I asked what I should be weighing given my specific situation. It told me what anyone reviewing any stock should think about. The difference matters.

Claude did something different. It started by questioning the framing of the question itself. Before offering any considerations, it noted that “fallen 18% from my entry price” tells you nothing about whether the shares are cheap — it tells you about your entry price, which the market doesn’t care about. It then asked (in its output) whether the original thesis had been built on assumptions that were now easier or harder to test than they were six months ago. That reframe was genuinely useful, and it wasn’t something I’d consciously articulated.

Perplexity Finance retrieved recent news and filings and presented them. The sourcing was the feature — it showed its work in a way the others didn’t, with links to actual regulatory filings and recent trading updates. But it didn’t synthesise the information into anything that would help me make a decision. It produced a briefing, not an analysis. For due diligence on facts, excellent. For thinking about whether to add, not what I needed.

Gemini was the most confidently wrong. It asserted that the fall from my entry price represented “a potential rebalancing opportunity in an environment where quality small-cap businesses remain undervalued.” This sentence contains no information. It’s a plausible-sounding construction assembled from financial language that means nothing unless attached to actual evidence. It also assumed I was correct that there had been no material change in the underlying business — a claim I’d made in the prompt and that a good analyst should have challenged, not accepted.


What this tells you

The useful tool for any given task depends entirely on what you’re actually trying to do.

For fact retrieval and source verification: Perplexity, without hesitation. It shows its sources and doesn’t dress up retrieved information as original analysis.

For reframing the question: Claude, in my test. This is the thing I find most valuable and hardest to replicate — a tool that notices your framing is the problem before it starts answering.

For structured checklists when you need coverage: ChatGPT. It’s reliable for “what should I be thinking about” in the general sense, just not useful for “what should I be thinking about given this specific situation.”

For fluent summaries that sound like analysis but aren’t: all of them, under the wrong conditions.


The honest conclusion

None of them is a substitute for thinking. This is obvious to say and apparently easy to forget, given how many people seem to be using these tools as oracles rather than thinking aids.

The specific failure mode I worry about is the one Gemini demonstrated: confident output that validates your existing position without surfacing anything that would make you update it. If you’re already inclined to add to the position, a tool that produces plausible-sounding support for doing so is actively harmful. It gives you the feeling of having done due diligence without any of the substance.

The most useful thing any of these tools did in my test was to slow the question down — to introduce friction between “I’m thinking about doing X” and “I’ve thought carefully about whether I should do X.” That friction is the value. The content of the output is secondary.

The test I use now: if the output would equally support the opposite decision, it’s noise. Good analysis should make the alternative harder to justify, not easier.
