Decision Process

Five things AI is reliably bad at for investing

The failure modes I've hit often enough to warn you about. Each one came from a real position, not a hypothetical.

// Key takeaway

Know these failure modes before you act on AI output. The Prompt Stack is built to limit them — but only if you're watching for the right pressure points.


The failure modes don’t appear in the demonstrations. They appear the first time you’re about to act on something the model told you, and you’ve already stopped double-checking because the output has been reliable for weeks.

I’ve hit all five, each in a distinct way, across real positions and real decisions. They’re now on my explicit checklist — the things I verify before I act, not after. What follows is a description of each one, written in enough detail that I’ll remember it the next time.


1. Numbers produced under pressure

A model will give you a number. It will give it to you in the same confident register it gives you everything else. The number can be wrong.

This isn’t random error. It’s directional: when the actual figure isn’t cleanly available in its training data, the model produces a plausible one derived from adjacent information. It calculates from what it has. The result often looks right at a glance and sits in the right ballpark — which is exactly what makes it dangerous.

The specific case that taught me this: I was reviewing a small UK company and asked the model to sanity-check my valuation. It came back with a trailing earnings multiple that was roughly where I’d expected it. I happened to have the accounts open for a different reason, checked the actual earnings figure, and found the model had divided by the wrong denominator. The multiple was off by about a third.

I would have used that figure. I would have cited it as a check. I would have been wrong.
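For scale, with hypothetical figures rather than the company's actual numbers, here is how using the wrong denominator can shift a multiple by a third while both results still look plausible:

```python
# Hypothetical illustration only: a trailing earnings multiple computed
# against the wrong denominator still lands in a believable range.
market_cap = 60_000_000          # e.g. a £60m market cap (made-up figure)
trailing_earnings = 5_000_000    # the actual trailing figure from the accounts
adjacent_earnings = 7_500_000    # a nearby figure the model might use instead

correct_multiple = market_cap / trailing_earnings   # 12.0x
wrong_multiple = market_cap / adjacent_earnings      # 8.0x, about a third lower

# Both numbers look reasonable on their own; only checking the accounts
# reveals which denominator was actually used.
print(f"correct: {correct_multiple:.1f}x, model's: {wrong_multiple:.1f}x")
```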

The rule I now apply: any specific number the model produces — a price, a multiple, a growth rate, a margin, a date — gets verified against a primary source before it goes into anything I’m acting on. The model is allowed to help me know where to look. It is not allowed to be the source.


2. Recency, or the confident account of something that has since changed

Most models have a knowledge cutoff anywhere from several months to more than a year in the past. They will not volunteer this fact when they describe a company’s revenue trajectory or recent filing history. They will describe what they know, in the present tense, as if it remains current.

I asked a model to review a company I was researching. It gave me a thorough, well-structured account of the business — revenue growth, margin profile, the competitive dynamic in the relevant segment. Everything was framed as a current state of affairs. I spent twenty minutes reading the analysis and found it useful.

The company had issued a profit warning two months earlier. The model had no idea. None of what it told me about the revenue trajectory was still true.

The fix is not to stop using models for company analysis. The fix is to know that any claim about recent performance, current market position, or “the latest” anything requires independent verification. Models with live retrieval help, but retrieval can fail silently — the search returns nothing, the model fills in from training data, and you wouldn’t know unless you checked the date on the sources.

Before acting on anything that is supposed to be current, check the date of the information.


3. Long filings

Annual reports commonly run to 80 pages. Some run to 200. Asking a model to “summarise this 10-K” produces a plausible summary. The summary will cover the headline numbers, the management commentary, the stated strategy. It will not reliably cover Note 23.

Note 23, or its equivalent, is where the things that matter are sometimes buried. Related-party transactions. Material litigation. The accounting treatment for a revenue recognition change. These items are in the filing. They are not well-represented in the executive summary that the model is, in practice, heavily weighting.

I have twice had a model summarise a filing competently, and twice gone on to read the filing myself and found something the summary missed. Both times, the missed item was in the notes rather than the body. Both times, it was relevant to the thesis.

The workflow adjustment: use the model to help you read the filing, not to read it for you. Tell it which sections to focus on. Ask it to flag specific things — unusual accounting, related-party references, any language around material risks or contingent liabilities. Then read those sections yourself. The model should be the index, not the reader.


4. Anything time-sensitive

The Prompt Stack takes time. The Role needs to be set carefully. The Filter stage requires the model to actually enumerate its sources and inferences. The Risk questions need real answers. Done properly, you’re looking at a 20-to-40 minute workflow on a meaningful position.

Markets move in minutes.

I tried to run the stack on a regulatory announcement that dropped on one of my holdings at market open. By the time I had a Verdict, the stock had moved 11% in forty minutes. The model’s view on whether I should act was now entirely academic. The decision had already been made by the market.

The stack is a tool for considered decisions made before, not during, a moving market. It is for deciding whether to buy, whether to trim, whether to set a tripwire. It is not for reacting to breaking news in real time. If you try to use it that way, you’ll be late every single time, and the lateness will feel like a methodological failure when it’s actually a category error.

Anything time-sensitive gets a different process: a much shorter prompt, a much faster read, and an acceptance that the quality of the output is lower. That’s a separate skill, not an upgrade to this one.


5. Confirmation bias by default

The last failure mode is the one I’d noticed earliest but took longest to build a systematic defence against.

Ask a model whether a position is worth holding. It will tell you yes, probably, with some balanced caveats. Ask it to argue against the position. You’ll get three specific issues back that weren’t in the first response.

I tested this explicitly: same model, same position context, same session. First prompt was open — “review this position.” Second prompt was adversarial — “argue against this.” The outputs were materially different in ways that mattered. The adversarial prompt surfaced a customer concentration risk and a valuation assumption that the open-ended review had glossed over as a “risk to monitor.”

The model had the same information both times. The framing changed what it surfaced.

This is why the Role stage in the Prompt Stack is not optional. Setting the stance before asking the question is the only reliable way to get past the model’s default tendency to produce agreeable output. And the times when it matters most are exactly the times you’re least likely to do it — when you’re already convinced the position is good and you’re just looking for a final check.

The model will be as bullish as you are, unless you explicitly tell it not to be.

If you’re feeling confident about something, run the adversarial framing. Not as a procedural step — as the specific antidote to the specific bias you’re carrying into the conversation.


What these five have in common

They’re all failure modes that appear when the model is working correctly. The model isn’t malfunctioning when it produces a confident wrong number or summarises a filing without reading the footnotes. It’s doing what it was designed to do, applied to a use case where that design has limits.

Knowing the limits in advance is what lets you compensate for them in the workflow. The checklist at the end of every Prompt Stack run has a question that used to be implicit and is now explicit: which of the five failure modes is most likely to have affected this output?

It’s a good question to ask before you act.

// Verdict

What worked

Naming the failure modes explicitly, rather than treating them as general AI caution. Each one is a different kind of error that requires a different kind of defence.

What didn't

Long filings remain the hardest. The model's summary is convincing enough that I still occasionally act on it without reading the notes myself. This is a habit failure, not a workflow failure.

Confidence

High — these failure modes are consistent across models and tools, not quirks of any particular system.

// Keep reading


Decision Process

Why I now split every serious AI question into stages before I trust the answer

How I use a four-stage prompting method — Role, Filter, Risk, Verdict — to get useful AI analysis instead of confident-sounding noise.

Lab Report

I asked AI to find the downside in a position I was already long. It found things I'd missed.

An adversarial prompt against my own portfolio. The model surfaced three risks I'd glossed over, and one I'd actively been wrong about.

Lab Report

I gave the same investment question to four AI tools. The results were instructive.

Not a ranking. A controlled test on a real question, with notes on what each tool actually produced and why most of it was useless.
