Exploring the Limits of Large Language Models as Quant Traders

38•rzk•1h ago

Comments

kqr•1h ago

Super interesting! You can click the "live" link in the header to see how they performed over time. The (geometric) average result at the end seems to be that the LLMs are down 35 % from their initial capital – and they got there in just 96 model-days. That's a daily return of about -0.45 %, or a yearly return of -81 %, i.e. practically wiping out the starting capital.
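
A quick way to check the compounding arithmetic is sketched below. It assumes the 35 % loss compounds evenly, that one model-day counts as one calendar day, and that a year has 365 trading days (crypto markets do not close); none of these assumptions are stated in the thread, and the function names are made up for illustration.

```python
def daily_return(total_return: float, days: int) -> float:
    """Constant per-day return that compounds to total_return over `days` days."""
    return (1 + total_return) ** (1 / days) - 1


def annualized(daily: float, days_per_year: int = 365) -> float:
    """Compound a constant daily return over one year (365 days is an assumption)."""
    return (1 + daily) ** days_per_year - 1


if __name__ == "__main__":
    d = daily_return(-0.35, 96)  # down 35 % after 96 model-days
    print(f"daily:  {d:.2%}")
    print(f"yearly: {annualized(d):.2%}")
```

Under these assumptions the result is roughly -0.45 % per day and about -81 % per year. Using 252 trading days instead of 365 would give a smaller annual loss.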

Although I lack the maths to determine it numerically (depends on volatility etc.), it looks to me as though all six are overbetting and would be ruined in the long run. It would have been interesting to compare against a constant fraction portfolio that maintains 1/6 in each asset, as closely as possible while optimising for fees.
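
For reference, a constant-fraction baseline like the one described above could be sketched as follows. This is only an illustration: the asset names, the `tolerance` no-trade band (a crude stand-in for "optimising for fees"), and the function name are all hypothetical and not taken from the article.

```python
def rebalance_trades(values: dict[str, float], tolerance: float = 0.02) -> dict[str, float]:
    """Trades (in currency units) that restore equal weights across all assets.

    values: current market value held in each asset.
    tolerance: hypothetical no-trade band; if no asset's weight has drifted
        from the equal-weight target by more than this, skip rebalancing
        to avoid paying fees.
    Returns {} when no rebalance is needed; otherwise a positive entry
    means buy and a negative entry means sell.
    """
    total = sum(values.values())
    target = total / len(values)  # equal share, e.g. 1/6 of the portfolio each
    max_drift = max(abs(v - target) / total for v in values.values())
    if max_drift <= tolerance:
        return {}
    return {asset: target - v for asset, v in values.items()}


if __name__ == "__main__":
    # Six placeholder assets; the first has rallied and is now overweight.
    portfolio = {"A1": 160.0, "A2": 100.0, "A3": 100.0,
                 "A4": 100.0, "A5": 100.0, "A6": 100.0}
    print(rebalance_trades(portfolio))
```

The trades sum to zero, so the rebalance is self-financing before fees. A real comparison would also need to model fees, slippage, and leverage, which this sketch ignores.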

> difficulty executing against self-authored plans as state evolves

This is indeed also what I've found trying to make LLMs play text adventures. Even when given a fair bit of help in the prompt, they lose track of the overall goal and find some niche corner to explore very patiently, but ultimately fruitlessly.

XenophileJKO•59m ago

I don't think betting on crypto is really playing to the strengths of the models. I think giving news feeds and setting it on some section of the S&P 500 would be a better evaluation.

jwpapi•57m ago

Isn’t that what Renaissance Technologies does?

ezekiel68•55m ago

You don't actually need nanosecond latency to trade effectively in futures markets but it does help to be able to evaluate and make decisions in the single-digit milliseconds range. Almost no generative model is able to perform inference at this latency threshold.

A threshold in the single-digit milliseconds range allows the rapid detection of price reversals (signaling the need to exit a position with least loss) in even the most liquid of real futures contracts (not counting rare "flash crash" events).

vita7777777•20m ago

This is true for some classes of strategies. At the same time there are strategies that can be profitable on longer timeframes. The two worlds are not mutually exclusive.

rob_c•10m ago

Yes, but LLMs can barely cope with following the ordering of complex software tutorials linearly. Why would you reasonably expect them, unprompted, to understand time well enough to trade and turn a profit?

graemep•13m ago

From the article:

> The models engage in mid-to-low frequency trading (MLFT) trading, where decisions are spaced by minutes to a few hours, not microseconds. In stark contrast to high-frequency trading, MLFT gets us closer to the question we care about: can a model make good choices with a reasonable amount of time and information?

bluecalm•51m ago

>>LLMs are achieving technical mastery in problem-solving domains on the order of Chess and Go, solving algorithmic puzzles and math proofs competitively in contests such as the ICPC and IMO.

I don't think LLMs are anywhere close to "mastery" in chess or Go. Maybe a nitpick, but the point is that a NN created to be good at trading is likely to outperform LLMs at this task the same way NNs created specifically to be good at board games vastly outperform LLMs at those games.

lukan•14m ago

"Maybe a nitpick but the point is that a NN created to be good at trading is likely to outperform LLMs at this task the same way NNs created specifically to be good at board games vastly outperform LLMs at those games."

Disagree. Go and chess are games with very limited rules. Successful trading, on the other hand, is not so much an arbitrary numbers game, but involves analyzing events in the news happening right now. Agentic LLMs that do this and buy and sell accordingly might succeed here.

(Not what they did here, though

"For the first season, they are not given news or access to the leading “narratives” of the market.")

Havoc•34m ago

Are language models really the best choice for this?

Seems to me that the outcome would be near random because they are so poorly suited. Which might manifest as

> We also found that the models were highly sensitive to seemingly trivial prompt changes

baq•31m ago

they're tools. treat them as tools.

since they're so general, you need to explore if and how you can use them in your domain. guessing 'they're poorly suited' is just that, guessing. in particular:

> We also found that the models were highly sensitive to seemingly trivial prompt changes

this is pretty much obvious to anyone who has seriously looked at deploying these; that's why there are some very successful startups in the evals space.

rob_c•12m ago

> guessing 'they're poorly suited' is just that, guessing

I have a really nice bridge to sell you...

This "failure" is just a grab at trying to look "cool" and "innovative", I'd bet. Anyone with a modicum of understanding of the tooling (or hell, experience: they've been around for a few years now, enough for people to build a feeling for this) knows that this is not a task for a pre-trained general LLM.

reedf1•29m ago

you simply will lose trading directly with an llm. mapping the dislocation by estimating the percentage of llm trading bots is useful though.

vita7777777•21m ago

This is very thoughtful and interesting. It's worth noting that this is just a start and in future iterations they're planning to give the LLMs much more to work with (e.g. news feeds). It's somewhat predictable that LLMs did poorly with quantitative data only (prices) but I'm very curious to see how they perform once they can read the news and Twitter sentiment.

rob_c•15m ago

Not only can I guarantee the models are bad with numbers, but unless it's a highly tuned and modified version they're too slow for this arena. Stick to using attention transformers in better model designs, which have much lower latencies than pre-trained LLMs...

Lapsa•9m ago

I would argue that sentiment classification is where LLMs perform best. Folks are already using them for precisely this purpose, and have even built a public index out of it.

callamdelaney•20m ago

The limits of LLMs for systematic trading were and are extremely obvious to anybody with a basic understanding of either field. You may as well be flipping a coin.

rob_c•14m ago

At least a coin is faster and more reliable.

aswegs8•17m ago

Given that LLMs can't even finish Pokemon Red, how would you expect them to be able to trade futures?

wild_pointer•2m ago

Hey! That wasn't easy!

