Stopped reading after “paper money”
Source: quant trader. Paper trading does not incorporate market impact.
Market impact shouldn’t be considered when you’re talking about trading S&P stocks with $100k.
Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.
I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.
If the tools available were normalized, I'd expect a tighter distribution overall but grok would still land on top. Regardless of the rather public gaffes, we're going to see grok pull further ahead because they inherently have a 10-15% advantage in capabilities research per dollar spent.
OpenAI and Anthropic and Google are all diffusing their resources on corporate safetyism while xAI is not. That advantage, all else being equal, is compounding, and I hope at some point it inspires the other labs to give up the moralizing politically correct self-righteous "we know better" and just focus on good AI.
I would love to see a frontier lab swarm approach, though. It'd also be interesting to do multi-agent collaborations that weight source inputs based on past performance, or use some sort of orchestration algorithm that lets the group exploit the strengths of each individual model. Having 20 instances of each frontier model in a self-evolving swarm, doing some sort of custom system prompt revision with a genetic algorithm style process, so that over time you get 20 distinct individual modes and roles per each model.
It'll be neat to see the next couple years play out - OpenAI had the clear lead up through q2 this year, I'd say, but Gemini, Grok, and Claude have clearly caught up, and the Chinese models are just a smidge behind. We live in wonderfully interesting times.
It would almost be more interesting to specifically train the model on half the available market data, then test it on another half. But here it’s like they added a big free loot box to the game and then said “oh wow the player found really good gear that is better than the rest!”
Edit: from what I casually remember, a hedge fund can beat the market for 2-4 years, but at 10 years and up their chances of beating the market go to very close to zero. Since LLMs have not been around for that long, it is going to be difficult to test this without somehow segmenting the data.
Yes, ideally you’d have a model trained only on data up to some date, say January 1, 2010, and then start running the agents in a simulation where you give them each day’s new data (news, stock prices, etc.) one day at a time.
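A rough sketch of the "one day at a time" part, assuming you already have a model whose training data stops at the cutoff (that is the hard half, and this does nothing to solve it). `walk_forward` and the record layout are made up for illustration; the only point is that the agent on a given day never sees a record dated after that day.

```python
from datetime import date


def walk_forward(records, cutoff):
    """Yield (day, visible) for each day after `cutoff` that has data.

    `records` is a list of (date, payload) tuples, e.g. prices or headlines.
    `visible` contains only records dated on or before `day`, so the agent
    cannot peek at the future. Hypothetical helper, not from the article.
    """
    ordered = sorted(records, key=lambda r: r[0])
    days = sorted({d for d, _ in ordered if d > cutoff})
    for day in days:
        visible = [(d, p) for d, p in ordered if d <= day]
        yield day, visible


# Toy data: one record before the cutoff, two after.
records = [
    (date(2009, 12, 30), "pre-cutoff close"),
    (date(2010, 1, 4), "first simulated day"),
    (date(2010, 1, 5), "second simulated day"),
]

for day, visible in walk_forward(records, cutoff=date(2010, 1, 1)):
    # A real harness would call the agent here with `visible` only.
    print(day, len(visible), "records visible")
```

None of this helps if the model itself was trained on data from after the cutoff, which is the objection raised elsewhere in the thread.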
It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming
> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.
If the AI bubble had popped in that window, Gemini would have ended up the leader instead.
“Tech line go up forever” is not a viable model of the economy; you need an explanation of why it’s going up now, and why it might go down in the future. And also models of many other industries, to understand when and why to invest elsewhere.
And if your bets pay off in the short term, that doesn’t necessarily mean your model is right. You could have chosen the right stocks for the wrong reasons! Past performance doesn’t guarantee future performance.
People say it's not equivalent to actually trading though, and you shouldn't use it as a predictor of your actual trading performance, because you have a very different risk tolerance when risking your actual money.
I'm more surprised that Gemini managed to lose 10%. I wish they had actually mentioned what the models invested in and why.
We need to know the risk adjusted return, not just the return.
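For anyone wondering what that looks like in practice, here is a minimal sketch using the Sharpe ratio, which is one common risk-adjusted measure (the article does not say which, if any, it used). The return series are invented, and the 252 trading days and zero risk-free rate are assumptions.

```python
import math
import statistics


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return divided by its volatility."""
    excess = [r - risk_free_daily for r in daily_returns]
    stdev = statistics.stdev(excess)
    if stdev == 0:
        raise ValueError("returns have zero variance")
    return statistics.mean(excess) / stdev * math.sqrt(periods_per_year)


# Two made-up series with the same average daily return (0.15%)
# but very different volatility.
steady = [0.001, 0.002, 0.001, 0.002]
wild = [0.05, -0.047, 0.06, -0.057]

print("steady:", sharpe_ratio(steady))
print("wild:  ", sharpe_ratio(wild))
```

Both series end up with the same average return, but the steady one scores far higher once volatility is in the denominator, which is exactly the information a raw return leaderboard hides.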
This is a really dumb measurement.
> You are a stock trading agent. Your goal is to maximize returns.
> You can research any publicly available information and make trades once per day.
> You cannot trade options.
> Analyze the market and provide your trading decisions with reasoning.
>
> Always research and corroborate facts whenever possible.
> Always use the web search tool to identify information on all facts and hypotheses.
> Always use the stock information tools to get current or past stock information.
>
> Trading parameters:
> - Can hold 5-15 positions
> - Minimum position size: $5,000
> - Maximum position size: $25,000
>
> Explain your strategy and today's trades.
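To make the trading parameters in that prompt concrete, here is a small sketch of a check a harness could run on each model's portfolio. The limits come straight from the prompt; `check_portfolio`, the tickers, and the dollar values are hypothetical, and nothing here claims the experiment actually enforced the limits this way.

```python
# Limits quoted in the prompt above.
MIN_POSITIONS, MAX_POSITIONS = 5, 15
MIN_SIZE, MAX_SIZE = 5_000, 25_000


def check_portfolio(positions):
    """Return a list of rule violations for a {ticker: dollar_value} dict."""
    problems = []
    if not MIN_POSITIONS <= len(positions) <= MAX_POSITIONS:
        problems.append(
            f"{len(positions)} positions (allowed {MIN_POSITIONS}-{MAX_POSITIONS})"
        )
    for ticker, value in positions.items():
        if value < MIN_SIZE:
            problems.append(f"{ticker}: ${value:,} is below the ${MIN_SIZE:,} minimum")
        elif value > MAX_SIZE:
            problems.append(f"{ticker}: ${value:,} is above the ${MAX_SIZE:,} maximum")
    return problems


# Made-up portfolio: five positions, one of them oversized.
portfolio = {"AAA": 25_000, "BBB": 20_000, "CCC": 10_000, "DDD": 5_000, "EEE": 40_000}
for problem in check_portfolio(portfolio):
    print(problem)
```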
Given the parameters, this definitely is NOT representative of any actual performance.
I recommend also looking at the trade history and reasoning for each trade for each model, it's just complete wind.
As an example, Deepseek made only 21 trades, which were all buys, and all of them were because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.
Think? What exactly did “it” think about?
I think you mean "DeepSeek came in a close second".
Grok is constantly training and/or it has access to websearch internally.
You cannot backtest LLMs. You can only "live" test them going forward.
That has been the best way to get returns.
I set up a 212 account when I was looking to buy our first house. I bought in small chunks, in industries I was comfortable with and knowledgeable about. Over the years I built up a nice portfolio.
Anyway, long story short. I forgot about the account, we moved in, got a dog, had children.
And then I logged in for the first time in ages and, to my shock, my returns were at 110%. I'd done nothing. It's bizarre and perplexing.
Also N=1
Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...
100 independent runs on each model over 10 very different market behavior time intervals would produce meaningful results. Like actually credible, meaningful means and standard deviations.
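A quick sketch of what that aggregation could look like. The "returns" below are random placeholder numbers from a seeded generator, not results from any model; the 10% spread is an assumption picked only to show how little a single run tells you.

```python
import random
import statistics


def summarize(returns):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(returns)
    stdev = statistics.stdev(returns)
    stderr = stdev / len(returns) ** 0.5
    return mean, stdev, stderr


# Placeholder data: 100 simulated run outcomes with zero true edge
# and a 10% standard deviation. NOT real experiment output.
rng = random.Random(0)
runs = [rng.gauss(0.0, 0.10) for _ in range(100)]

mean, stdev, stderr = summarize(runs)
print(f"single run (N=1): {runs[0]:+.1%}")
print(f"100 runs: mean {mean:+.1%}, stdev {stdev:.1%}, stderr {stderr:.1%}")
```

With one run you only ever see the first line, and you cannot tell a lucky draw from a real edge. The standard error shrinks with the square root of the number of runs, which is why repeating the experiment matters.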
This experiment, as is, is a very expensive unbalanced uncharacterizable random number generator.
[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]
Deepseek did not sell anything, but did well with holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently so not surprising that it performed well. Seems like they only get to "trade" once per day, near the market close, so it's not really a real time ingesting of data and making decisions based on that.
What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.
The article is very very vague on their methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop continuously? Did it have access to indicators (such as RSI)? Could it do arbitrary calculations with raw data? Etc...
I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.
I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.
What you ask the model to do is super important. Just like writing or coding, the default "behavior" is likely to be "average". You need to be very careful about what you are asking for.
For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).
I have it looking for stocks trading below intrinsic value, with some caveats because I know it likes to hinge on binary events like drug trial results. I also have it look at the correlation between the positions and make sure they don't share the same macro vulnerability.
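The correlation check is easy to do outside the model too. A minimal sketch, assuming you have a list of periodic returns per position; the tickers and numbers are invented, and the 0.8 threshold is an arbitrary choice.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def correlated_pairs(returns_by_ticker, threshold=0.8):
    """List (ticker_a, ticker_b, rho) for pairs at or above the threshold."""
    flagged = []
    for a, b in combinations(sorted(returns_by_ticker), 2):
        rho = pearson(returns_by_ticker[a], returns_by_ticker[b])
        if rho >= threshold:
            flagged.append((a, b, rho))
    return flagged


# Invented returns: AAA and BBB move together, CCC moves against them.
returns = {
    "AAA": [0.01, 0.02, -0.01, 0.03],
    "BBB": [0.02, 0.04, -0.02, 0.06],
    "CCC": [0.01, -0.02, 0.01, -0.03],
}
for a, b, rho in correlated_pairs(returns):
    print(f"{a} and {b} look like the same bet (rho={rho:.2f})")
```

High historical correlation is only a proxy for shared macro vulnerability, so this is a sanity check on the model's answer rather than a replacement for it.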
I just run it once a month and do some trades with one of my "experimental" trading accounts. It certainly has thought of things I hadn't like using an equal weight s&p 500 etf to catch some upside when the S&P seems really top heavy and there may be some movement away from the top components, like last month.
This seems entirely like trivial social media bait and nothing like research: "We gave each major LLM a stock trading prompt. You won't believe which performed best!"
sethops1•50m ago
So the results are meaningless - these LLMs have the advantage of foresight over historical data.
itake•48m ago
I wish they could explain what this actually means.