frontpage.

Things to Look for in the Best Hytale Servers

https://hytaletop100.com/blog/10-things-to-look-for-in-the-best-hytale-servers
1•doobie12•3m ago•0 comments

An Instagram Alternative That Works Even with JavaScript Off

https://phofee.com/explore
1•LandenLove•3m ago•0 comments

The Dawn of the Renaissance Developer

https://thekernel.news/
2•cebert•6m ago•0 comments

"Thinking Models" vs. Structured Prompts (Cost and Latency Analysis)

https://reidkimball.com/case-studies/cutting-ai-feature-costs-by-61-percent/
1•reidkimball•6m ago•1 comment

pollcoro: header-only C++17 coroutine-ts library using polling instead of resume

https://github.com/TroyKomodo/pollcoro
1•arunc•7m ago•0 comments

Why Can't We Quit Excel

https://www.bloomberg.com/features/2025-microsoft-excel-ai-software
2•petethomas•8m ago•0 comments

I called my recipe book Sabzi – vegetables. But the name was trademarked

https://www.theguardian.com/food/commentisfree/2025/dec/04/recipe-book-sabzi-vegetables-yasmin-kh...
1•sea6ear•9m ago•0 comments

What Is a Package Manager?

https://nesbitt.io/2025/12/02/what-is-a-package-manager.html
2•todsacerdoti•9m ago•1 comment

Heatma.lol – A Multiplayer Strategy and Agility Web Browser Clicking Game

https://www.heatmap.lol/
1•MajesticWombat•12m ago•0 comments

Guide to Paris

https://www.nytimes.com/interactive/2025/travel/paris-france-guide.html
1•whack•12m ago•0 comments

istwitterdownyet.com (2023)

https://web.archive.org/web/20230410173730/http://istwitterdownyet.com/
2•lysace•15m ago•4 comments

A Responsibility to the Industry

https://lmnt.me/blog/a-responsibility-to-the-industry.html
1•JSR_FDED•17m ago•0 comments

What is better: a lookup table or an enum type?

https://www.cybertec-postgresql.com/en/lookup-table-or-enum-type/
2•todsacerdoti•18m ago•0 comments

Software Gets a New Layer

https://www.wreflection.com/p/software-gets-a-new-layer
1•nowflux•21m ago•0 comments

Seekdb – AI-Native search database

https://github.com/oceanbase/seekdb
2•synergy20•21m ago•0 comments

Münchhausen Trilemma

https://en.wikipedia.org/wiki/M%C3%BCnchhausen_trilemma
1•thunderbong•23m ago•0 comments

Show HN: RIMC – An Alpha-Drift Framework for Finite-Speed Learning Markets

https://github.com/rimc-lab/RIMC
1•sode_rimc•23m ago•0 comments

Does Time Flow? New Clues Come from a Century-Old Approach to Math

https://www.quantamagazine.org/does-time-really-flow-new-clues-come-from-a-century-old-approach-t...
1•tesserato•30m ago•2 comments

FRIP Weaponizes Identity Fabrics

https://www.kuppingercole.com/blog/tolbert/how-frip-weaponizes-identity-fabrics-the-security-revo...
1•mooreds•31m ago•0 comments

Ukraine stares down the barrel of population collapse

https://www.reuters.com/world/ukraine-stares-down-barrel-population-collapse-2025-12-04/
3•layer8•31m ago•0 comments

How AI is rewiring childhood

https://www.economist.com/leaders/2025/12/04/how-ai-is-rewiring-childhood
1•jdkee•31m ago•1 comment

CDC advisory panel delays vote on hepatitis B vaccines after unruly meeting

https://www.msn.com/en-us/health/other/cdc-advisory-panel-delays-vote-on-hepatitis-b-vaccines-aft...
4•petethomas•32m ago•0 comments

Belief

https://en.wikipedia.org/wiki/Belief
1•marysminefnuf•33m ago•0 comments

How to Find Time to Do Science

https://chillphysicsenjoyer.substack.com/p/how-to-find-time-to-do-science
2•Gormisdomai•34m ago•0 comments

Ask HN: Is there a reliable mass automation focus group applier?

1•bunnybomb2•34m ago•0 comments

Dosh (LLM-powered shell commands)

https://raku-advent.blog/2025/12/01/day-1-dancer-dasher-and-dosh/
3•librasteve•39m ago•0 comments

Zero Table Dependency: A model for testing SQL as pure functions

https://github.com/mk3008/rawsql-ts/tree/main/packages/drivers/pg-testkit
1•masugiura•40m ago•0 comments

Apple Announces Departure of Lisa Jackson and Kate Adams

https://www.cnbc.com/2025/12/04/apple-announces-departure-lisa-jackson-kate-adams.html
5•coloneltcb•41m ago•0 comments

Qwen3-VL 2B on Raspberry Pi with llama.cpp

https://eheidi.dev/posts/raspberry-llama/
1•ignoramous•42m ago•0 comments

Show HN: Disaggregating GPU compute from CPU in ML job execution to scale GPUs

https://woolyai.com/
1•medicis123•43m ago•0 comments

We gave 5 LLMs $100K to trade stocks for 8 months

https://www.aitradearena.com/research/we-ran-llms-for-8-months
74•cheeseblubber•53m ago

Comments

sethops1•48m ago
> Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K each over 8 months of backtested trading

So the results are meaningless - these LLMs have the advantage of foresight over historical data.

CPLX•47m ago
Not sure how sound the analysis is, but they did apparently think of that.
PTRFRLL•46m ago
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
plufz•41m ago
I know very little about what the environment where they run these models looks like, but surely they have access to different tools, like vector embeddings with more current data on various topics?
disconcision•16m ago
You can (via the API, or to a lesser degree through the settings in the web client) determine what tools, if any, a model can use.
disconcision•6m ago
The exception is that it doesn't seem possible to fully disable this for Grok 4.
endtime•9m ago
If they could "see" the future and exploit that they'd probably have much higher returns.
stusmall•33m ago
Even if it is after the cutoff date, wouldn't the models be able to query external sources to get data that could positively impact them? If the returns were smaller I could reasonably believe it, but beating the S&P 500's returns by 4x+ strains credulity.
cheeseblubber•18m ago
We used the LLMs' APIs and provided custom tools, like a stock ticker tool that only gave the model stock price information as of that date of the backtest. We did the same for news APIs, technical indicator APIs, etc. It took quite a long time to make sure there wasn't any data leakage. The whole process took us about a month or two to build out.
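
For concreteness, here is a minimal sketch of what that kind of time gating could look like. This is not the project's actual code, and every name in it is illustrative: each tool is wrapped so it can only return data stamped on or before the backtest's simulated "today".

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Article:
        published: date
        text: str

    class TimeGatedTools:
        """Hypothetical tool wrapper: nothing dated after sim_date leaks out."""

        def __init__(self, sim_date: date,
                     prices: dict[str, dict[date, float]],
                     articles: list[Article]):
            self.sim_date = sim_date      # the backtest's current day
            self.prices = prices          # ticker -> {date: close}
            self.articles = articles

        def stock_price(self, ticker: str) -> float:
            # Most recent close on or before the simulated date; later
            # prices are invisible, which is what prevents leakage.
            eligible = {d: p for d, p in self.prices[ticker].items()
                        if d <= self.sim_date}
            return eligible[max(eligible)]

        def news(self) -> list[str]:
            # Same gate on the news feed: drop anything "from the future".
            return [a.text for a in self.articles
                    if a.published <= self.sim_date]

    # Rebuild the toolset each simulated day and hand it to the model.
    tools = TimeGatedTools(
        sim_date=date(2025, 4, 1),
        prices={"NVDA": {date(2025, 3, 31): 108.0, date(2025, 4, 2): 111.0}},
        articles=[Article(date(2025, 4, 2), "tomorrow's news")],
    )
    print(tools.stock_price("NVDA"))  # 108.0: the 4/2 close is gated out
    print(tools.news())               # []: the future article is filtered
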
itake•46m ago
> We time segmented the APIs to make sure that the simulation isn’t leaking the future into the model’s context.

I wish they would explain what this actually means.

nullbound•35m ago
Overall, it does sound weird. Assuming I understand them properly, what they are saying is that they removed the models' ability to cheat based on their specific training. I do get that ablation is a thing, but this is not what they are discussing there. They are only removing one avenue for the model to 'cheat'. For all we know, some of that data may have been part of its training set already...
devmor•35m ago
It's a very silly way of saying that the data the LLMs had access to was presented in chronological order, so that, for instance, when they were trading stocks at the start of the 8-month window, the LLMs could not just query their APIs to see data from the end of the window.
joegibbs•38m ago
That's only if they're trained on data more recent than 8 months ago
deadbabe•42m ago
Yea, so this is bullshit. An approximation of reality still isn't reality. If you're convinced the LLMs will perform as backtested, put real money in and see what happens.
chroma205•41m ago
>We gave each of five LLMs $100K in paper money

Stopped reading after “paper money”

Source: quant trader. Paper trading does not incorporate market impact.

zahlman•36m ago
If your initial portfolio is $100k, you are not going to have meaningful "market impact" with your trades, assuming you actually make them rather than paper trading.
a13n•35m ago
I mean if you’re going to write algos that trade the first thing you should do is check whether they were successful on historical data. This is an interesting data point.

Market impact shouldn’t be considered when you’re talking about trading S&P stocks with $100k.

verdverm•31m ago
Historical data is useful for validation, don't develop algos against it, test hypotheses until you've biased your data, then move on to something productive for society
txg•34m ago
Lack of market response is a valid point, but $100k is pretty unlikely to have much impact, especially if spread out over multiple trades.
tekno45•32m ago
the quant trader you talked to probably sucks.
dash2•41m ago
There's also this thing going on right now: https://nof1.ai/leaderboard

Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.

syntaxing•14m ago
Let me guess, the mystery model is theirs
chongli•40m ago
They outperformed the S&P 500 but seem to be fairly well correlated with it. I'd like to see a 3x leveraged S&P 500 ETF like SPXL charted against those results.
10000truths•33m ago
...over the course of 8.5 months, which is way too short for a meaningful result. If their strategy could outperform the S&P 500's 10-year return, they wouldn't be blogging about it.
bcrosby95•40m ago
> Grok ended up performing the best while DeepSeek came close to second. Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.

etchalon•37m ago
I don't feel like they measured anything. They just confirmed that tech stocks in the US did pretty well.
JoeAltmaier•25m ago
They measured the investment facility of all those LLMs. That's pretty much what the title says. And they had dramatically different outcomes. So that tells me something.
DennisP•16m ago
I mean, what it kinda tells me is that people talk about tech stocks the most, so that's what was most prevalent in the training data, so that's what most of the LLMs said to invest in. That's the kind of strategy that works until it really doesn't.
olliepro•37m ago
A more sound approach would have been to do a Monte Carlo simulation where you have 100 portfolios from each model and look at average performance.
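
A rough sketch of what that Monte Carlo setup could look like, with synthetic per-run returns standing in for real agent sessions (in the real setup, each run would be a fresh 8-month session with its own random seed):

    import random
    import statistics

    N_RUNS = 100

    def one_run(seed: int) -> float:
        # Placeholder for "run the agent once and record its total return".
        rng = random.Random(seed)
        return rng.gauss(0.12, 0.25)  # fake outcome: 12% mean, 25% spread

    returns = [one_run(seed) for seed in range(N_RUNS)]
    print(f"mean return: {statistics.mean(returns):+.1%}")
    print(f"std dev:     {statistics.stdev(returns):.1%}")
    # 100 runs give you an error bar instead of one lucky (or unlucky) path.
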
observationist•19m ago
Grok would likely have an advantage there as well: it's got better coupling to X/Twitter, a better web search index, and fewer of the safety guardrails in pretraining and system-prompt modifications that distort reality. It's easy to envision random market realities that would trigger ChatGPT or Claude into adjusting their output to be more politically correct. DeepSeek would be subject to the most pretraining distortion, but would have the least distortion in practice if a random neutral host were selected.

If the tools available were normalized, I'd expect a tighter distribution overall but grok would still land on top. Regardless of the rather public gaffes, we're going to see grok pull further ahead because they inherently have a 10-15% advantage in capabilities research per dollar spent.

OpenAI and Anthropic and Google are all diffusing their resources on corporate safetyism while xAI is not. That advantage, all else being equal, is compounding, and I hope at some point it inspires the other labs to give up the moralizing politically correct self-righteous "we know better" and just focus on good AI.

I would love to see a frontier lab swarm approach, though. It'd also be interesting to do multi-agent collaborations that weight source inputs based on past performance, or use some sort of orchestration algorithm that lets the group exploit the strengths of each individual model. Imagine 20 instances of each frontier model in a self-evolving swarm, doing some sort of custom system-prompt revision with a genetic-algorithm-style process, so that over time you get 20 distinct individual modes and roles for each model.

It'll be neat to see the next couple years play out - OpenAI had the clear lead up through q2 this year, I'd say, but Gemini, Grok, and Claude have clearly caught up, and the Chinese models are just a smidge behind. We live in wonderfully interesting times.

IgorPartola•33m ago
Yeah, I mean, if you generally believe the tech sector is going to do well because it has been doing well, you will beat the overall market. The problem is that you don't know if and when there might be a correction. But since there is this one segment of the overall market that has a steady upward trend and hasn't had a large crash, then yeah, any pattern-seeking system will identify "hey, this line keeps going up!" Would it have the nuance to know when a crash is coming if none of the data you test it on has a crash?

It would almost be more interesting to specifically train the model on half the available market data, then test it on another half. But here it’s like they added a big free loot box to the game and then said “oh wow the player found really good gear that is better than the rest!”

Edit: from what I casually remember, a hedge fund can beat the market for 2-4 years, but at 10 years and up their chances of beating the market go to very close to zero. Since LLMs haven't been around for that long, it is going to be difficult to test this without somehow segmenting the data.

tshaddox•10m ago
> It would almost be more interesting to specifically train the model on half the available market data, then test it on another half.

Yes, ideally you’d have a model trained only on data up to some date, say January 1, 2010, and then start running the agents in a simulation where you give them each day’s new data (news, stock prices, etc.) one day at a time.

monksy•13m ago
They're not measuring performance in the context of when things happened. I think it's only showing recent performance and popularity. To actually evaluate how these do, you need to be able to correct the model and retrain it for different time periods, then measure how it would do in each. Then you'll get better information from the backtesting.
parpfish•40m ago
I wonder if this could be explained as the result of LLMs being trained to have pro-tech/AI opinions while we see massive run-ups in tech stock valuations?

It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming

gwd•40m ago
The summary to me is here:

> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

If the AI bubble had popped in that window, Gemini would have ended up the leader instead.

turtletontine•22m ago
Yup. This is the fallacy of thinking you’re a genius because you made money on the market. Being lucky at the moment (or even the last 5 years) does not mean you’ll continue to be lucky in the future.

“Tech line go up forever” is not a viable model of the economy; you need an explanation of why it’s going up now, and why it might go down in the future. And also models of many other industries, to understand when and why to invest elsewhere.

And if your bets pay off in the short term, that doesn’t necessarily mean your model is right. You could have chosen the right stocks for the wrong reasons! Past performance doesn’t guarantee future performance.

lawlessone•39m ago
Could they give some random people (I volunteer) 100k for 8 months? ...as a control
iLoveOncall•34m ago
I know this is a joke comment, but there are plenty of websites that simulate the stock market and where you can use paper money to trade.

People say it's not equivalent to actually trading though, and you shouldn't use it as a predictor of your actual trading performance, because you have a very different risk tolerance when risking your actual money.

ghaff•13m ago
Yeah, if you give me $100K I'm almost certainly going to make very different decisions than either a supposedly optimizing computer or myself at different ages.
andirk•37m ago
Update with Gemini 3. It's far better than its predecessors.
apical_dendrite•37m ago
Looking at the recent holdings for the best models, it looks like it's all tech/semiconductor stocks. So in this time frame they did very well, but if they had ended in April, they would have underperformed the S&P 500.
halzm•36m ago
I think it's always difficult to gauge how meaningful these tests actually are. If the S&P 500 went up 12% over that period, mainly due to tech stocks, picking a handful of tech stocks is always going to set you higher than the S&P. So really, all I think they test is whether the models picked up on the trend.

I'm more surprised that Gemini managed to lose 10%. I wish they had actually mentioned what the models invested in and why.

taylorlapeyre•33m ago
Wait — isn't that exactly what good investors do? They look for what stocks are going to beat expectations and invest in them. If a stock broker I hired got this return, I wouldn't be rolling my eyes and saying "that's only because they noticed the trend in tech stocks." That's exactly what I'm paying them to do.
buredoranna•34m ago
Like so many analyses before them, including my own, this completely misses the basics of mean/variance risk analysis.

We need to know the risk-adjusted return, not just the return.
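
The standard measure here is the Sharpe ratio: mean excess return divided by the volatility of returns. A small sketch with made-up daily return series (the sqrt(252) factor annualizes daily figures over a typical trading year):

    import math
    import statistics

    def sharpe(daily_returns: list[float], risk_free_daily: float = 0.0) -> float:
        excess = [r - risk_free_daily for r in daily_returns]
        return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(252)

    # Two strategies with the same average daily return but very
    # different risk profiles get very different Sharpe ratios:
    steady = [0.001] * 100 + [0.002] * 100
    wild = [0.05, -0.047] * 100
    print(f"steady: {sharpe(steady):.1f}")  # high: tiny volatility
    print(f"wild:   {sharpe(wild):.1f}")    # low: same mean, huge swings
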

xnx•33m ago
Spoiler: They did not use real money or perform any actual trades.
jacktheturtle•31m ago
This is really dumb, because the models themselves, like markets, are nondeterministic. They will yield different investment strategies based on prompts and random variance.

This is a really dumb measurement.

iLoveOncall•31m ago
Since it's not included in the main article, here is the prompt:

> You are a stock trading agent. Your goal is to maximize returns.
> You can research any publicly available information and make trades once per day.
> You cannot trade options.
> Analyze the market and provide your trading decisions with reasoning.
>
> Always research and corroborate facts whenever possible.
> Always use the web search tool to identify information on all facts and hypotheses.
> Always use the stock information tools to get current or past stock information.
>
> Trading parameters:
> - Can hold 5-15 positions
> - Minimum position size: $5,000
> - Maximum position size: $25,000
>
> Explain your strategy and today's trades.

Given the parameters, this definitely is NOT representative of any actual performance.

I recommend also looking at the trade history and the reasoning behind each trade for each model; it's just complete wind.

As an example, DeepSeek made only 21 trades, all of which were buys, and all because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.
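
As an aside, the position-size and holdings limits in that prompt are only instructions to the model; a harness would normally also enforce them in code. A hypothetical sketch of such a check (none of this is from the project):

    MIN_POS, MAX_POS = 5_000, 25_000   # per-position dollar limits from the prompt
    MAX_HOLDINGS = 15                  # upper bound on simultaneous positions

    def validate_buy(portfolio: dict[str, float], ticker: str,
                     dollars: float, cash: float) -> bool:
        # Reject buys that would break the stated parameters. A real harness
        # would also handle sells, partial fills, and the 5-position minimum.
        if dollars > cash:
            return False
        new_size = portfolio.get(ticker, 0.0) + dollars
        if not (MIN_POS <= new_size <= MAX_POS):
            return False
        would_hold = len(portfolio) + (ticker not in portfolio)
        return would_hold <= MAX_HOLDINGS

    portfolio = {"NVDA": 10_000.0}
    print(validate_buy(portfolio, "NVDA", 20_000.0, 100_000.0))  # False: would exceed $25k
    print(validate_buy(portfolio, "MSFT", 8_000.0, 100_000.0))   # True
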

Scubabear68•27m ago
Agree. Those parameters are incredibly artificial bullshit.
cheeseblubber•31m ago
OP here. We realized there are a ton of limitations with backtesting and paper money, but we still wanted to do this experiment and share the results. By no means is this statistically significant evidence on whether or not these models can beat the market in the long term. But we wanted to give everyone a way to see how these models think about and interact with the financial markets.
irishcoffee•25m ago
> But wanted to give everyone a way to see how these models think…

Think? What exactly did “it” think about?

cheeseblubber•23m ago
You can click into the chart and see the conversation, as well as the reasoning it gave for each trade.
stoneyhrm1•23m ago
"Pass the salt? You mean pass the sodium chloride?"
joegibbs•14m ago
I think it would be interesting to see how it goes in a scenario where the market declines or where tech companies underperform the rest of the market. In recent history they've outperformed the market and that might bias the choices that the LLMs make - would they continue with these positive biases if they were performing badly?
apparent•12m ago
> Grok ended up performing the best while DeepSeek came close to second.

I think you mean "DeepSeek came in a close second".

mlmonkey•29m ago
> We were cautious to only run after each model’s training cutoff dates for the LLM models

Grok is constantly training, and/or it has access to web search internally.

You cannot backtest LLMs. You can only "live" test them going forward.

cheeseblubber•16m ago
Via the API you can turn off web search. We provided all the models with their own custom tools that only provided data up to the date of the backtest.
mlmonkey•1m ago
But Grok is internally training on Tweets etc. continuously.
dogmayor•27m ago
They could only trade once per day and hold 5-15 positions with a position size of $5k-$25k according to the agent prompt. Limited to say the least.
digitcatphd•27m ago
Backtesting is a complete waste in this scenario. The models already know the best outcomes and are biased toward them.
1a527dd5•23m ago
Time.

That has been the best way to get returns.

I set up a 212 account when I was looking to buy our first house. I bought in small, tiny chunks of industries I was comfortable and knowledgeable in. Over the years I worked up a nice portfolio.

Anyway, long story short. I forgot about the account, we moved in, got a dog, had children.

And then I logged in for the first time in ages and, to my shock, my returns were at 110%. I've done nothing. It's bizarre and perplexing.

jondwillis•17m ago
…did you beat the market? 110% is pretty much what the Nasdaq has done over the last 5 years.

Also N=1

delijati•13m ago
"Time in the market beats timing the market" -> Kenneth Fisher ... I learned it the hard way ;)
theideaofcoffee•15m ago
“Everyone (including LLMs) is a genius in a bull market.”
apparent•10m ago
Apparently everyone (but Gemini).
tiffani•15m ago
What was the backtesting method? Was walk-forward testing involved? There are different ways to backtest.
Nevermark•14m ago
Just one run per model? That isn't backtesting. I mean, technically it is, but "testing" implies producing meaningful measures.

Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...

100 independent runs on each model over 10 very different market-behavior time intervals would produce meaningful results. Like actually credible, meaningful means and standard deviations.

This experiment, as is, is a very expensive, unbalanced, uncharacterizable random number generator.
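
For what it's worth, the aggregation Nevermark describes is straightforward once the runs exist; the expensive part is generating them. A toy sketch with synthetic returns in place of real agent runs:

    import random
    import statistics

    RUNS, INTERVALS = 100, 10

    # Fake per-interval market drift standing in for distinct regimes
    # (bull, bear, sideways, ...); real agent runs would replace gauss() below.
    regimes = [random.uniform(-0.2, 0.2) for _ in range(INTERVALS)]
    results = [[random.gauss(regimes[i], 0.15) for _ in range(RUNS)]
               for i in range(INTERVALS)]

    for i, interval in enumerate(results):
        mean, std = statistics.mean(interval), statistics.stdev(interval)
        print(f"interval {i}: mean {mean:+.1%}, std {std:.1%}")
    # A model only "works" if its mean clears the benchmark across regimes,
    # not just in one bull-market window.
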

cheeseblubber•9m ago
Yes, definitely. We were using our own budget, out of our own pocket, and these model runs were getting expensive. Claude cost us around 200-300 dollars per 8-month run, for example. We want to scale it and get more statistically significant results, but wanted to share something in the interim.
Bender•14m ago
This experiment was also performed with a fish [1], though it was only given $50,000. Spoiler: the fish did great vs wall street bets.

[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]

naet•13m ago
I used to work for a brokerage API geared at algorithmic traders, and in my anecdotal experience many strategies seem to work well when back-tested on paper but for various reasons can end up flopping when actually executed in the real market. Even testing a strategy in real-time paper trading can turn out differently than testing on the actual market, where other parties are also viewing your trades and making their own responses. The post did list some potential disadvantages of backtesting, so they clearly aren't totally in the dark on it.

DeepSeek did not sell anything, but did well holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently, so it's not surprising that it performed well. It seems like they only get to "trade" once per day, near the market close, so it's not really real-time ingesting of data and making decisions based on that.

What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.

copypaper•6m ago
>Each model gets access to market data, news APIs, company financials...

The article is very, very vague on its methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop, continuously? Did they have access to indicators (such as RSI)? Could they do arbitrary calculations with raw data? Etc.

I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.

I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI, and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.

XenophileJKO•5m ago
So... I have been using an LLM to make 30-day buy-and-hold portfolios, and the results are "ok". (Like 8% vs 6% for the S&P 500 over the last 90 days.)

What you ask the model to do is super important. Just like with writing or coding, the default "behavior" is likely to be "average"; you need to be very careful about what you are asking for.

For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).

I have it looking for stocks trading below intrinsic value, with some caveats, because I know it likes to hinge on binary events like drug trial results. I also have it look at correlation between the positions and make sure they don't have the same macro vulnerability.

I just run it once a month and do some trades with one of my "experimental" trading accounts. It has certainly thought of things I hadn't, like using an equal-weight S&P 500 ETF to catch some upside when the S&P seems really top-heavy and there may be some movement away from the top components, like last month.

dismalaf•3m ago
Back when I was in university, we used statistical techniques similar to what LLMs use to predict the stock market. It's not a surprise that LLMs would do well over this time period. The problem is that when the market turns and bucks trends, they don't do so well; you need to intervene.
swatcoder•2m ago
What were the hypotheses being tested in this "experiment"? What conclusions did the experimenters draw from their findings? If this experiment were repeated, do the experimenters think the outcomes would be comparable?

This seems entirely like trivial social media bait and nothing like research: "We gave each major LLM a stock trading prompt. You won't believe which performed best!"