An H100 is not an everyday product. A laptop is.
edit: fixed typo
Anyway, I was assuming personal use, like the messing-around experimenting that the article is about. (Or who knows, maybe it was part of the author’s job.)
"Pull out credit card, sign up for some thing and pay a bit of money" is a non-trivial bit of friction! Extremely non-trivial!
Especially in a corporate context - you have to get the expense approved. It's not clear if you can put company data onto the machine. Whereas generally running local things on corporate laptops is far less controversial.
"Download this tool and run it." is still an extremely powerful pitch. Pretty much the only thing that beats it is "go to this website which you can use without any signup or payment".
Anyway, I thought the context was doing stuff for personal use/fun, not work.
There are also no overages or hidden charges with a laptop, beyond simply breaking it. And you know the replacement cost ahead of time.
On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model you can train in under an hour?
Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.
In terms of compute efficiency though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple were putting up a real fight.
We can / should benchmark and optimize this to death on all axes
What exactly is your point? That instead of expressing workloads in terms of what a laptop could do, you prefer to express them in terms of what a MacBook Pro could do?
Of course that only works if the trial runs are representative of what your full-scale model will look like. But within those constraints, optimising training time seems very valuable.
32-core CPU
80-core GPU
512GB RAM
https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...

Apple M3 Ultra (GPU - 80 cores) scores 7235.31
NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
Note that NVIDIA's memory constraints are not like Apple silicon's, which also tends to be less I/O-constrained. YMMV.
https://www.youtube.com/watch?v=d8yS-2OyJhw
https://www.youtube.com/watch?v=Ju0ndy2kwlw
Apple M3/M4 silicon is certainly good in some ways, but the bottleneck is often the lack of CUDA software support, plus price (you could buy >4x the raw GPU performance in a dual RTX 5090 desktop). =3
I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?
Might still be early days. I'm trying to use the model to sort my local notes, but I don't know, man: it seems only a little faster yet still unusable, and I downloaded the lighter Qwen model as recommended.
Again, it's early days and maybe I'm being an idiot. I did manage to get it to parse one note after about 15 minutes, though.
gpt-oss-20b eats too much RAM to use for anything other than an overnight task. Maybe 3 tok/s.
Been playing around with the 8B versions of Qwen and DeepSeek. Seems usable so far. YMMV; I'm just messing around in chat at the moment and haven't really had it do any tasks for me.
As an alternative you might consider a Ryzen AI Max+ 395, like in the Framework Desktop or HP ZBook G1a, but the 128GB versions are still extremely expensive. The Asus Flow Z13 is a tablet with the same chip, but it's hardly available with 128GB.
—
I know, I know. I’m intentionally misinterpreting the OP’s clear intent (the stuff of comedy). And normally a small joke like this wouldn’t be worth the downvotes…
But, I think there’s a deeper double meaning in this brave new world of prompt engineering. Most chat isn’t all that precise without some level of assumed shared context:
These days the meaning of the term AI has changed from the classical definition (all algorithms welcome); now AI usually means LLMs and their derivatives.
A 5yo also has... 5 years of cumulative real world training. I'm a bit of an AI naysayer but I'd say the comparison doesn't seem quite accurate.
Isn't there a defining line between something like tic-tac-toe, which has a finite (and, for a computer, pretty limited) set of possible combinations, where it seems like you shouldn't need a training set larger than that set of combinations, and something more open-ended, where the size of your training set mainly affects accuracy?
It's just 3^9, right? 9 boxes, each either X, O, or blank? That's only 19,683 game states, and we'd trim down from there if we account for the cases above.
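A quick brute-force check of that arithmetic; note the validity filter here is a simplification (it only drops boards with impossible X/O counts, not positions that are unreachable in a real game):

```python
from itertools import product

# Every assignment of X/O/blank to 9 cells: 3^9 = 19,683 boards.
all_boards = list(product("XO.", repeat=9))
print(len(all_boards))  # 19683

def plausible(board):
    """Keep boards whose X/O counts could occur in a real game:
    X moves first, so X count equals O count or exceeds it by one."""
    x, o = board.count("X"), board.count("O")
    return x - o in (0, 1)

print(sum(plausible(b) for b in all_boards))  # well under 19,683
```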
Would be curious to see a version of your model-size comparison chart, but letting the training continue until perplexity plateaus / begins to overfit. For example: are your larger models performing worse because they are overfitting to a small dataset, or because you are comparing model sizes at a fixed 5-minute computation time, so that the large models just don't get to learn very much in that time?
(Also interesting would be learning-curve comparisons across architectures/parameter counts.)
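A minimal sketch of what "train until perplexity plateaus" could look like; the model, random stand-in data, and patience threshold below are placeholders, not the article's setup:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data: random token sequences over a tiny vocabulary.
vocab, seq_len = 64, 32
train_x = torch.randint(0, vocab, (2048, seq_len))
val_x = torch.randint(0, vocab, (256, seq_len))

# Stand-in model: embedding + MLP predicting the last token from its prefix.
model = nn.Sequential(
    nn.Embedding(vocab, 128), nn.Flatten(),
    nn.Linear(128 * (seq_len - 1), 256), nn.ReLU(),
    nn.Linear(256, vocab),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def val_perplexity():
    model.eval()
    with torch.no_grad():
        loss = loss_fn(model(val_x[:, :-1]), val_x[:, -1])
    return math.exp(loss.item())

best, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(1000):              # no fixed wall-clock budget
    model.train()
    for i in range(0, len(train_x), 64):
        batch = train_x[i:i + 64]
        opt.zero_grad()
        loss = loss_fn(model(batch[:, :-1]), batch[:, -1])
        loss.backward()
        opt.step()

    ppl = val_perplexity()
    print(f"epoch {epoch}: val perplexity {ppl:.2f}")
    if ppl < best - 1e-3:              # still improving
        best, bad_epochs = ppl, 0
    else:                              # plateau / starting to overfit
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```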
Neural-type models long ago passed the point where Markov chains made any sense, by many orders of magnitude.
Markov models fail by being too opinionated about the style of compute.
In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. Given large enough tensors, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.
All other neural architectures, then, are simply sparser arrangements that bring compute demands down, where the sparseness is fit to the type of problem.
Sparseness can mean deeper but narrower information flows (thus "deep" learning), or fewer weights per weight application (i.e. shared weights, as in convolutions).
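As a small, hedged illustration of the "linear layer + non-linearity" building block described above (the static case only; recurrence for dynamical mappings is left out, and the target function and sizes are arbitrary choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary 1-D target mapping, just to have something to fit.
x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(2 * x) + 0.3 * x**2

# Two "linear map + non-linearity" layers: with enough hidden units,
# this shape can approximate essentially any static mapping.
net = nn.Sequential(
    nn.Linear(1, 256), nn.Tanh(),   # layer 1: linear map + non-linear squash
    nn.Linear(256, 1),              # layer 2: linear readout
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.4f}")  # should end up small
```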
That said, it does make it easier to claim progress...
Makes me want to try training a model to sing "Daisy, Daisy..."
The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.
As an aside: I don't know what the dataset is in the biological analogy, maybe the agar plate. A super simple and controlled environment in which to study simple organisms.
For ref:
- Podcast ep: https://www.cognitiverevolution.ai/the-tiny-model-revolution...
- TinyStories paper: https://arxiv.org/abs/2305.07759
On a laptop, on a desktop, on a phone?
Train for 5 minutes, an hour, a day, a week?
On a boat? With a goat?
That boat will float your goat!
I think you meant Llama.
The rhymes are admittedly more limited, unless you have a Boston accent.
But honestly I really like the short turnaround times. Makes it easy to experiment with different parameters and develop an intuition for what they do.
The lesson here is that you can't use a laptop to train a useful model - at least not without running that training for probably decades.
That doesn't mean you can't run a useful model on a laptop that was trained on larger hardware. I do that all the time - local models got really good this year.
> reducing model size while retaining capability will just never happen.
Tell that to Qwen3-4B! Those models are remarkably capable.
Local models are nowhere near as capable as the big frontier models.
While a small model might be fine for your use case, it cannot replace Sonnet 4 for me.
But it is massively more capable than the 4GB models we had last year.
Meanwhile, recent models that are within the same ballpark of capabilities as Claude Sonnet 4 - like GLM 4.5, Kimi K2, and the largest of the Qwen 3 models - can just about fit on a $10,000 Mac Studio with 512GB of RAM. That's a very notable trend.
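Rough back-of-envelope for why that "just about fits"; the parameter counts are approximate public figures, and quantization overhead and KV cache are ignored:

```python
# Rough memory math: bytes ≈ parameters * bits_per_weight / 8.
# Parameter counts below are approximate; treat them as illustrative.
models = {
    "Kimi K2 (~1T params)": 1_000e9,
    "GLM 4.5 (~355B params)": 355e9,
    "Qwen3-235B (~235B params)": 235e9,
}

for name, params in models.items():
    for bits in (16, 8, 4):
        gb = params * bits / 8 / 1e9
        print(f"{name} @ {bits}-bit ≈ {gb:,.0f} GB")
```

At roughly 4 bits per weight, even the ~1T-parameter model lands near 500 GB, which is why 512GB of unified memory is "just about" enough.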
Even if the local models get 10x as good next year, it won't matter to me if the frontier models are still better.
And even if we can run those models (heavily quantized, and thus less capable), they are unusably slow on that $10k of dead-weight hardware.
There's also a whole world of data curation to improve training, which is likely to be great for small models and seems still underexplored.
Once you set up a good system prompt on these, nothing really compares.
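For reference, a minimal sketch of wiring a system prompt into an OpenAI-compatible chat API; the base_url, model name, and prompt text are placeholders, and the same message structure works for hosted models or local servers:

```python
from openai import OpenAI

# Placeholder endpoint (e.g. a local server); swap for your provider.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

SYSTEM_PROMPT = (
    "You are a careful coding assistant. Push back on requests that look "
    "wrong instead of agreeing, and briefly explain the trade-offs."
)

response = client.chat.completions.create(
    model="qwen3:8b",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Refactor this loop into a comprehension."},
    ],
)
print(response.choices[0].message.content)
```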
Most of the models you see with high benchmarks are not even comparable on real tasks.
Qwen3 or DeepSeek R1 aren't even 1/10 as good as Gemini 2.5 Pro.
If the model trusts the user, and the user is dumb, the model will "weigh" the user's input much too heavily and end up with flawed code.
If the model is more independent, it will find the right solution. If you just want a dumb model that says yes to everything and follows you even when you're not smart enough, you'll never end up with a good solution except by luck.
While other models like Qwen3 and GLM promise big, in real code writing they fail badly and get stuck in loops.
The only problem I run into with Gemini right now is that I get throttled every now and then with an empty response, especially around this time.
https://www.ioccc.org/2019/mills/index.html
I suppose if you only have 5 minutes this is probably about the level you'd get.
$5599.00 https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
Although you can get one with lower specs and the same GPU for $3,899.99:
https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
Bond has only minutes to train a strong enough AI model to pretend to be him and fool his targets long enough for him to gain entry to their impregnable fortress. Can he do it?!?
https://m.youtube.com/shorts/4qN17uCN2Pg
"You're absolutely right!"