An H100 is not an everyday product. A laptop is.
edit: fixed typo
Maybe an analogy could be made to espresso: nice espresso machines get costlier, but you can still get quite good results out of a manual machine like a Flair.
I think this is why the suggestion to rent a machine is not too helpful. In this analogy we’re on BaristaNews: we all know about the industrial machines, and lots of folks use them at work. But the topic of what sort of things you can do on your manual machine at home has come up.
No, reasonably-priced coffee machines are an enabling factor for many people.
If coffee machines weren't reasonably priced, they would not be "very widespread".
Anyway, I was assuming personal use, like the messing-around experimenting that the article is about. (Or who knows, maybe it was part of the author’s job.)
"Pull out credit card, sign up for some thing and pay a bit of money" is a non-trivial bit of friction! Extremely non-trivial!
Especially in a corporate context: you have to get the expense approved, and it's not clear if you can put company data onto the machine. Running local things on corporate laptops, on the other hand, is generally far less controversial.
"Download this tool and run it." is still an extremely powerful pitch. Pretty much the only thing that beats it is "go to this website which you can use without any signup or payment".
Anyway, I thought the context was doing stuff for personal use/fun, not work.
In my personal life, when it's time for fun, I close the laptop and go do some gardening.
Cloud H100s don't count because you need a lawyer to review the ToS and other agreements.
There are also no overages or hidden charges with a laptop, beyond simply breaking it, and you know the replacement cost ahead of time.
On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model you can train in under an hour?
Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.
In terms of compute efficiency, though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple was putting up a real fight.
We can and should benchmark and optimize this to death on all axes.
What exactly is your point? That instead of expressing workloads in terms of what a laptop could do, you prefer to express them in terms of what a MacBook Pro could do?
"Best model you can train with X joules" is a fairer contest that multiple people could take part in even if they have different hardware available. It's not completely fair, but it's fair enough to be interesting.
Training models with an energy limit is an interesting constraint that might lead to advances. Currently LLMs implement online learning by having increasingly large contexts that we then jam "memories" into, so there is a strict demarcation between information learned during pre-training and during use. New, more efficient approaches to training could perhaps inform new approaches to memory that are less heterogeneous.
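Incidentally, for anyone who actually wants to run the "best model in X joules" contest: a rough way to meter the budget on an NVIDIA GPU is to poll nvidia-smi's reported power draw and integrate it over the run. A sketch only; it samples once a second, ignores CPU and RAM draw, and train_one_step below is a hypothetical stand-in for whatever training loop you use.

    import subprocess, time, threading

    # Rough GPU energy meter: polls nvidia-smi and integrates watts into joules.
    # Treat the result as a ballpark figure, not an audit.
    class GpuEnergyMeter:
        def __init__(self, interval=1.0):
            self.interval = interval
            self.joules = 0.0
            self._stop = threading.Event()
            self._thread = threading.Thread(target=self._run, daemon=True)

        def _power_watts(self):
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=power.draw",
                 "--format=csv,noheader,nounits"])
            return float(out.decode().strip().splitlines()[0])

        def _run(self):
            last = time.time()
            while not self._stop.is_set():
                time.sleep(self.interval)
                now = time.time()
                self.joules += self._power_watts() * (now - last)
                last = now

        def __enter__(self):
            self._thread.start()
            return self

        def __exit__(self, *exc):
            self._stop.set()
            self._thread.join()

    # Usage: stop training once the budget is spent.
    # with GpuEnergyMeter() as meter:
    #     while meter.joules < 100_000:   # e.g. a 100 kJ budget
    #         train_one_step()            # hypothetical training step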
tl;dr: more dimensionally correct
Of course, that only works if the trial runs are representative of what your full-scale model will look like. But within those constraints, optimising training time seems very valuable.
32-core CPU
80-core GPU
512GB RAM
https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...
Apple M3 Ultra (80-core GPU) scores 7235.31
NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
Note that NVIDIA's memory constraints are not like Apple silicon's, which also tends to be less I/O-constrained. YMMV
https://www.youtube.com/watch?v=d8yS-2OyJhw
https://www.youtube.com/watch?v=Ju0ndy2kwlw
Apple M3/M4 silicon is certainly good in some ways, but the bottlenecks are often the lack of CUDA software support and price (you could buy >4x the raw GPU performance in a dual RTX 5090 desktop). =3
I would wager that Apple recognizes the value proposition of the Mac being used for AI and will up their memory bandwidth to stay in the game.
Once an NVIDIA card has cached a model into its VRAM, it doesn't get hit with the cost of copying the data over the bus.
Yet as many people have noticed, who cares if the M3 Ultra takes four times as long if the faster alternative simply won't fit the larger models? YMMV =3
I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?
Might still be early days. I'm trying to use the model to sort my local notes, but I don't know, man; it seems only a little faster yet still unusable, and I downloaded the lighter Qwen model as recommended.
Again, it's early days and maybe I'm being an idiot. I did manage to get it to parse one note after about 15 minutes, though.
gpt-oss-20b eats too much RAM to use for anything other than an overnight task; maybe 3 tok/s.
Been playing around with the 8B versions of Qwen and DeepSeek. Seems usable so far. YMMV; I'm just messing around in chat at the moment and haven't really had it do any tasks for me.
As an alternative you might consider a Ryzen AI Max+ 395, like in the Framework Desktop or HP ZBook G1a, but the 128GB versions are still extremely expensive. The Asus Flow Z13 is a tablet with the same chip, but it's hardly available with 128GB.
—
I know, I know. I’m intentionally misinterpreting the OP’s clear intent (the stuff of comedy). And normally a small joke like this wouldn’t be worth the downvotes…
But, I think there’s a deeper double meaning in this brave new world of prompt engineering. Most chat isn’t all that precise without some level of assumed shared context:
These days the meaning of the term AI has changed from the classical definition (all algorithms welcome); now AI usually means LLMs and their derivatives.
A 5yo also has... 5 years of cumulative real-world training. I'm a bit of an AI naysayer, but I'd say the comparison doesn't seem quite accurate.
Isn't there a bit of a dividing line between something like tic-tac-toe, which has a finite (and, for a computer, pretty small) set of possible combinations, where it seems like you shouldn't need a training set larger than that set of combinations, and something more open-ended, where the size of your training set mainly affects accuracy?
It's just 3^9, right? Nine boxes, each either X, O, or blank? That's only 19,683 raw game states, and we'd trim down from there if we account for the cases above.
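A quick sanity check in code, as a sketch (it assumes X always moves first and filters only the obviously illegal boards):

    from itertools import product

    def plausible(board):
        # board is a tuple of 9 cells, each 'X', 'O', or ' '
        x, o = board.count('X'), board.count('O')
        if not (x == o or x == o + 1):          # X always moves first
            return False
        lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
        x_win = any(all(board[i] == 'X' for i in line) for line in lines)
        o_win = any(all(board[i] == 'O' for i in line) for line in lines)
        if x_win and o_win:                      # both sides can't have three in a row
            return False
        if x_win and x != o + 1:                 # X's winning move must be the last one
            return False
        if o_win and x != o:                     # O's winning move must be the last one
            return False
        return True

    all_boards = list(product('XO ', repeat=9))   # 3^9 = 19,683 raw assignments
    legal = [b for b in all_boards if plausible(b)]
    print(len(all_boards), len(legal))            # should print 19683 and 5478

Tiny by any training-set standard either way.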
What would someone do with a year's worth of recorded conversations? Would the other parties be identified? How would it be useful, if at all? How about analyzing the sounds/waveform rather than the words? (e.g. BioAcousticHealth / vocal biomarkers)
Perhaps typing into a text field is the problem right now? Maybe have a HUD in a pair of glasses. Better than getting a brain chip! The most recent or most repeated conversations would be the most important. Could lead to a reduction in isolation within societies, in favor of "AI training parties." Hidden questions in oneself answered by a robot guru as bedtime storytelling, but related to the real world and real events.
Smart Glasses --> Smart Asses
Vibe Coding --> Tribe Loading
Everything Probable --> Mission Impossible
Would be curious to see a version of your model size comparison chart, but letting the training continue until perplexity plateaus / begins to overfit. For example: are your larger models performing worse because they are overfitting to a small dataset, or because you are comparing model sizes at a fixed 5-minute computation time, so that the large models just don't get to learn very much in that time?
(Also interesting would be learning-curve comparisons between architectures/param counts.)
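Something like a standard patience-based stopping rule would separate the two cases. A minimal sketch, where train_one_epoch and eval_perplexity are hypothetical stand-ins for the article's training and eval code:

    import math

    # Sketch of "train until perplexity plateaus" instead of a fixed time budget.
    def train_until_plateau(model, patience=3, max_epochs=1000):
        best = math.inf
        bad_epochs = 0
        history = []
        for epoch in range(max_epochs):
            train_one_epoch(model)                 # hypothetical
            ppl = eval_perplexity(model)           # hypothetical, on held-out data
            history.append(ppl)
            if ppl < best - 1e-3:                  # meaningful improvement
                best, bad_epochs = ppl, 0
            else:
                bad_epochs += 1                    # flat or starting to overfit
            if bad_epochs >= patience:
                break
        return best, history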
"It took us 48 hours to build the suffix array for RedPajama on a single node with 128 CPUs and 1TiB RAM"
That is pretty astonishing, in my opinion.
Neural-type models have long since passed, by many orders of magnitude, the point where Markov chains made any sense.
Markov models fail by being too opinionated about the style of compute.
In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. Given large enough tensors, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.
All other neural architectures, then, are simply sparser arrangements that bring compute demands down, where the sparseness is fit to the type of problem.
Sparseness can take the form of deeper but narrower information flows (thus "deep" learning), or of fewer weights per weight application (i.e. shared weights, like convolutions).
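To make the "linear tensor + non-linear function" point concrete for the static case, here is roughly the smallest demo: a single hidden tanh layer fit to a nonlinear target by plain gradient descent, with no architectural priors. A toy sketch, not a serious trainer.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 256).reshape(-1, 1)
    y = np.sin(x)                           # arbitrary nonlinear target

    H = 32                                  # hidden width
    W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
    W2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)
    lr = 0.05

    for step in range(5000):
        h = np.tanh(x @ W1 + b1)            # the nonlinearity buys the flexibility
        pred = h @ W2 + b2
        err = pred - y                      # d(MSE)/d(pred), up to a constant
        # backprop through the two linear maps
        gW2 = h.T @ err / len(x);  gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = x.T @ dh / len(x);   gb1 = dh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

    print("final MSE:", float(((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2).mean()))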
An HMM consists of a state space, a state transition matrix, and an output probability matrix. A token space of 50k and a state space of something like 60k would have seemed impossible 10-20 years ago. It has only recently become viable.
Training using Baum-Welch on a big enough text dataset would be interesting. It should be much faster than backpropagation with a transformer model.
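For a feel of what that looks like, a toy version using hmmlearn, which trains with EM/Baum-Welch under the hood (this assumes a recent hmmlearn release that ships CategoricalHMM; the real 50k-token, 60k-state setting is obviously far beyond this):

    import numpy as np
    from hmmlearn import hmm   # assumption: hmmlearn >= 0.3 with CategoricalHMM

    # Toy character-level HMM as a sketch of the idea, not a language model.
    text = "the quick brown fox jumps over the lazy dog " * 50
    vocab = sorted(set(text))
    ids = np.array([[vocab.index(c)] for c in text])   # shape (n_samples, 1)

    model = hmm.CategoricalHMM(n_components=16, n_iter=25, random_state=0)
    model.fit(ids)                                     # Baum-Welch / EM

    # Sample a short sequence from the fitted model
    sampled, _states = model.sample(40)
    print("".join(vocab[i] for i in sampled.ravel()))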
That said, it does make it easier to claim progress...
Right now I'm just happy when people include parameter counts, GMACs (or FLOPs), and throughput. I always include those and the GPUs I used. I also frequently include more information in the appendix, but frankly, when I include it in the front matter the paper is more likely to be rejected.
I can tell you why this isn't happening, though. There's a common belief that scale is all you need, which turns into "fuck the GPU poor". I've published works where my model is 100x smaller (with higher throughput and far lower training costs), and the responses from reviewers tend to be along the lines of "why isn't it better?" or "why not just distill or prune a large model?" There's this weird behavior that makes the black box stay a black box. I mean, Yi Tay famously said "Fuck theorists" on Twitter.
Makes me want to try training a model to sing "Daisy, Daisy..."
The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.
As an aside: I don't know what the dataset is in the biological analogy; maybe the agar plate. A super simple and controlled environment in which to study simple organisms.
For ref: - Podcast ep https://www.cognitiverevolution.ai/the-tiny-model-revolution... - tinystories paper https://arxiv.org/abs/2305.07759
As someone in biotech, 90% of the complaints I hear over lunch are not about bad results, but about bad mistakes during the experiment. E.g. someone didn't cover their mouth while pipetting and now the plates are unusable.
https://arxiv.org/abs/2304.15004
Good article about why here; this helped me understand a lot:
https://www.wired.com/story/how-quickly-do-large-language-mo...
https://www.youtube.com/watch?v=AgkfIQ4IGaM
That's not a mirage; it's clearly a capability that a smaller model cannot demonstrate. A model with fewer parameters and fewer hidden layers cannot have a neuron that lights up when it detects a face.
As the number of neurons increases, the best face/non-face distinguisher neuron gets better and better, but there's never a size where the model cannot recognize faces at all and then you add just a single neuron that recognizes them perfectly.
True
> then you add just a single neuron that recognizes them perfectly
Not true.
Don't think in terms of neurons, think in terms of features. A feature can be spread out over multiple neurons (polysemanticity), I just use a single neuron as a simplified example. But if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.
The Universal Approximation Theorem implies that a large enough network to perfectly achieve that goal would exist (let's call it size n or larger), so eventually you'd get what you want between 0 and n neurons.
You could remove any one of those neurons before retraining the model from scratch, and polysemanticity would slightly increase while performance slightly decreases, but really only slightly. There are no hard size thresholds, just a spectrum of more or less accurate approximations.
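For reference, the classical single-hidden-layer statement (the Cybenko/Hornik form) only promises that some finite width achieves an epsilon-approximation; it says nothing about a sharp threshold:

    % Universal approximation, one hidden layer, non-polynomial activation \sigma:
    % for any continuous f on a compact K \subset \mathbb{R}^d and any \varepsilon > 0,
    % there exist N, \alpha_i, b_i \in \mathbb{R}, w_i \in \mathbb{R}^d such that
    \sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} \alpha_i \,
        \sigma\bigl( w_i^{\top} x + b_i \bigr) \Bigr| < \varepsilon

Nothing in that statement pins down the required N, which is consistent with the "spectrum of approximations" point.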
On a laptop, on a desktop, on a phone?
Train for 5 minutes, an hour, a day, a week?
On a boat? With a goat?
That boat will float your goat!
And for the younger folk, this mp3 player was the precursor to Spotify:
I set up a 10-year-old computer for them instead, running Linux Mint MATE, and it's perfect.
I think you meant Llama.
The rhymes are admittedly more limited, unless you have a Boston accent.
Dr Seuss ftw
I still think this would be kinda cool. I could see a tournament providing the power source in addition to the chess clock. Then gamesmanship where you play moves you hope are expensive for the opponent but not for your own AI.
But honestly I really like the short turnaround times. Makes it easy to experiment with different parameters and develop an intuition for what they do.
The lesson here is that you can't use a laptop to train a useful model - at least not without running that training for probably decades.
That doesn't mean you can't run a useful model on a laptop that was trained on larger hardware. I do that all the time; local models got really good this year.
> reducing model size while retaining capability will just never happen.
Tell that to Qwen3-4B! Those models are remarkably capable.
Local models are nowhere near as capable as frontier big models.
While a small model might be fine for your use case, it cannot replace Sonnet 4 for me.
But it is massively more capable than the 4GB models we had last year.
Meanwhile, recent models that are within the same ballpark of capability as Claude Sonnet 4 (GLM 4.5, Kimi K2, and the largest of the Qwen 3 models) can just about fit on a $10,000 Mac Studio with 512GB of RAM. That's a very notable trend.
The local models can get 10x as good next year, it won't matter to me if the frontier models are still better.
And even when we can run those models (heavily quantized, and thus less capable), they are unusably slow on that $10k of dead-weight hardware.
I've been using Mistral Small 3.x for a bunch of tasks on my own PC and it has been very useful, especially after I wrote a few custom tools with llama.cpp to make it more "scriptable".
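For anyone wanting to do something similar: one minimal way to make a local model scriptable is to run llama.cpp's llama-server, which exposes an OpenAI-compatible endpoint, and call it from a few lines of Python. A sketch only; the port, model name, and prompt below are assumptions for the example, and the author's actual tools may look nothing like this.

    import requests

    # Assumes llama-server is running locally with its default HTTP port.
    def ask(prompt, url="http://localhost:8080/v1/chat/completions"):
        resp = requests.post(url, json={
            "model": "mistral-small",   # name is informational for llama-server
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        }, timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    print(ask("Summarise this note in one line: buy milk, fix bike, email Sam"))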
There's also a whole world of data curation to improve training, which is likely to be great for small models and seems still underexplored.
Once you set up a good system prompt on these, nothing really compares.
Most of the models you see with high benchmarks are not even comparable on real tasks.
Qwen3 or DeepSeek R1 aren't even 1/10 as good as Gemini 2.5 Pro.
If the model trusts the user, and the user is dumb, the model will weigh the user's input much more highly and end up with flawed code.
If the model is more independent, it will find the right solution. If you just want a dumb model that says yes to everything and follows you even when you're not smart enough, you'll never end up with a good solution except by luck.
While other models like Qwen3 and GLM promise a lot, in real code writing they fail badly and get stuck in loops.
The only problem I run into with Gemini right now is that I get throttled every now and then with an empty response, especially around this time.
https://www.ioccc.org/2019/mills/index.html
I suppose if you only have 5 minutes this is probably about the level you'd get.
$5599.00 https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
Although you can get one with lower specs and the same GPU for $3,899.99.
https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
[0] https://www.digitaltrends.com/computing/laptop-gpu-power-lim... [1] https://coconote.app/notes/4c75b7a0-eb41-435d-85ee-55ae2dd8d...
Bond has only minutes to train a strong enough AI model to pretend to be him and fool his targets long enough for him to gain entry to their impregnable fortress. Can he do it?!?
Start blank with no corporate-controlled/crippled state and just become me.
In fact, that might be the only way to let computers appear to grow faster into the future, even if their internal hardware only gets minor incremental improvements: Have your shit done before you sit down to do it.
I want to be able to say to a large general purpose LLM: “write a script that trains a model that is optimized for <useful task>” and then run that model.
Edit: well gosh darn. Within the edit window for this comment, Google goes and launches Gemma 3 270M.
If only we had a technology that didn't hallucinate and instead reported "I don't know". Then small models would be far more useful. Part of the need for insanely huge LLMs is to get coverage so broad that they don't have to make things up.
It would be nice to be able to train a customer service bot on a laptop in a reasonable length of time. But it will screw up badly outside its area of competence, which will happen frequently.
Sure they still have massive problems with hallucination, but this article doesn’t give us any more insight into that I don’t think!
I think the point of most (frontier) small models is usually to provide the best answer possible given small inference resources, rather than to reduce training time.
This is more of a toy model, so fun and an interesting project but it doesn't necessarily tell us what the art of the possible is for small models.
For sure! If the RAG context includes "Raleigh is the capital city of the U.S. state of North Carolina" somewhere in whatever you feed it, one would hope that you'd get an accurate answer to that question.
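The "whatever you feed it" part really is just prompt assembly. A minimal sketch, with retrieval itself elided (build_rag_prompt is a made-up helper name for illustration):

    # Retrieved snippets are simply pasted into the prompt ahead of the question;
    # embeddings and vector search are out of scope here.
    def build_rag_prompt(question, snippets):
        context = "\n".join(f"- {s}" for s in snippets)
        return (
            "Answer using only the context below. If the answer is not there, say so.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )

    prompt = build_rag_prompt(
        "What is the capital of North Carolina?",
        ["Raleigh is the capital city of the U.S. state of North Carolina."],
    )
    print(prompt)  # pass this to whatever small local model you're testing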
The OP fits the bill.
If you can suggest other such exercises, please share in reply to this post.
Thank you.
https://m.youtube.com/shorts/4qN17uCN2Pg
"You're absolutely right!"