MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU

https://arxiv.org/abs/2604.05091
97•chrsw•2h ago

Comments

internetguy•1h ago
> MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state

This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.

I have a lot more CPU RAM in my PC, and this would likely increase the size of models I can train locally.
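
The streaming idea in the quoted abstract can be illustrated with a toy simulation. This is only a sketch under loud assumptions: there is no GPU or framework here, "layers" are plain lists of scalar weights, and `forward_streaming` is a hypothetical name, not anything from the paper. It shows only that the peak amount of weights resident on the "device" is one layer, while the full model stays in "host" memory.

```python
# Toy simulation of layer-wise streaming (hypothetical names throughout).
# "Host" memory holds every layer; the "device" holds one layer at a time.

def forward_streaming(host_layers, x):
    """Run a chain of elementwise layers, loading one layer's weights at a time."""
    peak_resident = 0
    for weights in host_layers:              # weights stay in host memory
        device_copy = list(weights)          # stand-in for a host-to-device copy
        peak_resident = max(peak_resident, len(device_copy))
        # stand-in for the layer's compute on the device
        x = [w * xi for w, xi in zip(device_copy, x)]
        del device_copy                      # free the device copy before the next layer
    return x, peak_resident

host_layers = [[2.0, 2.0], [0.5, 3.0], [1.0, -1.0]]
out, peak = forward_streaming(host_layers, [1.0, 1.0])
total = sum(len(w) for w in host_layers)
print(out, peak, total)
```

The backward pass would walk the layers in reverse and send gradients back to the host in the same fashion; that part is omitted here.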

weitendorf•43m ago
To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complementary parts of the system in a way that's proportionate to the capabilities of the hardware.

In the past couple of months there's been a kind of explosion in small models that are occupying a niche in this kind of AI-transcoding space. What I'm hoping we're right on the cusp of achieving is a similar explosion in what I'd call tool-adaptation, where an LLM paired with some mostly-fixed suite of tools and problem cases can trade off some generality for a specialized (potentially hyper-specialized to the company or user) role.

The thing about more transcoding-related tasks is that they generally stay in sync with what the user of the device is actively doing, which will also typically be closely aligned with the capabilities of the user's hardware and what they want to do with their computer. So most people aren't being intentional about this kind of stuff right now, partly out of habit I think, because only now does it make sense to think of personal computers as "stranded hardware", since they can be steered/programmed somewhat autonomously.

I'm wondering if, with the right approach to MoE on local devices (which local LLMs are heading towards), we could basically amortize the expensive hit from loading weights in and out of VRAM through some kind of extreme batch use case that users still find useful enough to be worth the latency. LoRA is already really useful for this, but obviously sometimes you need more expertise/specialization than just a few layers' difference. Experimenting with this right now. It's the same basic principle as in the paper, except less of a technical optimization and more of a workload optimization. Also, it's literally the beginning of machine culture, so that's kind of cool.
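
The amortization argument is simple arithmetic, sketched below. The function name and all numbers are hypothetical, chosen only to show the shape of the trade-off: one weight load is paid once per batch, so its cost per request shrinks as the batch grows.

```python
def per_request_latency(load_s, compute_s, batch_size):
    """Average seconds per request when one weight load serves a whole batch."""
    return (load_s + compute_s * batch_size) / batch_size

# Hypothetical numbers: 4 s to swap weights into VRAM, 0.5 s of compute per request.
for batch in (1, 8, 64):
    print(batch, per_request_latency(4.0, 0.5, batch))
```

As the batch grows, the average approaches the pure compute cost, at the price of each request waiting for the whole batch.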

giancarlostoro•37m ago
> This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.

I'm in the same boat; it's intimidating to me whether I even want to bother training anything at all. Do you mind sharing what kind of training you've done with that GPU? :)

hirako2000•19m ago
The article's claims assume far more compute and far more VRAM. While the trick enables less back and forth, it doesn't eliminate it.

I doubt you meant 50M. Rather 50B?

You can only give it a try, but don't get your hopes up for a large context. If their technique works, I would guess an 8192-token context would still OOM. 2048, maybe.

I'm extrapolating from my own experiments, which didn't use this paper's trick of leveraging system memory.

olliepro•1h ago
This would likely only get used for small finetuning jobs. It’s too slow for the scale of pretraining.
onion2k•1h ago
> It’s too slow for the scale of pretraining.

There isn't really such a thing as 'too slow' as an objective fact though. It depends on how much patience and money for electricity you have. In AI image gen circles I see people complaining if a model takes more than 5s to generate an image, and other people on very limited hardware who happily wait half an hour per image. It's hard to make a judgement call about what 'too slow' means. It's quite subjective.

jandrese•1h ago
If it would take so long to train that the model will be obsolete before the training is finished, that might be considered too long. With ML you can definitely hit a point where it is too slow for any practical purpose.
ismailmaj•54m ago
Obsolete because of what? Because with limited hardware you’re never aiming for state of the art, and for fine-tuning, you don’t steer for too long anyway.
jandrese•43m ago
Because there is a new model that is better, faster, more refined, etc...

If your training time is measured in years or decades it probably won't be practical.

jwilber•20m ago
That’s just playing semantics. Nobody is talking about “objective facts” or needs to define them here. If the step time is measured in days, and your model takes years to train, then it will never get trained to completion on consumer hardware (the entire point).
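
To put rough numbers on "too slow", wall-clock time is just the token budget divided by throughput. In the sketch below, 341 tok/s is the figure a commenter in this thread cites for a 14B model on a single 3090; the 1-billion-token budget and the function name are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_tokens, tokens_per_second):
    """Wall-clock days needed to process total_tokens at a fixed throughput."""
    return total_tokens / tokens_per_second / SECONDS_PER_DAY

# 341 tok/s: figure cited in the thread. 1e9 tokens: hypothetical budget.
days = training_days(1e9, 341)
print(round(days, 1))
```

At that rate a small fine-tuning budget is a matter of weeks, while a pretraining-scale budget in the trillions of tokens would scale the same figure up by a factor of thousands.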
greenavocado•1h ago
So distribute copies of the model in RAM to multiple machines, have each machine update different parts of the model weights, and sync updates over the network
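
A minimal single-process sketch of that idea, assuming plain SGD and contiguous shards: each simulated worker updates only the slice of parameters it owns, and the slices are then concatenated as a stand-in for syncing over the network. `sharded_sgd_step` is a hypothetical name, and no real networking or distributed library is involved.

```python
def sharded_sgd_step(params, grads, num_workers, lr):
    """Each worker updates its own contiguous slice; slices are then merged."""
    n = len(params)
    shard = -(-n // num_workers)  # ceiling division: slice size per worker
    updated_slices = []
    for rank in range(num_workers):
        lo, hi = rank * shard, min((rank + 1) * shard, n)
        # worker `rank` owns params[lo:hi] and touches nothing else
        updated_slices.append(
            [p - lr * g for p, g in zip(params[lo:hi], grads[lo:hi])]
        )
    # stand-in for the network sync: every worker receives every slice
    return [p for s in updated_slices for p in s]

params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [1.0, 1.0, 1.0, 1.0, 1.0]
print(sharded_sgd_step(params, grads, num_workers=2, lr=0.5))
```

The result matches an unsharded SGD step; what changes is that each worker only needs optimizer state for its own slice.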
l1n•1h ago
Seems similar to Microsoft DeepSpeed.
bee_rider•11m ago
They compare against “DeepSpeed ZeRO-3”, apparently.
WithinReason•1h ago
I was wondering how well this would work :) You can definitely push this further, the question is: how well can the gradients and updates compress?
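
One common way to probe that question is top-k sparsification: send only the k largest-magnitude gradient entries as (index, value) pairs and treat the rest as zero. This is a swapped-in illustration, not something the paper or the comment proposes, and the function names are hypothetical.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries as sorted (index, value) pairs."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return sorted((i, grad[i]) for i in idx)

def densify(pairs, n):
    """Rebuild a dense gradient of length n, filling dropped entries with zero."""
    out = [0.0] * n
    for i, v in pairs:
        out[i] = v
    return out

grad = [0.01, -2.0, 0.3, 0.0, 1.5, -0.02]
pairs = top_k_sparsify(grad, 2)
restored = densify(pairs, len(grad))
print(pairs, restored)
```

How much accuracy survives the dropped entries depends on the model and the training setup, which is exactly the open question.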
1aurent29•51m ago
sounds very similar to https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_... i wonder how much this could be replicated using only this pytorch primitive
ilaksh•28m ago
How long would it actually take to train a 120B model on an H200? What if you have 8?
kouteiheika•12m ago
This isn't really anything new; I've been doing something like this for quite a while, I just haven't bothered writing a paper. (: Probably anyone who would seriously tackle the problem of "how do I train a huge model on a tiny amount of VRAM?" would come up with something similar.

However, most people in the field don't, because the actual practical utility of training huge models on a single GPU is quite low. (e.g. they got 341 tok/s for a 14B model on a single 3090, while with my method I was getting ~1k tok/s on a single 4090; that's still very slow)

Also, there are more tricks one can use to speed up training/lower VRAM usage which they're not using. For example, you don't need any gradient offloading (you can just accumulate the gradients directly into the optimizer's states if you modify your optimizer), you can use Muon instead of Adam (which needs only half the VRAM of Adam), you can use quantization (both for parameters and for the optimizer states; e.g. I found Muon quantized into 4-bit working relatively well), etc.
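
The memory side of those tricks can be sketched as back-of-the-envelope arithmetic. The assumptions are stated in the code: fp32 weights at 4 bytes each, Adam keeping two state buffers per parameter, a single momentum buffer for the Muon-style case (consistent with the "half" claim above), and 0.5 bytes per value for 4-bit state. Gradients, activations and quantization overhead are ignored, and the function name is hypothetical.

```python
def training_state_gib(num_params, weight_bytes, state_buffers, state_bytes):
    """GiB for weights plus optimizer state (gradients and activations excluded)."""
    total = num_params * (weight_bytes + state_buffers * state_bytes)
    return total / 2**30

n = 14e9  # 14B parameters, the model size mentioned above

weights_only = training_state_gib(n, 4, 0, 4)
adam_fp32 = training_state_gib(n, 4, 2, 4)         # assumption: two fp32 buffers
one_buffer_fp32 = training_state_gib(n, 4, 1, 4)   # assumption: one fp32 momentum buffer
one_buffer_4bit = training_state_gib(n, 4, 1, 0.5) # assumption: same buffer at 4 bits

for name, gib in [("adam fp32", adam_fp32),
                  ("one buffer fp32", one_buffer_fp32),
                  ("one buffer 4-bit", one_buffer_4bit)]:
    print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions the optimizer state alone halves when going from two buffers to one, and shrinks by a further factor of eight when that buffer is stored in 4 bits.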
