A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

https://point.free/blog/gemma-4-on-a-2016-xeon/
26•cafkafk•2h ago

Comments

cafkafk•2h ago
Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing them and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention; see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer them on a best-effort basis.

fragmede•1h ago
(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

cafkafk•1h ago
> (purple on black is really hard to read)

Noted, and agreed (it looks like it has also already been clicked, which I dislike). I honestly need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said, running it like this:

llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.
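
For anyone who wants to check the figure, here is a minimal sketch that recomputes tokens per second from the `eval time` line of the log above. The helper name `tokens_per_second` and the regex are illustrative assumptions, not part of llama.cpp or the ik_llama fork; it only relies on the `= <ms> ms / <count> runs` layout visible in the output.

```python
import re

# The "eval time" line copied verbatim from the log above.
EVAL_LINE = (
    "llama_print_timings:        eval time =   10639.36 ms /   127 runs   "
    "(   83.77 ms per token,    11.94 tokens per second)"
)


def tokens_per_second(line: str) -> float:
    """Recompute throughput from a llama_print_timings line.

    Hypothetical helper: assumes the '= <ms> ms / <count> runs|tokens' layout
    shown in the log above.
    """
    match = re.search(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:runs|tokens)", line)
    if match is None:
        raise ValueError("line does not look like a llama_print_timings entry")
    elapsed_ms = float(match.group(1))
    count = int(match.group(2))
    # Convert milliseconds to seconds, then divide count by elapsed time.
    return count / (elapsed_ms / 1000.0)


print(f"{tokens_per_second(EVAL_LINE):.2f} tokens per second")
```

The same helper works on the `prompt eval time` line, since that one reports `tokens` instead of `runs`.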

When I do it properly, I'll add it to the blog as well!

Eonexus•1h ago
I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

cafkafk•1h ago
That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server.

Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
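
As a rough sketch of where the "5 to 6 tokens per second" figure comes from: convert words per minute to words per second, then multiply by an assumed tokens-per-word ratio. The ratio of about 1.3 tokens per English word is a common rule of thumb and an assumption here, not something measured with the Gemma tokenizer; `reading_speed_tps` is a made-up helper name.

```python
def reading_speed_tps(words_per_minute: float, tokens_per_word: float) -> float:
    """Convert a reading speed in words per minute to tokens per second."""
    words_per_second = words_per_minute / 60.0
    return words_per_second * tokens_per_word


# Assumed ratios; the real value depends on the tokenizer and the text.
for ratio in (1.2, 1.3, 1.4):
    tps = reading_speed_tps(250, ratio)
    print(f"{ratio} tokens/word -> {tps:.2f} tokens per second")
```

Under those assumptions, the 11.94 tokens per second measured under load is roughly twice a 250 wpm reading pace.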

Eonexus•1h ago
Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.

potus_kushner•1h ago
@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks?

cafkafk•1h ago
Honestly, at this point you're probably looking at a smaller model. For the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have an RTX 4060 M and 96GB RAM).

So you'd change the invocation slightly, but you can potentially reuse a lot of it.

That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.

christkv•48m ago
Makes you wonder if it's possible to squeeze more tps out of a Strix Halo system using the 16 Zen 5 cores as well as the GPU.

asimovDev•21m ago
I have an ancient DDR3 Xeon that doesn't support any AVX (dual X5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?

tgtweak•18m ago
It may work - depending on your RAM speeds it might not even be that much slower.

cafkafk•18m ago
Loading will take some minutes, but at 96 GB you can squeeze the model in and still have around 10 GB of headroom, although depending on the Xeon, you may have to downgrade to E4B instead. Should still work though.

qwertox•3m ago
CPU (2012)

  Model name:                Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Mainboard

  Product Name: P8Z77 WS
GPU

  05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
  05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
Memory: 32GB

This works.

vhaudiquet•4m ago
The E5-2620 v4 only supports DDR4.
