A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

https://point.free/blog/gemma-4-on-a-2016-xeon/

5•cafkafk•56m ago

Comments

cafkafk•53m ago

Hi HN. I wrote this post after getting frustrated by how few ways there are to run the new Gemma 4 Drafter models, and by mainstream tools not prioritizing this while hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama.cpp fork I mention; see the other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache. If you have any questions, I can try to answer on a best-effort basis.

fragmede•28m ago

(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

cafkafk•3m ago

> (purple on black is really hard to read)

Noted, and agreed (it also looks like the link has already been clicked, which I dislike). Honestly, I need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes, I think so, but I didn't write it down for some reason... so I'll have to find a time when the server isn't busy and do it again on a quiet system. Right now the system is noisy, but that said, running it like this:

  llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens

So 11.94 tokens per second while it's also playing binary cache and CI builder.
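As a rough sanity check on that figure, here is a minimal Python sketch that recomputes the rate from the eval time line above and converts it into words per minute. The helper names are made up for illustration, and the 0.75 words-per-token ratio is only a common rule of thumb for English text, not something measured on this model or tokenizer.

```python
import re

# The "eval time" line copied from the timings output above.
TIMING_LINE = (
    "llama_print_timings:        eval time =   10639.36 ms /   127 runs   "
    "(   83.77 ms per token,    11.94 tokens per second)"
)

# Assumption: roughly 0.75 English words per token (rule of thumb only).
WORDS_PER_TOKEN = 0.75


def parse_timing(line):
    """Return (elapsed_ms, n_tokens) from a llama_print_timings line."""
    match = re.search(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:runs|tokens)", line)
    if match is None:
        raise ValueError("not a recognised timing line")
    return float(match.group(1)), int(match.group(2))


def tokens_per_second(elapsed_ms, n_tokens):
    """Tokens generated per second of wall-clock eval time."""
    return n_tokens / (elapsed_ms / 1000.0)


def words_per_minute(tps, words_per_token=WORDS_PER_TOKEN):
    """Approximate reading-rate equivalent of a token rate."""
    return tps * words_per_token * 60.0


if __name__ == "__main__":
    elapsed_ms, n_tokens = parse_timing(TIMING_LINE)
    tps = tokens_per_second(elapsed_ms, n_tokens)
    print(f"{tps:.2f} tokens/s, roughly {words_per_minute(tps):.0f} words/min")
```

Under that assumed ratio, the rate works out to several hundred words per minute, which is at least consistent with calling it "reading speed", though the real ratio depends on the tokenizer and the text.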

When I do it properly, I'll add it to the blog as well!

Eonexus•17m ago

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

potus_kushner•15m ago

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple of GB free for other tasks?
