What's in a GGUF, besides the weights – and what's still missing?

https://nobodywho.ooo/posts/whats-in-a-gguf/

31•bashbjorn•2h ago

Comments

ge96•1h ago

Nice, I recently pulled down TheBloke 7B mistral to try out I have a 4070.

ganelonhb•1h ago

I have a 2070 and can confirm it works amazingly fast.

I love TheBloke I wish he still made stuff

ge96•1h ago

What do you use it for? I'm still trying to use agents, I barely use copilot, only at work when I have to.

I didn't want to get personal with an LLM unless it was local so that's why I was setting this up but yeah. So far just research is what I was looking at.

bashbjorn•1h ago

Yeah, TheBloke era of local LLMs were good times. TBF Unsloth are doing a fantastic job of publishing quants of the major models quickly - they just don't have nearly the volume of "weird" models as TheBloke did.

bashbjorn•1h ago

I love mistral, but that model is... not the best. Maybe try out Gemma 4 e4b, it's a similar size to Mistral 7B, and should run great on your 4070 ("E4B" is slightly misleading naming).

ge96•1h ago

Thanks for the tip, what do you use Gemma 4 e4b for?

redanddead•57m ago

some say it’s a miniaturized gemini model

it’s good at writing, coding, decently intelligent

you can try it on nvidia nim

mixtureoftakes•25m ago

7b mistral is quite outdated. On a 12gb 4070 you can run qwen 3.5 9b q4km or qwen 3.6 35b, the latter will be a lot smarter but also a lot slower due to ram offload.

Try both in lm studio, they really are surprisingly capable

ge96•7m ago

I have 80gb of ram but it's slow capped by i9 CPU or specific asus mobo sucks I think only 2400mhz despite being ddr4

Tried all the stuff bios, volting

kenreidwilson•1h ago

>Published May 18, 2026

hmmm...

bashbjorn•1h ago

whoops, my bad. Just a typo in the markdown. Fixed :)

badsectoracula•38m ago

> not to be confused with the somewhat baffling llama_chat_apply_template exposed in the libllama API, which hardcodes a handful of chat formats directly in C++

As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or there was another C function that did that since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this adhocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).

But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first download a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't do model and took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P

[0] https://i.imgur.com/GiTBE1j.png

Removing the modem and GPS from my 2024 RAV4 hybrid

RTX 5090 and M4 MacBook Air: Can It Game?

New Nginx Exploit

First public macOS kernel memory corruption exploit on Apple M5

The AI Zombification of Universities

The Power of a Free Popsicle (2018)

WinUI 3 Performance: A Leap Forward

HDD Firmware Hacking

A message from President Kornbluth about funding and the talent pipeline

Computer Hobby Movement in Canada

Understanding the Linux Kernel: The Linux Kernel Startup

Terranox AI (YC W26) Is Hiring a Founding AI/ML Engineer and Summer AI/ML Intern

AI is making me dumb

Int a = 5; a = a++ + ++a; a =? (2011)

Germany's Sovereign Tech Fund Backs KDE with €1.3M

You Don't Align an AI, You Align with It

Fossils show millipede and centipede ancestors evolved legs underwater

What's in a GGUF, besides the weights – and what's still missing?

German intelligence offices snub Palantir software

The conflation of money and things

60fps Video on a CGA? – The GlyphBlaster

EditLens: Quantifying the extent of AI editing in text (2025)

DIY open-source ultrasound hardware on the rp2040/rp2350

Rewrite Bun in Rust has been merged

Show HN: Running the second public ODoH relay

Green Card Holders Targeted for Deportation by New 'Removal Apparatus'

Myths about /dev/urandom (2014)

Leaving the Physical World

The Tree House: A voyage to the source of a backyard dream

OpenData Vector: MIT-Licensed Vector Search on Object Storage

What's in a GGUF, besides the weights – and what's still missing?

Comments

Removing the modem and GPS from my 2024 RAV4 hybrid

RTX 5090 and M4 MacBook Air: Can It Game?

New Nginx Exploit

First public macOS kernel memory corruption exploit on Apple M5

The AI Zombification of Universities

The Power of a Free Popsicle (2018)

WinUI 3 Performance: A Leap Forward

HDD Firmware Hacking

A message from President Kornbluth about funding and the talent pipeline

Computer Hobby Movement in Canada

Understanding the Linux Kernel: The Linux Kernel Startup

Terranox AI (YC W26) Is Hiring a Founding AI/ML Engineer and Summer AI/ML Intern

AI is making me dumb

Int a = 5; a = a++ + ++a; a =? (2011)

Germany's Sovereign Tech Fund Backs KDE with €1.3M

You Don't Align an AI, You Align with It

Fossils show millipede and centipede ancestors evolved legs underwater

What's in a GGUF, besides the weights – and what's still missing?

German intelligence offices snub Palantir software

The conflation of money and things

60fps Video on a CGA? – The GlyphBlaster

EditLens: Quantifying the extent of AI editing in text (2025)

DIY open-source ultrasound hardware on the rp2040/rp2350

Rewrite Bun in Rust has been merged

Show HN: Running the second public ODoH relay

Green Card Holders Targeted for Deportation by New 'Removal Apparatus'

Myths about /dev/urandom (2014)

Leaving the Physical World

The Tree House: A voyage to the source of a backyard dream

OpenData Vector: MIT-Licensed Vector Search on Object Storage