Vision Now Available in Llama.cpp

https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md
163•redman25•4h ago

Comments

simonw•3h ago
This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...
gryfft•3h ago
Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.
LPisGood•3h ago
The “global economy in three months” bit is writing some checks that I'm not sure the entire recent AI craze has been able to cash in three years.
ijustlovemath•2h ago
AI is fundamentally learning the entire conditional probability distribution of our collective knowledge; but sampling it over and over is not going to fundamentally enhance it, except to, perhaps, reinforce a mean, or surface places we have insufficiently sampled. For me, even the deep research agents aren't the best when it comes to surfacing truth, because the nuance of that is lost on the distribution.

I think that if we're realistic with ourselves, AI will become exponentially more expensive to train, but without additional high quality data (not you, synthetic data), we're back to 1980s era AI (expert systems), just with enhanced fossil fuel usage to keep up with the TPUs. What's old is new again, I suppose!

I sincerely hope to be proven wrong, of course, but I think recent AI innovation has stagnated in terms of new things it can do. It's a great tool when you use it to leverage that distribution (e.g., semantic search), but it might not fundamentally be the approach to AGI (unless your goal is to replicate what we already can, but less spiky).

MoonGhost•2h ago
It's not as simple as a stochastic parrot. Starting from definitions and axioms, all theorems can be invented and proved. That's in theory, without having the theorems in the training set. That's what thinking models should be able to do without additional training or data.

In other words, the way forward seems to be to put models in loops, which includes internal 'thinking' and external feedback. Make them use newly generated and acquired data, lossily compress that data periodically, and we have another race of algorithms.

gryfft•2h ago
It doesn't have to be AGI to have a major economic impact. It just has to beat enough extant CAPTCHA implementations.
nico•3h ago
How does this compare to using a multimodal model like gemma3 via ollama?

Any benefit on a Mac with apple silicon? Any experiences someone could share?

ngxson•1h ago
Two things:

1. Because the support in llama.cpp is horizontally integrated within the ggml ecosystem, we can optimize it to run even faster than ollama.

For example, the Pixtral/Mistral Small 3.1 models have a 2D-RoPE trick that uses less memory than ollama's implementation. The same goes for flash attention (which will be added very soon); it will allow the vision encoder to run faster while using less memory.

2. llama.cpp simply supports more models than ollama. For example, ollama supports neither Pixtral nor SmolVLM.

danielhanchen•1h ago
By the way - fantastic work again on llama.cpp vision support - keep it up!!
ngxson•1h ago
Thanks Daniel! Kudos for your great work on quantization, I use the Mistral Small IQ2_M from unsloth during development and it works very well!!
danielhanchen•55m ago
:)) I did have to update the chat template for Mistral - I did see your PR in llama.cpp for it - confusingly, the tokenizer_config.json file doesn't have a chat_template; it's in chat_template.jinja instead - I had to move the chat template into tokenizer_config.json, but I guess now with your fix it's fine :)
ngxson•47m ago
Ohhh nice to know! I was pretty sure that someone had already tried to fix the chat template haha, but because we also allow users to freely create their quants via the GGUF-my-repo space, I have to fix the quants produced from that source.
danielhanchen•25m ago
Glad it all works now!
roger_•22m ago
Won’t the changes eventually be added to ollama? I thought it was based on llama.cpp
behnamoh•2h ago
Didn't llama.cpp use to have vision support a year or so ago?
danielhanchen•2h ago
Yes, they always did, but they moved it all under one umbrella called "llama-mtmd-cli"!
breput•1h ago
Yes, but this is generalized so it was able to be added to the llama-server GUI as well.
danielhanchen•2h ago
It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support - literally run:

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load the image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is no longer needed for the Metal backend (CUDA still needs it); llama.cpp will auto-offload to the GPU by default. -1 means all layers are offloaded to the GPU.

danielhanchen•2h ago
If it helps, I updated https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-t... to show you can use llama-mtmd-cli directly - it should work for Mistral Small as well
thenameless7741•1h ago
If you install llama.cpp via Homebrew, llama-mtmd-cli is already included. So you can simply run `llama-mtmd-cli <args>`
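
For reference, a minimal sequence would be something like this (a sketch; it assumes the current formula already bundles the multimodal tools and reuses one of the quants mentioned above):

  brew install llama.cpp      # or: brew upgrade llama.cpp
  llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL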
danielhanchen•1h ago
Oh even better!!
raffraffraff•1h ago
I can't see the letters "ngl" anymore without wanting to punch something.
danielhanchen•1h ago
Oh it's shorthand for number of layers to offload to the GPU for faster inference :) but yes it's probs not the best abbreviation.
blowsand•1h ago
frfr
NetOpWibby•1h ago
no cap
ozzmotik•36m ago
on GOD
danielhanchen•26m ago
no cnv
banana_giraffe•2h ago
I used this to create keywords and descriptions on a bunch of photos from a trip recently, using Gemma3 4b. Works impressively well, including doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.

Very nice for something that's self hosted.

accrual•2h ago
That's pretty neat. Do you essentially loop over a list of images and run the prompt for each, then store the result somewhere (metadata, sqlite)?
banana_giraffe•1h ago
Yep, exactly - just looped through each image with the same prompt and stored the results in a SQLite database, to search through and maybe present in more than a simple WebUI in the future (there's a rough sketch of the loop at the end of this comment).

If you want to see, here it is:

https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...

Here's an example of the kind of detail it built up for me for one image:

https://imgur.com/a/6jpISbk

It's wrapped up in a bunch of POC code around talking to LLMs, so it's very very messy, but it does work. Probably will even work for someone that's not me.
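
For illustration, here's a much-stripped-down shell sketch of the same idea (not the gist above; it assumes llama-mtmd-cli accepts the one-shot --image/-p flags that the old llava-cli had, and uses the sqlite3 CLI for storage):

  # Describe every JPEG in a folder and store the text in SQLite (rough sketch).
  sqlite3 photos.db 'CREATE TABLE IF NOT EXISTS captions (path TEXT PRIMARY KEY, caption TEXT);'
  for img in photos/*.jpg; do
    caption=$(./llama.cpp/llama-mtmd-cli \
        -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
        --image "$img" \
        -p "List keywords and a one-sentence description for this photo." 2>/dev/null)
    caption=${caption//\'/\'\'}   # escape single quotes for the SQL literal
    sqlite3 photos.db "INSERT OR REPLACE INTO captions VALUES ('$img', '$caption');"
  done

Reloading the model for every image like this is slow; keeping a llama-server instance running and hitting its API inside the loop would be faster, but the structure is the same.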

wisdomseaker•1h ago
Nice! How complicated do you think it would be to do summaries of all photos in a folder, ie say for a collection of holiday photos or after an event where images are grouped?
banana_giraffe•59m ago
Very simple. You could either do what I did, and ask for details on each image, then ask for some sort of summary of the group of summaries, or just throw all the images in one go:

https://imgur.com/a/1IrCR97

I'm sure there's a context limit if you have enough images, where you need to start map-reducing things, but even that wouldn't be too hard.
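
Continuing the sketch from my earlier comment, the "summary of summaries" step is just a text-only call - no images needed (the table name and quant are carried over from that sketch):

  # Collect the per-image captions and ask for one overall summary (sketch).
  sqlite3 photos.db 'SELECT caption FROM captions;' > all_captions.txt
  ./llama.cpp/llama-cli \
      -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
      -p "Summarize this trip based on these photo descriptions: $(cat all_captions.txt)"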

wisdomseaker•43m ago
Thanks for the reply, I'll see if I can work it out :)
nurettin•1h ago
Didn't we already have vision via llava?
gitroom•1h ago
Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?
buyucu•1h ago
It was really sad when vision was removed a while back. It's great to see it restored. Many thanks to everyone involved!
simonw•1h ago
llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

  unzip llama-b5332-bin-macos-arm64.zip
  cd build/bin
  sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):

  ./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Or start the localhost 8080 web server (with a UI and API) like this:

  ./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
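
To exercise the API rather than the web UI, something like this should work (a sketch; it assumes the OpenAI-compatible /v1/chat/completions endpoint accepts image_url content parts with a base64 data URI):

  # Sketch: ask the running llama-server about a local image over its
  # OpenAI-compatible chat completions API.
  B64=$(base64 < photo.jpg | tr -d '\n')
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "messages": [{
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$B64"'"}}
            ]
          }]
        }'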
ngxson•1h ago
For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.

Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!

ngxson•1h ago
And btw, -ngl is automatically set to the max value now, so you don't need -ngl 99 anymore!

Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl

danielhanchen•1h ago
OH WHAT! So just -ngl? Oh also do you know if it's possible to auto do 1 GPU then the next (ie sequential) - I have to manually set --device CUDA0 for smallish models, and probs distributing it amongst say all GPUs causes communication overhead!
ngxson•1h ago
Ah no, I mean we can omit the whole "-ngl N" argument for now, as it is internally set to -1 by default in the C++ code (instead of 0, as it traditionally was), and -1 means offload everything to the GPU.

I have no idea how to specify custom layer specs with multi GPU, but that is interesting!

danielhanchen•52m ago
WAIT, so GPU offloading is on by DEFAULT? Oh my, fantastic! For now I have to "guess" via a Python script - i.e. I sum up the file sizes of all the .gguf split files, then detect CUDA memory usage, and specify approximately how many GPUs to use, i.e. --device CUDA0,CUDA1 etc.
ngxson•41m ago
Ahhh no, sorry, I forgot that the actual code controlling this is inside llama-model.cpp; sorry for the misinfo - -ngl is only set to max by default if you're using the Metal backend.

(See the code inside llama_model_default_params())

danielhanchen•25m ago
Oh no worries! I re-edited my comment to account for it :)
danielhanchen•1h ago
I'm also extremely pleased with convert_hf_to_gguf.py --mmproj - it makes quant making much simpler for any vision model!
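
For anyone curious what that looks like, roughly (a sketch; apart from --mmproj, the flags and paths here are assumptions on my part):

  # Sketch: produce the text-model GGUF and the vision projector GGUF
  # from one Hugging Face checkpoint directory (paths are placeholders).
  python convert_hf_to_gguf.py ./gemma-3-4b-it --outfile gemma-3-4b-it-F16.gguf
  python convert_hf_to_gguf.py ./gemma-3-4b-it --mmproj --outfile mmproj-gemma-3-4b-it-F16.gguf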

Llama-server allowing vision support is definitely super cool - was waiting for it for a while!

ngxson•58m ago
We also support the SmolVLM series, which delivers lightning-fast responses thanks to its mini size!

This is perfect for a real-time home video surveillance system - that's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
dust42•53m ago
To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get

  25t/s prompt processing 
  63t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B already has very decent output, describing different images pretty well.

Steps to reproduce:

  git clone https://github.com/ggml-org/llama.cpp.git
  cmake -B build
  cmake --build build --config Release -j 12 --clean-first
  # download model and mmproj files...
  build/bin/llama-server \
    --model gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf
Then open http://127.0.0.1:8080/ for the web interface

Note: if you are not using -hf, you must include the --mmproj switch, otherwise the web interface gives an error message that the model does not support multimodal input.

I used the official ggml-org/gemma-3-4b-it-GGUF quants; I expect the unsloth quants from danielhanchen to be a bit faster.

mrs6969•33m ago
So image processing is there, but image generation isn't?

Just trying to understand - awesome work so far.

bsaul•8m ago
Great news! Side note: does vision include the ability to read a PDF?
