I do think you should disclose that Localforge is your own project though.
Someone in another comment on this post mentioned using one of the micro models (Qwen 0.6B I think?) and having decent results. Maybe you can try that and then progressively move upwards?
EDIT: “Queen” -> “Qwen”
I have 64GB and I can only just fit a bunch of Firefox and VS Code windows at the same time as running a 27B model.
The vast majority of people run between 4-bit and 6-bit quantization, depending on system capability. The extra accuracy above 6-bit tends not to be worth it relative to the performance hit.
Seems like a great way to roll out AI to a medium-sized team: a very small team can coordinate access to the best available tools so the entire team doesn't need to keep pace at the current breakneck speed.
ollama-mcp-bridge: A TypeScript implementation that "connects local LLMs (via Ollama) to Model Context Protocol (MCP) servers. This bridge allows open-source models to use the same tools and capabilities as Claude, enabling powerful local AI assistants"
simple-mcp-ollama-bridge: A more lightweight bridge connecting "Model Context Protocol (MCP) servers to OpenAI-compatible LLMs like Ollama"
rawveg/ollama-mcp: "An MCP server for Ollama that enables seamless integration between Ollama's local LLM models and MCP-compatible applications like Claude Desktop"
How you'd route would be an interesting challenge; presumably you could just tell it to use the MCP for certain tasks, thereby offloading them to the local model.
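To make the routing idea concrete, here's a minimal sketch that skips MCP entirely and just picks between a local Ollama model and a hosted one, based on a purely hypothetical keyword rule; the rule and both model names are placeholders, not a recommendation:

```python
# Toy router: "offloadable" prompts go to a local Ollama model, the rest to a hosted API.
# The keyword rule and both model names are illustrative placeholders.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama's OpenAI-compatible API
remote = OpenAI()  # reads OPENAI_API_KEY from the environment

LOCAL_HINTS = ("summarize", "rename", "reformat", "translate")

def route(prompt: str) -> str:
    """Send cheap, local-friendly tasks to the local model, everything else upstream."""
    use_local = any(hint in prompt.lower() for hint in LOCAL_HINTS)
    client, model = (local, "qwen3:30b") if use_local else (remote, "gpt-4o")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("Summarize this changelog: ..."))
```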
I started by giving it a reference Python MCP server and asking it to modify the code to do that. Now I have 3-4 tools that give me reproducible results.
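For anyone curious what the starting point looks like, this is roughly the shape of a reference Python MCP server using the official SDK's FastMCP helper; the tools below are placeholders, not the ones described above:

```python
# Minimal MCP server sketch (official Python SDK, FastMCP). Tool names and behaviour are placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count whitespace-separated words in the given text."""
    return len(text.split())

@mcp.tool()
def read_file(path: str) -> str:
    """Return the contents of a text file on disk."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    # Serve over stdio so an MCP-capable client (or one of the bridges above) can launch it.
    mcp.run()
```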
Why isn't using localforge enough as it ties into models?
What was even more impressive is the 0.6B model, which makes sub-1B models actually useful for non-trivial tasks.
Overall very impressed. I am evaluating how it can integrate with my current setup and will probably report somewhere about that.
I'm running Q4 and it's taking 17.94 GB of VRAM with a 4k context window, and 20 GB with 32k tokens.
As a Python person I've found uv + MLX to be pretty painless on a Mac too.
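For anyone who hasn't tried it, the whole flow is roughly this; the model id is just an example of an mlx-community quant, swap in whatever you've actually downloaded:

```python
# After something like `uv add mlx-lm` (or `uv pip install mlx-lm`) in a uv-managed project.
from mlx_lm import load, generate

# Example model id; any mlx-community quant you have locally works the same way.
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")
print(generate(model, tokenizer, prompt="Write a haiku about local LLMs.", verbose=True))
```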
The latter is super easy. Just download the model (through the GUI) and go.
The quality of the output is decent; just keep in mind it is only a 30B model. It also translates really well from French to German and vice versa, much better than Google Translate.
Edit: for comparison, Qwen2.5-Coder 32B Q4 runs at around 12-14 t/s on this M1, which is too slow for me. I usually used Qwen2.5-Coder 17B at around 30 t/s for simple tasks. Qwen3 30B is IMHO better and faster.
[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B
[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
[3] llama.cpp: https://github.com/ggml-org/llama.cpp
It's using 20GB of memory according to ollama.
Conversely, the 4B model actually seemed to work really well and gave results comparable to Gemini 2.0 Flash (at least in my simple tests).
The /think and /no_think commands are very convenient.
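If you haven't used them: they're just appended to the message, so they're easy to script against a local Ollama instance. A small sketch, assuming Ollama's default endpoint and the qwen3:30b tag:

```python
import requests

# Qwen3 soft switches: append "/no_think" to skip the thinking phase for this turn,
# or "/think" to force it back on. Assumes a stock local Ollama install.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:30b",
        "messages": [{
            "role": "user",
            "content": "Give me a one-liner to strip trailing whitespace from a file. /no_think",
        }],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```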
Here’s the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
I wonder how far this can go?
My hunch is that there's still some remaining optimization fruit to be harvested, but I expect we may be nearing a plateau. I may have to upgrade from 64GB of RAM this year.
1. Sovereignty over your data; your outputs can't be stolen or trained on
2. Just for fun / learning / experimenting
3. Avoid detection that you're using AI
4. No Internet connection, in the woods at your cabin or something
If you're on Debian, AFAIK AMD is paying someone to experience the pain in your place, so that is an option if you're building something from scratch. But my openSUSE Tumbleweed installation predates the existence of llama.cpp by a few years, and I'm not subjecting myself to the horror that is Python projects (mis)managed by AI developers[1] :-P.
EDIT: my mistake, ROCm isn't needed (or actually, supported) by koboldcpp; it uses Vulkan. ROCm is available via a fork. Still, it is fast with Vulkan too.
[0] ...and more than once, as it might break after some OS upgrade, like mine did
[1] OK, I did it once, because recently I wanted to try out some tool someone wrote that relied on some AI stuff and I was too stubborn to give up - I had to install Python from source in a Debian Docker container because some dependency 2-3 layers deep didn't compile with a newer minor release of Python. It convinced me yet again to thank Georgi Gerganov for making AI-related tooling that enables people to stick with C++.
llama.cpp can be built using Debian-supplied libraries with ROCm backend enabled.
I used to use Debian in the past, but when I was about to install my current OS I already had the openSUSE Tumbleweed installer on a USB, so I went with that. Ultimately I just needed "a Linux" and didn't care which. I do end up building more stuff from source than when I used Debian, but TBH the only time that annoyed me was with ROCm, because it is broken into 2983847283 pieces, many of them have their own flags for the same stuff, some claim they can be installed anywhere but in practice only work from the default in "/opt", and a bunch of them have their own special-snowflake build process (including one that downloads some random stuff via a script during the build - IIRC a Gentoo packager filed a bug report about removing the need to download stuff, but I'm not sure if it has been addressed or not).
If I were doing a fresh OS install I'd probably go with Gentoo - it packages ROCm like Debian does, but AFAICT (I haven't tried it) it also provides tools for making bespoke patches to installed packages that survive updates, and I'd like to do some customizations on the stuff I install.
In the post I saw there’s gemma3 (multimodal) and qwen3 (not multimodal). Could they be used as above?
How does localforge know when to route a prompt to which agent?
Thank you
I doubt it can perform well with actual autonomous tasks like reading multiple files, navigating directories, and figuring out where to make edits. That's at least what I would understand by “vibe coding”.
I keep my models on an external drive (because Apple), and through the Ollama server they interact really well with Cline or Roo Code, or even Bolt - though I found Bolt doesn't really work well.
"create a python decorator that uses a trie to do mqtt topic routing”
phi4-reasoning works, but I think the code is buggy
phi4-mini-reasoning freaks out
qwen3:30b starts looping and forgets about the decorator
mistral-small gets straight to the point and the code seems sane
https://mastodon.social/@rcarmo/114433075043021470
I regularly use Copilot models, and they can manage this without too many issues (Claude 3.7 and Gemini output usable code with tests), but local models don't seem to be able to do it quite yet.
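For context, here's a rough sketch of the kind of decorator that prompt is after: a trie keyed on topic segments, with MQTT's "+" (one level) and "#" (rest of topic) wildcards handled at lookup time. The API shape (route/dispatch) is my own guess at a sane answer, not any model's output:

```python
# Trie-based MQTT topic routing via a decorator. Handlers register topic filters;
# dispatch() walks the trie, honouring "+" and "#" wildcards.
class TopicTrie:
    def __init__(self):
        self.children = {}   # segment -> TopicTrie
        self.handlers = []   # handlers registered exactly at this node

    def insert(self, segments, handler):
        if not segments:
            self.handlers.append(handler)
            return
        head, rest = segments[0], segments[1:]
        self.children.setdefault(head, TopicTrie()).insert(rest, handler)

    def match(self, segments):
        if not segments:
            yield from self.handlers
            # a trailing "#" also matches its parent level
            if "#" in self.children:
                yield from self.children["#"].handlers
            return
        head, rest = segments[0], segments[1:]
        if head in self.children:
            yield from self.children[head].match(rest)
        if "+" in self.children:          # "+" matches exactly one level
            yield from self.children["+"].match(rest)
        if "#" in self.children:          # "#" matches everything below this level
            yield from self.children["#"].handlers


_trie = TopicTrie()

def route(topic_filter):
    """Decorator: register the wrapped function for an MQTT topic filter."""
    def decorator(func):
        _trie.insert(topic_filter.split("/"), func)
        return func
    return decorator

def dispatch(topic, payload):
    """Call every handler whose filter matches the concrete topic."""
    for handler in _trie.match(topic.split("/")):
        handler(topic, payload)


@route("home/+/temperature")
def on_temperature(topic, payload):
    print(f"{topic}: {payload}")

dispatch("home/kitchen/temperature", "21.5")  # -> home/kitchen/temperature: 21.5
```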