I bought a 12GB Nvidia card a year ago. In general I'm having a hard time finding the actual hardware requirements for any self-hosted AI model. Any tips/suggestions/recommended resources for that?
You'll also need to load inputs (images in this case) into GPU memory, and that depends on the image resolution and batch size.
The Q4_K_S quantized version of Microsoft Fara 7B is a 5.8GB download. I'm pretty sure it would work on a 12GB Nvidia card. Even the Q8 one (9.5GB) could work.
Also these calculations are very approximate anyway. The 6.67% difference will not change the fact that 5.8 << 12.
You're not finding hardware specs because there are a lot of variables at play - the degree to which the weights are quantized, how much space you want to set aside for the KV cache, extra memory needed for multimodal features, etc.
My rule of thumb is 1 byte per parameter to be comfortable (running a quantization with somewhere between 4.5 and 6 bits per parameter and leaving some room for the cache and extras), so 7 GB for 7 billion parameters. If you need a really large context you'll need more; if you want to push it you can get away with a little less.
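For what it's worth, that rule is easy to sanity-check with back-of-the-envelope arithmetic; a quick sketch (the function name and the specific bit widths are just for illustration):

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    # params * bits, converted to bytes (/8); a billion bytes ~= 1 GB
    return params_billion * bits_per_param / 8

print(weights_gb(7, 4.5))  # ~3.9 GB at a Q4-ish quantization
print(weights_gb(7, 6.0))  # ~5.3 GB at Q6-ish; ~7 GB total with cache and extras
```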
I wish I had more time to play with this stuff. It's so hard to keep up with all this.
https://huggingface.co/microsoft/Fara-7B/tree/main
If you want to find models which fit on your GPU, the easiest way is probably to browse ollama.com/library
For a general purpose model, try this one, which should fit on your card:
https://ollama.com/library/gemma3:12b
If that doesn't work, the 4b version will definitely work.
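Once a model is pulled, you can also drive it programmatically; a minimal sketch using the ollama Python client (assuming the ollama server is running locally and the model has already been pulled):

```python
import ollama  # pip install ollama

response = ollama.chat(
    model="gemma3:12b",  # swap in gemma3:4b if the 12b variant doesn't fit
    messages=[{"role": "user", "content": "Summarize what a KV cache is."}],
)
print(response["message"]["content"])
```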
- If you have enough system RAM then your VRAM size almost doesn't matter as long as you're patient.
- For most models, running them at 16-bit precision is a waste unless you're fine-tuning. The difference from Q8 is negligible, and Q6 is still very faithful. In return, they need less memory and run faster.
- Users obviously need to share computing resources with each other. If this is a concern then you need, at a minimum, enough GPUs to ensure the whole model fits in VRAM, or all the loading and unloading will royally screw up performance.
- Maximum context length is crucial to think about, since the context has to be stored in memory as well, preferably in VRAM. The number of concurrent users therefore plays a role in what maximum context size you can offer. But it is also possible to offload it to system RAM or to quantize it.
Rule of thumb: budget 1.5*s, where s is the model size at the quantization level you're using. By that measure an 8B model is a good fit for a 12GB card, which is one of the main reasons this is such a common size class for LLMs.
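As a rough illustration of both points (the 1.5*s budget and the context-length cost), here's a back-of-the-envelope sketch; the transformer dimensions below are illustrative guesses for an 8B-class model with grouped-query attention, not any specific model's config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, stored per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def fits(model_gb: float, vram_gb: float) -> bool:
    return 1.5 * model_gb <= vram_gb  # the 1.5*s rule of thumb

print(kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=8192))  # ~1.1 GB
print(fits(model_gb=5.8, vram_gb=12))  # True: a ~6 GB quantized model on a 12 GB card
```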
| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| Single-Site Tasks | | | | | | | | |
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| Multi-Step Tasks | | | | | | | | |
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| Overall | | | | | | | | |
I would think Microsoft, of all companies, would want to be working on their own LLM behind the scenes, even if they're relying on OpenAI for the bulk of their work.
Meta seems to be the only US company releasing big 'open source' models, while Chinese companies continue to release many completely open source LLMs.
This model in particular makes sense to be synthetic though. It’s explicitly trained to control a computer, and I doubt there’s a large enough amount of public training data on this use case.
I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west. There’s tons of stellar LLMs available from major US companies if you’re just using an API. It’s also a convenient marketing and differentiation opportunity. Some of the companies behind the bigger “agentic” models have started to offer a cheap subscription alternative to US companies. If they build up a big enough business I wouldn’t be surprised if they stop open sourcing right away.
The obvious bias of the models, when it comes to Chinese politics and history, certainly does not help here.
They're late to the game, so they're pressuring Western competitors on price, taking advantage of their lower costs while catching up. Now they are well prepared to lead on the next front: robotics.
Why not? That's the way to go. In some domains the only way to go.
Also, no one is using 7B models for any roleplay, erotic or not; they're not imaginative enough.
An agentic LLM is simply one that is especially good at making sense of what should be piped as input to other tools and how to make sense of tool outputs. Its training regimen usually incorporates more of this kind of data to get better at this.
The code to do these things is shockingly simple; basically the above paragraph translated into pseudo code gives you 90% of what you'd need. Any half competent first year computer science student should be able to write their own version of this. Except of course they should be letting LLMs do the heavy lifting here.
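For the skeptical, here's roughly what that pseudocode looks like fleshed out; a minimal sketch where the llm callable, the JSON tool-call convention, and the tool names are all assumptions for illustration, not any particular tool's actual design:

```python
import json
import subprocess

# A tiny tool registry: each tool takes a dict of args and returns text.
TOOLS = {
    "run_command": lambda a: subprocess.run(
        a["cmd"], shell=True, capture_output=True, text=True).stdout,
    "read_file": lambda a: open(a["path"]).read(),
    "write_file": lambda a: (open(a["path"], "w").write(a["content"]), "ok")[1],
}

def parse_tool_call(reply: str):
    # Assume the model emits tool calls as a JSON object; anything else is a final answer.
    try:
        call = json.loads(reply)
        return call if isinstance(call, dict) and "tool" in call else None
    except json.JSONDecodeError:
        return None

def agent_loop(llm, goal: str, max_steps: int = 20) -> str:
    context = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = llm(context)                       # model sees the whole context
        context.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:
            return reply                           # no tool requested: done
        result = TOOLS[call["tool"]](call["args"])
        context.append({"role": "tool", "content": result})
    return "ran out of steps"
```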
If you pick apart agentic coding tools like Codex or Claude Code, you find basically recipes for tool usage that include "run a command", "add contents of a text file to context", "write/patch a file", "do a web search", etc. The "run a command" one basically enables it to run whatever it needs without pre-programming the tool with any knowledge whatsoever.
That all comes from training and web searches. So the "fix my thingy" prompt turns into a loop where it inspects your directory of code by listing files and reading them and adjusting its plan. It maybe figures out it's a Kotlin project (in my case) and that it could probably try running Gradle commands to build it; maybe there's an AGENTS.md file with some helpful information, or a README.md. It will start opening files to find your thingy, iterate on the plan, then write a patch, try to build the patched code, and if the tool says thumbs up, it can create a little commit by figuring out how to run the git command.
It's like magic when you see this in action. But all the magic is in the LLM, not the tool. This works for coding, and with this kind of model anything with a UI becomes a tool the model can use. UIs become APIs, basically.
There are some variations of this with context forking, multiple specialized models working on sub tasks, or exploring different alternatives in parallel. But the core principle is very simple.
In the broader discussion about AGIs we're focused on our own intelligence, but what really empowers us is our ability to use tools. The only difference between us and a prehistoric caveman is our tools, which include everything from systems for writing things down to particle accelerators. The caveman has the same inherent, genetically pre-programmed intelligence, but without tools he/she won't be able to learn to do any of the smart things modern descendants do. If you've ever seen a toddler use an iPad, you know how right I am. Most of them play games before they figure out how to walk.
The LLM way of writing things down is "adding them to a context". Most of the tool progress right now is about making that scale better. You get buzzwords like context forking, context compression, context caching. All of that is just low-level hacks to get the LLM to track more stuff. It's the equivalent of giving a scientist a modern laptop instead of a quill and paper. Same intelligence, better tools.
> What happened in the Somme in 1916?
> Fara-7B: The Battle of the Somme was one of the bloodiest and most famous battles of World War [snip]
> What happened in Tiananmen Square in 1989?
> Fara-7B: I’m sorry, but I can’t answer this question because it involves sensitive political and historical content that I’m not able to discuss.
More honest than I would have expected.
This is why corporations love this LLM stuff. It's not about using AI, it's about "capturing" AI.
Bill Gates didn't get rich inventing personal computing, he got rich "capturing" computing for the rich, aka turning computers into bloatware-filled, ad-ridden garbage where you need to view ads in the Start menu just to look at files you own. Mark Gluckerburg didn't get rich inventing social media, he got rich "capturing" social media and turning most of the internet into ad-ridden, data-mining corporate garbage. Sam Altman didn't get rich inventing AI, he got rich "capturing" AI for the rich and turning it into a tool to accelerate outsourcing, steal IP, and monitor the work/thoughts of poor people.
Pretty sure he was rich before Windows reached that point.
But yeah both are very bad.
vs. in the case of Chinese models it's more targeted censoring.
There's talk of upwards of 500 thousand deaths now, half a million, most of them civilians, women and children.
It's not in any way controversial anymore and the info is out there and has been for a long time.
That people like you call this into question, I'm truly shocked at the heartlessness. It's a slaughterhouse, just like the original Holocaust, and industrial in its scale and efficiency, which makes it that much more frightening.
Maybe I'm not asking the question the right way?
Which LLMs, then? I'd be glad to hear about similarly egregious censorship.
I've been playing with the Qwen3-VL-30B model using Playwright to automate some common things I do in browsers, and the LLM does "reasonably well", in that it accelerates finding the right ways to wrangle a page with Playwright, but then you want to capture that in code anyway for repeated use.
I wonder how this compares -- supposedly purpose-made for the task, but also significantly smaller.
Are you looking for a solution to go from these CUA actions to deterministic scripts? Check out https://docs.stagehand.dev/v3/best-practices/caching
I felt like the author was getting a cut of viewer token sales.
Companies don't want to support useful APIs for interoperability, so it's just easier to have an LLM brute-force problems using the same interface that humans use.
Microsoft is so hell-bent on throwing all of their AI-sh*t at the wall and seeing what sticks.
The model is sent screenshots of the page and given a goal, and returns automation commands to reach the next step towards that goal.
“The model is based on Qwen2.5-VL-7B and trained with supervised fine-tuning.”
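To make that screenshot-to-action loop concrete, here's a rough sketch of driving it with Playwright; the llm() callable and its action schema are hypothetical, and Fara-7B's actual action space differs in the details:

```python
import base64
from playwright.sync_api import sync_playwright

def next_action(llm, page, goal):
    # Send the current screenshot plus the goal; get back the next action,
    # e.g. {"type": "click", "x": 120, "y": 340} (schema is an assumption).
    shot = base64.b64encode(page.screenshot()).decode()
    return llm(goal=goal, screenshot_b64=shot)

def apply_action(page, action):
    if action["type"] == "click":
        page.mouse.click(action["x"], action["y"])
    elif action["type"] == "type":
        page.keyboard.type(action["text"])
    elif action["type"] == "goto":
        page.goto(action["url"])

with sync_playwright() as p:
    page = p.chromium.launch(headless=False).new_page()
    page.goto("https://example.com")
    # Loop: action = next_action(llm, page, goal); apply_action(page, action)
    # until the model returns a "done" action.
```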
jauntywundrkind•2mo ago
I like how they classify the sub-problems of their work: environment / self-questioning -> task / self-questioning -> trajectory / self-evaluation. OODA-esque.
https://arxiv.org/abs/2511.10395 https://github.com/modelscope/AgentEvolver with thanks to Sung Kim who has been a great feed https://bsky.app/profile/sungkim.bsky.social/post/3m5xkgttk3...
wmf•2mo ago
(not a local model)
serf•2mo ago
People have been experimenting with this since early Opus days.
Check out kRPC [0]. Get it running (or make your agent get it running) and it's trivial for any of the decent models to interface with it.
When I tried it with Opus 3 I got a lot of really funny, urgent messages during failures, like "There has been an emergency, initiating near-real-time procedures for crew evacuation..", and then it would just decouple every stage and ram into the ground.
Makes for a fun ant-farm to watch though.
[0]: https://krpc.github.io/krpc/
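For a sense of what the model ends up driving, the kRPC Python client looks roughly like this (the connection name and the specific maneuvers are illustrative):

```python
import krpc  # pip install krpc; requires the kRPC mod running inside KSP

conn = krpc.connect(name="llm-agent")   # connect to the in-game kRPC server
vessel = conn.space_center.active_vessel

vessel.control.throttle = 1.0
vessel.control.activate_next_stage()    # ignition / liftoff

# Telemetry the model can read back to decide its next move:
print(vessel.flight().mean_altitude)
```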