"What you need" only includes software requirements.
So about 700 bucks for a 3090 on eBay
With a 3090 I guess you'd have to reduce context or go for a slightly more aggressive quantization level.
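To see why that tradeoff exists, here's a rough back-of-envelope for a 24 GB 3090. The parameter count (32B) and ~4.5 bits/weight (Q4 plus scales) are illustrative assumptions, not measurements:

    // Rough VRAM estimate for a quantized model on a 24 GB card.
    // Parameter count and bits-per-weight are assumed, not measured.

    /// Approximate weight footprint in gigabytes.
    fn weight_gb(params: f64, bits_per_weight: f64) -> f64 {
        params * bits_per_weight / 8.0 / 1e9
    }

    fn main() {
        // e.g. a 32B-parameter model at ~4.5 bits/weight
        let weights = weight_gb(32e9, 4.5); // ~18 GB
        // what's left goes to KV cache + activations, which is why
        // you trade context length against quantization level
        let headroom = 24.0 - weights;
        println!("weights ≈ {:.1} GB, headroom ≈ {:.1} GB", weights, headroom);
    }

Dropping to a more aggressive quant shrinks the weights and buys back KV-cache room for context, and vice versa.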
Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token.
For short prompts I get more like ~90 tok/sec and <1 sec to first token.
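Quick arithmetic on those numbers, assuming the 14 seconds is dominated by prompt processing:

    /// Implied prompt-processing throughput, assuming time-to-first-token
    /// is dominated by prefill.
    fn prefill_tok_per_sec(prompt_tokens: f64, ttft_secs: f64) -> f64 {
        prompt_tokens / ttft_secs
    }

    fn main() {
        // ~40k-token prompt, ~14 s to first token => ~2.9k tok/s prefill
        println!("~{:.0} tok/s", prefill_tok_per_sec(40_000.0, 14.0));
    }
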
I didn't do anything fancy, and found it to do much better than my experience with Codex CLI, and similar in quality to Claude Code when I used Sonnet or Opus.
Honestly the CLI stuff was the hardest part, but I chose not to use something like crossterm.
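For anyone curious, skipping crossterm mostly means hand-rolling raw ANSI escape sequences yourself. A minimal std-only sketch (assumes an ANSI-capable terminal; the helper names are mine, not from the project above):

    /// Move the cursor to (row, col), 1-based as terminals expect.
    fn cursor_to(row: u32, col: u32) -> String {
        format!("\x1b[{};{}H", row, col)
    }

    /// Erase the entire current line.
    fn clear_line() -> String {
        "\x1b[2K".to_string()
    }

    fn main() {
        // redraw a status line in place instead of scrolling
        print!("{}{}working...", cursor_to(1, 1), clear_line());
    }

Crates like crossterm mainly buy you Windows compatibility and raw-mode input handling on top of sequences like these.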
(As an aside, my "ideal" language mix would be a pairing of Rust with Python, though the PyO3 interface could be improved.)
Would also love to learn more about your Rust agent + Qwen3!
In Python there are hidden sharp edges, and depending on which dependencies you use, you can hit deadlocks in production without ever knowing you were in danger. Rust has traits to protect against this, and async in Rust is great.
I'd do something like:
    let (tx, rx) = std::sync::mpsc::channel();
    thread::spawn(move || {
        // blocking request
        let response = reqwest::blocking::get(url).unwrap();
        tx.send(response.text().unwrap()).unwrap();
    });
Or
    let (tx, mut rx) = tokio::sync::mpsc::channel(100);
    tokio::spawn(async move {
        let response = client.get(url).send().await;
        tx.send(response).await.unwrap();
    });
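Either way you also need the receiving side. A complete, std-only sketch of the first pattern, with the blocking HTTP call swapped for a stand-in string so it runs anywhere:

    use std::sync::mpsc;
    use std::thread;

    /// Run a "blocking request" on a worker thread and hand the result
    /// back over a channel. The string is a stand-in for something like
    /// reqwest::blocking::get(url).
    fn fetch_via_channel() -> String {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let response = String::from("hello from the worker");
            // send returns a Result (the receiver may have hung up)
            tx.send(response).unwrap();
        });
        // recv blocks the calling thread until the worker sends
        rx.recv().unwrap()
    }

    fn main() {
        println!("{}", fetch_via_channel());
    }

The nice part is the compiler enforces that whatever crosses the channel is Send, which is exactly the class of mistake Python lets you make silently.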
I've heard of deadlocks when using aiohttp or maybe httpx (e.g. due to hidden async-related globals), but I've never myself managed to get a system based on asyncio + concurrent.futures + urllib (i.e. stdlib-only) to deadlock, even with some mix of asyncio and threading locks.
If you have 32 GB of memory you aren't using, it's worth running for small tasks. Otherwise, I'd stick with a cloud-hosted model.
Raises the question of long-term support, etc...
edit: are you the author? You seem to post a lot from that blog and the blog author's other accounts.
Keep in mind that closed, proprietary models:
1) Use your data internally for training, analytics, and more - because "the data is the moat"
2) Are out of your control - one day something might work, the next it might fail because of a model update, a new "internal" system prompt, or a new guardrail that simply blocks your task
3) Are built on the "biggest intellectual property theft" of this century, so they should be open and free ;-)