Two Qwen3 models on one DGX Spark: the residency math

https://www.devashish.me/p/two-qwen3-models-on-one-dgx-spark

20•devashish86•2d ago

Comments

devashish86•2d ago

Author here. Quick context the post doesn't quite spell out:

The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model reasons inside <think>, decides, and never emits the tool call. No error, just empty tool_calls. The fix was swapping the backbone from Thinking to Instruct, not tuning any parser flag.
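A minimal sketch of how this silent failure could be detected, assuming an OpenAI-compatible chat completion response parsed into a dict. The function name `classify_tool_response` and the `reasoning_content` field are assumptions for illustration, not something the post specifies; some servers leave the reasoning inline in `content` instead.

```python
from typing import Any


def classify_tool_response(response: dict[str, Any]) -> str:
    """Classify an OpenAI-style chat completion dict.

    Returns "tool_call" when at least one tool call was emitted,
    "reasoned_without_call" when the model produced reasoning but no call
    (the silent failure described above), and "plain_answer" otherwise.
    """
    message = response["choices"][0]["message"]

    # A missing key, None, or an empty list all mean "no tool call".
    if message.get("tool_calls"):
        return "tool_call"

    content = message.get("content") or ""
    # Assumption: the server may split reasoning into a separate field.
    reasoning = message.get("reasoning_content") or ""

    if reasoning or "<think>" in content:
        return "reasoned_without_call"
    return "plain_answer"


# Hypothetical response: reasoning present, tool_calls empty, no error raised.
silent_failure = {
    "choices": [
        {
            "message": {
                "content": "<think>I should call get_weather.</think>",
                "tool_calls": [],
            }
        }
    ]
}
print(classify_tool_response(silent_failure))  # reasoned_without_call
```

A check like this only surfaces the problem; per the comment above, the actual fix was switching from the Thinking to the Instruct backbone.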

The "load the bigger model first, size the smaller against actual residency" playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB framework floor shows up even at small gpu_memory_utilization values — plan against actuals, not targets.
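The playbook above can be sketched as simple arithmetic. This is a rough illustration, not the post's exact method: the function name, the headroom parameter, and all example numbers except the ~5 GiB floor are hypothetical, and it assumes the framework floor is consumed on top of the second model's weights and KV cache.

```python
def second_model_budget(
    total_gib: float,
    first_model_actual_gib: float,
    framework_floor_gib: float = 5.0,  # ~5 GiB floor mentioned above
    headroom_gib: float = 4.0,         # hypothetical safety margin for the OS etc.
) -> dict[str, float]:
    """Size the second model against the first model's measured residency."""
    # What is left once the bigger model is actually loaded and measured.
    free_gib = total_gib - first_model_actual_gib - headroom_gib
    # Assumption: the second process pays the framework floor before it can
    # spend anything on weights and KV cache.
    usable_gib = free_gib - framework_floor_gib
    if usable_gib <= 0:
        raise ValueError("no room left for a second model")
    return {
        "free_gib": free_gib,
        "weights_and_kv_gib": usable_gib,
        # Share of total memory; a starting point for a utilization setting,
        # to be re-checked against measured residency after loading.
        "fraction_of_total": round(usable_gib / total_gib, 3),
    }


# Hypothetical numbers: 120 GiB usable, first model measured at 70 GiB.
print(second_model_budget(total_gib=120.0, first_model_actual_gib=70.0))
```

The key input is `first_model_actual_gib`, which should come from a measurement after loading rather than from the target that was configured.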

shireboy•1h ago

I’ve been considering a move to a local LLM setup, having been underwhelmed by the cost vs. value of various online offerings. But at the same time I’m worried anything I get will be obsolete in a couple of months. And I don’t want to have to babysit it. I really want some agents managing and creating side hustles for me, plus some other things. I’m technical: I have written my own harness, use gh copilot and grok daily, and have a hosted openwebui+openrouter thing. I’m also torn between a 128 GB MacBook Pro, a Framework, or a Spark or similar plus a lightweight laptop to access it. Would love any advice for (or against) going local. I have asked AI but have analysis paralysis, as 5k would be a big investment for me, so I want to make the right choices.

peddling-brink•51m ago

Well, if you are making side-hustle money now using online models that, critically, you could also run at home, then it sounds like it’s just a matter of numbers. Oh and, unless you spend a lot more than 5k, your local model will still be slower than the online model. What’s your estimated ROI?

Assuming that’s not true based on your phrasing, you’d be shooting yourself in the foot. Start by using online models at the same quant, or at least at the same benchmark level, as what you could run at home. Prepare for the at-home model to be slower.

ericd•49m ago

You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models for a bit to see if you would actually use them before dropping a lot on local hardware. A 128 gig MacBook Pro isn’t going to get you an amazing model, and certainly not amazing speed. GLM 5.2 wants something like 350+ gigs at fp4 iirc.

dzink•1h ago

Have you tried llama.cpp with unsloth and models suited to it? GLM flash? It seemed to allow more models to be tried soon after they are released. Haven’t tried for long term deployment though, that’s the next step.

pet_the_bird•45m ago

Highly anecdotal: I have tried various self-hosted models using both vllm and llama.cpp. I am in a situation where I have access to a large amount of memory (~320 GB).

While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is meh, 4-bit is half decent, and 6-bit is required for programming." Still, although MiniMax-m2.7 is usable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.

I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).

I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.
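To make the memory side of that tradeoff concrete, here is a back-of-the-envelope sketch of weight footprint versus bit width. The helper name and the 100B parameter count are hypothetical, and the estimate covers weights only: it ignores KV cache, activations, and the per-block metadata that real quantization formats add.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a given bit width (weights only)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical 100B-parameter model at the bit widths discussed above.
for bits in (2, 4, 6, 16):
    print(f"{bits:>2}-bit: {weights_gib(100, bits):.1f} GiB")
```

Real quantized files usually come out somewhat larger than this estimate, so it is best read as a lower bound.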

verdverm•40m ago

I have tried llama-cpp; vllm is nicer (ray, handles queueing, doesn't have the cache invalidation bug for qwen/gemma models), and unsloth has toxic employees in their discord.

I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.

The setup is not as capable, but still good and gets better with models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and a potential right to compute should the government/oligarchy decide it gets to choose who can access which models.

roger_•12m ago

How about Qwen3.7? What sort of prefill/decode rates?
