Mac Studio M3 Ultra with 96 GB
Is it good for running large models locally?
Lower quants like that degrade the model's output, making it less capable overall and more prone to forgetting details.
Here's what I'd do with 96GB of RAM: run Qwen 3.6 35b-a3b at Q8 for coding/agentic tasks. You'll get around 70 tokens/sec of generation, prefill is lightning fast in comparison, and you'll get a lot of work done. Qwen 3.6 27b is out now too; I'm getting 17 tok/sec generation with a slower prefill.
The upshot is that you'll still have 20-40GB of RAM left for your workstation and development loads. Running Qwen 3.6 35b or 27b at Q8 quantization, the model at 128k context uses about 40GB of RAM; my OS and application load takes 20-30GB most of the time, for a total of 60-70GB. That leaves plenty of room in memory for you to work _and_ run inference.
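The budget above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes roughly 1 byte per parameter at Q8 and treats the context and OS figures as illustrative placeholders, not measured values:

```python
# Back-of-envelope RAM budget for running a quantized model locally.
# All figures are illustrative assumptions, not measurements.

def model_ram_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: billions of params * bytes/param."""
    return params_b * bytes_per_param

WEIGHTS_GB = model_ram_gb(35, 1.0)  # ~35B params at Q8 (~1 byte/param) -> ~35 GB
CONTEXT_GB = 5.0                    # assumed KV-cache overhead at 128k context
OS_APPS_GB = 25.0                   # midpoint of a 20-30 GB OS + application load

total = WEIGHTS_GB + CONTEXT_GB + OS_APPS_GB
headroom = 96 - total
print(f"total used: ~{total:.0f} GB, headroom on 96 GB: ~{headroom:.0f} GB")
```

With those assumptions you land around 65GB used, which matches the 60-70GB range quoted above.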
You _may_ end up getting Deepseek 4 Flash running, but only at a lower quantization like Q2 or Q3, which makes it kind of dumb in comparison. And you may not have enough memory left over for any appreciable amount of context. Today's reasoning models need context to generate good answers, doubly so for agentic/coding tasks.
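To see why a much larger model only squeezes in at aggressive quants, compare weight memory across quantization levels. The parameter count and bytes-per-param figures here are rough illustrative assumptions, not the actual Deepseek specs:

```python
# Rough weight-memory comparison across quantization levels.
# Bytes-per-param values are approximate GGUF-style averages (assumed).
BYTES_PER_PARAM = {"Q8": 1.0, "Q4": 0.55, "Q3": 0.45, "Q2": 0.35}

def weights_gb(params_b: float, quant: str) -> float:
    """Approximate weight memory in GB for a given quantization level."""
    return params_b * BYTES_PER_PARAM[quant]

for q in ("Q8", "Q4", "Q3", "Q2"):
    gb = weights_gb(200, q)  # hypothetical ~200B-param model
    verdict = "fits" if gb < 96 else "does not fit"
    print(f"{q}: ~{gb:.0f} GB -> {verdict} in 96 GB (before any context)")
```

Under these assumptions only Q2/Q3 fit at all, and Q3 leaves almost nothing for the KV cache, which is exactly the no-context problem described above.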
Totally agree, context is everything for agentic coding.
Any other hardware reco that'll help run larger models?
bigyabai•14h ago
The M3 Ultra's GPU is a bit on the weak side for large-scale inference, so you'll be waiting on token prefill for most coding/agent workflows.
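The prefill wait can be estimated from first principles: prefill is compute-bound at roughly 2 FLOPs per parameter per prompt token. The effective-TFLOPS figure below is an assumed round number for illustration, not a measured M3 Ultra benchmark:

```python
# Why prefill dominates on weaker GPUs: it is compute-bound, roughly
# 2 * params FLOPs per prompt token. TFLOPS here is an assumed value.

def prefill_seconds(prompt_tokens: int, active_params_b: float, tflops: float) -> float:
    """Estimate prefill time: total FLOPs divided by effective throughput."""
    flops = prompt_tokens * 2 * active_params_b * 1e9
    return flops / (tflops * 1e12)

# e.g. a 30k-token coding prompt into a dense 35B model at an assumed
# 30 effective TFLOPS of sustained GPU throughput
t = prefill_seconds(30_000, 35, 30)
print(f"~{t:.0f} s of prefill before the first generated token")
```

At those assumptions a long agentic prompt costs about a minute of prefill per turn, which is why MoE models (fewer active parameters per token) feel so much more usable on this hardware.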
namegulf•12h ago
Have you tried any other models with this M3 Ultra?
bigyabai•11h ago
Apple's GPUs are just not very fast for inference. I'd stick to the smaller 7b-18b parameter range or MoE models like Qwen if you want a usable inference speed.
namegulf•11h ago
Any thoughts on M5?
They may soon release an M5 Mac Studio/Mini.
namegulf•11h ago
$4,699.00
But it looks like we may also need an NVIDIA AI Enterprise - DGX Spark license.