But maybe this will change? Perhaps it's just software issues?
It also runs CUDA, which is useful
Plus, apparently some of the early benchmarks were run with Ollama and should be disregarded.
I'm running vLLM on it now and it was as simple as:
docker run --gpus all -it --rm \
--ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 \
nvcr.io/nvidia/vllm:25.09-py3
(That recipe is from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?v... )

And then in the Docker container:
vllm serve &
vllm chat
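Side note: once vllm serve is up it exposes an OpenAI-compatible API on localhost:8000 by default, so you can also hit it from Python. A minimal sketch using the openai client package (which you'd install separately; it's not part of the container recipe above):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Ask the server which model it loaded rather than hard-coding one.
model = client.models.list().data[0].id

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)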
The default model it loads is Qwen/Qwen3-0.6B, which is tiny and fast to load.

I should be allowed to do stupid things when I want. Give me an override!
IS_SANDBOX=0 claude --dangerously-skip-permissions
You can run that as root and Claude won't complain.

● Bash(free -h)
⎿               total        used        free      shared  buff/cache   available
  Mem:          119Gi       7.5Gi       100Gi        17Mi        12Gi       112Gi
  Swap:            0B          0B          0B
That 119Gi is indeed gibibytes, and 119Gi converted to GB is just about 128GB (quick conversion below).

I'm looking forward to GLM 4.6 Air - I expect that one should be pretty excellent, based on experiments with a quantized version of its predecessor on my Mac. https://simonwillison.net/2025/Jul/29/space-invaders/
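For anyone checking that conversion, it's just unit arithmetic (GiB is base 2, GB is base 10):

gib = 119
total_bytes = gib * 2**30      # 127,775,277,056 bytes
gb = total_bytes / 10**9       # ~127.8 decimal gigabytes
print(f"{gib} GiB = {gb:.1f} GB")   # 119 GiB = 127.8 GB, i.e. the advertised 128GB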
I'd be pissed if I paid this much for hardware and the performance was this lacklustre while also being kneecapped for training
ChrisArchitect•3h ago