It’s just 1 vCPU with 4 GB of RAM, and you know what? It’s more than enough for these needs. I think hardware manufacturers have falsely convinced us that every professional needs a beefy laptop to be productive.
Keeps the desk nice and tidy while “the beasts” roar in a soundproofed closet.
So while you don't need a ton of compute on the CPU, you do need the ability to address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine as long as the motherboard exposes enough of the 128 PCIe lanes a single-socket EPYC provides.
There are larger/better models as well, but those tend to really push the limits of 96 GB.
FWIW, when you start pushing into 128 GB+, the ~500 GB models really start to become attractive, because at that point you’re probably wanting just a bit more out of everything.
Smaller open source models are a bit like 3D printing in the early days; fun to experiment with, but really not that valuable for anything other than making toys.
Text summarization, maybe? But even then I want a model that understands the complete context and does a good job. Even for things like "generate one sentence about the action we're performing", I usually find I can just incorporate it into the output schema of a larger request instead of making a separate request to a smaller model.
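Rough sketch of what I mean, assuming a backend that accepts a JSON-schema-style structured output (all the field names here are made up): the one-sentence summary just rides along as an extra field in the big request, no second call to a small model needed.

```python
# Hypothetical output schema for some "perform an edit" request; the point is
# that "action_summary" is just one more field in the main request's response,
# not a separate request to a smaller model.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "new_path": {"type": "string"},
        "edits": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file": {"type": "string"},
                    "patch": {"type": "string"},
                },
                "required": ["file", "patch"],
            },
        },
        # One-sentence, user-facing description of the action being performed.
        "action_summary": {"type": "string"},
    },
    "required": ["new_path", "edits", "action_summary"],
}
```

Hand that to whatever structured-output mechanism your backend supports and the summary comes back for free with the real answer.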
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
For LLMs you'll probably want a different setup, with some system memory and some M.2 storage too.
Generally, scaling on consumer GPUs falls off somewhere between 4 and 8 GPUs for most workloads. Those running more GPUs are typically using a larger number of smaller GPUs for cost effectiveness.
Source code: https://github.com/BinSquare/inferbench
That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.
Of course prefill is going to be GPU bound. You only send a few thousand bytes to it, and don't really ask for much back. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more than its VRAM bandwidth.
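Back-of-envelope version of that claim, with illustrative numbers rather than a benchmark: during single-stream decode each generated token has to stream roughly the whole set of active weights out of VRAM once, so memory bandwidth, not compute, sets the ceiling.

```python
# Bandwidth-bound decode estimate (numbers are illustrative assumptions):
# every generated token reads approximately the whole quantized model from VRAM.
model_size_gb = 40.0         # e.g. a ~70B-parameter model at ~4-bit quantization
vram_bandwidth_gbs = 1000.0  # e.g. a GPU with ~1 TB/s memory bandwidth

tokens_per_second_ceiling = vram_bandwidth_gbs / model_size_gb
print(f"~{tokens_per_second_ceiling:.0f} tok/s upper bound for a single stream")
# Batching reuses the same weight reads across requests, which is why batched
# mode is what actually keeps the compute units busy.
```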
It's well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, so in a multi-GPU setup all but one GPU sits stalled unless there are n_gpu users/tasks running in parallel. It's also known that some GPUs are faster at "prompt processing" and some at "token generation", so combining Radeon and NVIDIA cards sometimes helps. Reportedly the inter-layer transfer sizes are in the kilobyte range, so even PCIe x1 is plenty.
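Toy PyTorch sketch of why the layer split behaves that way (shapes and device placement made up, assumes two CUDA GPUs): each GPU can only start once the previous one finishes, and the only thing crossing PCIe is a small activation tensor.

```python
import torch
import torch.nn as nn

# Toy "model" split by layers across two GPUs, llama.cpp-style.
block_a = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(16)]).to("cuda:0")
block_b = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(16)]).to("cuda:1")

x = torch.randn(1, 4096, device="cuda:0")  # one token's hidden state
h = block_a(x)      # GPU 1 sits idle while GPU 0 works...
h = h.to("cuda:1")  # ...then a ~16 KB activation hops over PCIe...
y = block_b(h)      # ...and GPU 0 sits idle while GPU 1 works.
```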
Getting around that takes a backend with "tensor parallel" mode support, which splits the neural network parallel to the direction of data flow. That mode also obviously benefits substantially from a good interconnect between GPUs, like PCIe x16, NVLink/Infinity Fabric bridge cables, and/or inter-GPU DMA over PCIe (called GPU P2P, GPUDirect, or some lingo like that).
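For contrast, a minimal tensor-parallel sketch with the same toy shapes (again assuming two CUDA GPUs): each GPU holds half of every weight matrix and works on the same token at the same time, but the partial results have to be stitched back together every layer, which is where the fast interconnect gets earned.

```python
import torch
import torch.nn as nn

# Column-wise tensor parallelism for one linear layer across two GPUs:
# each GPU owns half the output columns and computes its half concurrently.
w0 = nn.Linear(4096, 2048, bias=False).to("cuda:0")  # first half of the weights
w1 = nn.Linear(4096, 2048, bias=False).to("cuda:1")  # second half of the weights

x = torch.randn(1, 4096)
y0 = w0(x.to("cuda:0"))  # both GPUs busy at the same time
y1 = w1(x.to("cuda:1"))
# The gather that recombines the halves happens every layer, so the
# interconnect (PCIe x16, NVLink, P2P DMA) is on the critical path.
y = torch.cat([y0, y1.to("cuda:0")], dim=-1)
```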
Absent those, I've read that people can sometimes watch GPU utilization spikes walk across their GPUs one by one in nvtop-style tools.
Finding a way to break up LLM work so that there are multiple tasks to run concurrently would be interesting, maybe by creating one "manager" and a few "delegated engineer" personalities. Or simulating multiple domains of the brain, such as the speech center, visual cortex, language center, etc., communicating in tokens might be an interesting way to work around this problem.
This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.
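Something like this, as a rough sketch against an OpenAI-compatible local server (the endpoint, model name, and prompts are all placeholders): the manager fans sub-tasks out to the workers, which gives the backend several concurrent requests instead of one, exactly the kind of parallelism that keeps all the GPUs' layers busy.

```python
import asyncio
from openai import AsyncOpenAI  # any OpenAI-compatible local server works here

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="unused")

async def worker(task: str) -> str:
    # A "delegated engineer": one independent sub-task per request.
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a delegated engineer."},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content

async def manager() -> None:
    # The "manager" splits the job into independent sub-tasks and fans them out.
    tasks = ["Summarize module A", "Summarize module B", "Summarize module C"]
    results = await asyncio.gather(*(worker(t) for t in tasks))
    for task, result in zip(tasks, results):
        print(task, "->", result)

asyncio.run(manager())
```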