It is a little sad that they gave someone an uber machine and this was the best he could come up with.
Question answering is interesting but not the most interesting thing one can do, especially with a home rig.
The realm of the possible
- Video generation: CogVideoX at full resolution, longer clips; Mochi or Hunyuan Video with extended duration
- Image generation at scale: FLUX batch generation, 50 images simultaneously
- Fine-tuning: actually train something. Show LoRA on a 400B model, or full fine-tuning on a 70B
But I suppose "You have it for the weekend" means chatbot go brrrrr and snark.
Use them for something creative, write a short story on spec, generate images.
Or the best option: give it tools and let it actually DO something, like "read my message history with my wife, find the top 5 gift ideas she might have hinted at, and search for options to purchase them." That's perfect for a local model. There's no way in hell I'd feed my messages to a public LLM, but the one sitting next to me that I can turn off the second it twitches the wrong way? Sure.
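A minimal sketch of the "give it tools" idea. The model itself would be served locally (e.g. via an OpenAI-compatible endpoint from llama.cpp or Ollama) and emit tool calls as JSON; the tool names, payload shape, and stub implementations below are all hypothetical, just to show the dispatch loop:

```python
# Sketch of a local tool-use dispatcher. The LLM call is omitted; the model
# is assumed to emit tool calls like {"name": ..., "args": {...}}.
# Tool names and stub bodies are hypothetical, for illustration only.
import json


def read_messages(contact: str) -> list[str]:
    # Stub: in reality this would query a local message-history database.
    return ["we should get a new espresso machine someday",
            "I love those ceramic mugs"]


def web_search(query: str) -> list[str]:
    # Stub: in reality this would call a search API.
    return [f"result for: {query}"]


TOOLS = {"read_messages": read_messages, "web_search": web_search}


def dispatch(tool_call_json: str) -> str:
    """Run one model-emitted tool call and return its result as JSON."""
    call = json.loads(tool_call_json)
    result = TOOLS[call["name"]](**call["args"])
    return json.dumps(result)


# A model-emitted call might look like:
print(dispatch('{"name": "read_messages", "args": {"contact": "wife"}}'))
```

The point is that every byte of the message history stays on the box; only the dispatcher and the stubs would need real implementations.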
Yeah, that's what I wanted to see too.
Seems like the ecosystem is rapidly evolving.
But I mostly want to say thanks for everything you do. Your good vibes are deeply appreciated and you are an inspiration.
I would have expected that going from one node (which can't hold the weights in RAM) to two nodes would have increased inference speed by more than the measured 32% (21.1t/s -> 27.8t/s).
With no constraint on RAM (4 nodes) the inference speed is less than 50% faster than with only 512GB.
Am I missing something?
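The quoted numbers check out as quick arithmetic (figures from the parent post; ideal linear scaling is noted only for comparison):

```python
# Measured decode speeds from the parent post.
one_node = 21.1   # t/s, single node (weights don't fully fit in RAM)
two_nodes = 27.8  # t/s, two nodes (weights fully resident)

speedup = two_nodes / one_node - 1
print(f"{speedup:.0%}")  # ~32%, vs. +100% for ideal linear scaling
```

If anything, one might expect going from SSD-throttled to fully RAM-resident to give a *superlinear* jump, so the measured +32% suggests something else (interconnect overhead, sequential layer dependencies) dominates.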
I don't think that's true. At least not without heavy performance loss in which case "just be memory mapped" is doing a lot of work here.
By that logic GPUs could run models much larger than their VRAM would otherwise allow, which doesn't seem to be the case unless heavy quantization is involved.
You'd need to be in a weirdly compute-limited situation before you can replace significant amounts of RAM with SSD, unless I'm missing something big.
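A back-of-envelope version of this argument: decode is roughly memory-bandwidth-bound, so tokens/s ≈ bandwidth ÷ bytes touched per token (about the size of the active weights). All numbers below are illustrative assumptions, not measurements:

```python
# Illustrative figures, not benchmarks.
weights_gb = 40    # e.g. a ~70B model at 4-bit quantization (assumption)
ram_bw_gbps = 800  # M3 Ultra unified memory bandwidth
ssd_bw_gbps = 7    # a fast NVMe SSD (assumption)

ram_tps = ram_bw_gbps / weights_gb  # ~20 t/s if weights sit in RAM
ssd_tps = ssd_bw_gbps / weights_gb  # ~0.18 t/s if streamed from SSD each token
print(f"RAM-resident: ~{ram_tps:.0f} t/s, SSD-streamed: ~{ssd_tps:.2f} t/s")
```

A ~100x gap in sustained bandwidth is why memory-mapping from SSD can't stand in for RAM unless you're compute-limited or only a small fraction of weights is touched per token.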
> MoE architecture should help quite a bit here.
In that you're actually using a smaller model and swapping between them less frequently, sure.
I definitely would not be buying an M3 Ultra right now on my own dime.
Makes one wonder what Apple uses for their own servers. Maybe they have some internal M-series server product they just haven't bothered to release to the public, and features like this are downstream of that?
I guess they prefer that third parties deal with that. There’s rack mount shelves for Mac Minis and Studios.
- Why is the tooling so lame?
- What do they, themselves, use internally?
Stringing together Mac minis (or a "Studio", whatever) with Thunderbolt cables ... Christ.
Or do they have some real server-grade product coming down the line, and are releasing this ahead of it so that 3rd party software supports it on launch day?
behnamoh•1h ago
- Something like a DGX QSFP link (200Gb/s, 400Gb/s) instead of TB5. Otherwise, the economics of this RDMA setup, while impressive, don't make sense.
- Neural accelerators to get prompt prefill time down. I don't expect RTX 6000 Pro speeds, but something like 3090/4090 would be nice.
- 1TB of unified memory in the maxed out version of Mac Studio. I'd rather invest in more RAM than more devices (centralized will always be faster than distributed).
- 1TB/s+ of bandwidth. For the past three generations, the speed has been stuck at 800GB/s...
- The ability to overclock the system? I know it probably will never happen, but my expectations for a Mac Studio are not the same as for a laptop, and I'm TOTALLY okay with it consuming 600W+ of power. Currently it's capped at ~250W.
Also, as the OP noted, this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!! All the more reason for Apple to invest in something like QSFP.
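The 4-device cap falls out of simple mesh arithmetic: a fully connected mesh of n nodes needs n-1 ports on every node, while a switched fabric (the QSFP route) needs just one uplink per node regardless of n. Assuming roughly 3 TB5 ports can be dedicated to clustering per Mac (port count is an assumption), n=4 is the ceiling:

```python
def mesh_ports_per_node(n: int) -> int:
    """Ports each node must dedicate in a fully connected mesh of n nodes."""
    return n - 1


def mesh_links(n: int) -> int:
    """Total point-to-point cables in the mesh: n choose 2."""
    return n * (n - 1) // 2


for n in (2, 3, 4, 8):
    print(f"{n} nodes: {mesh_ports_per_node(n)} ports/node, "
          f"{mesh_links(n)} cables (vs. 1 uplink/node with a switch)")
```

At n=8 the mesh would need 7 ports per node and 28 cables, which is why full-mesh TB5 doesn't scale past a handful of machines.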
tylerflick•1h ago
The 2019 i9 MacBook Pro has entered the chat.
angoragoats•52m ago
This isn’t any different with QSFP unless you’re suggesting that one adds a 200GbE switch to the mix, which:
* Adds thousands of dollars of cost,
* Adds 150W or more of power usage and the accompanying loud fan noise that comes with that,
* And perhaps most importantly adds measurable latency to a networking stack that is already higher latency than the RDMA approach used by the TB5 setup in the OP.
fenced_load•33m ago
https://www.bhphotovideo.com/c/product/1926851-REG/mikrotik_...
zozbot234•52m ago
Apple Neural Engine is a thing already, with support for multiply-accumulate on INT8 and FP16. AI inference frameworks need to add support for it.
> this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!!
Do you really need a fully connected mesh? Doesn't Thunderbolt just show up as a network connection that RDMA is run on top of?
csdreamer7•16m ago
Or, Apple could pay for the engineers to add it.
Dylan16807•51m ago
M4 already hit the necessary speed per channel, and M5 is well above it. If they actually release an Ultra, that much bandwidth is guaranteed on the full version. Even the smaller version with 25% fewer memory channels will be pretty close.
We already know Max won't get anywhere near 1TB/s since Max is half of an Ultra.
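The arithmetic behind this, using Apple's published M4 Max figure (546 GB/s, consistent with a 512-bit LPDDR5X-8533 bus) and the usual Ultra-is-two-Max-dies layout; treat the exact channel counts as assumptions:

```python
# Bandwidth = bus width (bytes) x data rate (GT/s). Channel/bus figures
# below match Apple's published 546 GB/s for the full M4 Max, but are
# assumptions as far as any future Ultra goes.
bus_bytes = 512 // 8       # 64-byte-wide bus on the full M4 Max
data_rate = 8.533          # GT/s, LPDDR5X-8533

max_bw = bus_bytes * data_rate   # ~546 GB/s (full M4 Max)
ultra_bw = 2 * max_bw            # ~1092 GB/s if an Ultra doubles the Max
binned_bw = ultra_bw * 0.75      # ~819 GB/s with 25% fewer channels
print(f"Max: {max_bw:.0f} GB/s, Ultra: {ultra_bw:.0f} GB/s, "
      f"binned Ultra: {binned_bw:.0f} GB/s")
```

So a full M4-generation Ultra would clear 1TB/s by construction, and even the binned version would beat today's 800GB/s.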