It is a little sad that they gave someone an uber machine and this was the best he could come up with.
Question answering is interesting but not the most interesting thing one can do, especially with a home rig.
The realm of the possible
- Video generation: CogVideoX at full resolution, longer clips; Mochi or Hunyuan Video with extended duration
- Image generation at scale: FLUX batch generation, 50 images simultaneously
- Fine-tuning: actually train something. Show LoRA on a 400B model, or full fine-tuning on a 70B
But I suppose "You have it for the weekend" means chatbot go brrrrr and snark.
Use them for something creative, write a short story on spec, generate images.
Or the best option: give it tools and let it actually DO something, like "read my message history with my wife, find the top 5 gift ideas she might have hinted at, and search for options to purchase them." That's perfect for a local model. There's no way in hell I'd feed my messages to a public LLM, but the one sitting next to me that I can turn off the second it twitches the wrong way? Sure.
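A minimal sketch of the "give it tools" idea. The model itself would be served locally (e.g. via an OpenAI-compatible endpoint from llama.cpp or Ollama) and emit tool calls as JSON; the tool names, payload shape, and stub implementations below are all hypothetical, just to show the dispatch loop:

```python
# Sketch of a local tool-use dispatcher. The LLM call is omitted; the model
# is assumed to emit tool calls like {"name": ..., "args": {...}}.
# Tool names and stub bodies are hypothetical, for illustration only.
import json


def read_messages(contact: str) -> list[str]:
    # Stub: in reality this would query a local message-history database.
    return ["we should get a new espresso machine someday",
            "I love those ceramic mugs"]


def web_search(query: str) -> list[str]:
    # Stub: in reality this would call a search API.
    return [f"result for: {query}"]


TOOLS = {"read_messages": read_messages, "web_search": web_search}


def dispatch(tool_call_json: str) -> str:
    """Run one model-emitted tool call and return its result as JSON."""
    call = json.loads(tool_call_json)
    result = TOOLS[call["name"]](**call["args"])
    return json.dumps(result)


# A model-emitted call might look like:
print(dispatch('{"name": "read_messages", "args": {"contact": "wife"}}'))
```

The point is that every byte of the message history stays on the box; only the dispatcher and the stubs would need real implementations.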
Yeah, that's what I wanted to see too.
Seems like the ecosystem is rapidly evolving.
But I mostly want to say thanks for everything you do. Your good vibes are deeply appreciated and you are an inspiration.
I would have expected that going from one node (which can't hold the weights in RAM) to two nodes would have increased inference speed by more than the measured 32% (21.1t/s -> 27.8t/s).
With no constraint on RAM (4 nodes) the inference speed is less than 50% faster than with only 512GB.
Am I missing something?
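The quoted numbers check out as quick arithmetic (figures from the parent post; ideal linear scaling is noted only for comparison):

```python
# Measured decode speeds from the parent post.
one_node = 21.1   # t/s, single node (weights don't fully fit in RAM)
two_nodes = 27.8  # t/s, two nodes (weights fully resident)

speedup = two_nodes / one_node - 1
print(f"{speedup:.0%}")  # ~32%, vs. +100% for ideal linear scaling
```

If anything, one might expect going from SSD-throttled to fully RAM-resident to give a *superlinear* jump, so the measured +32% suggests something else (interconnect overhead, sequential layer dependencies) dominates.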
I don't think that's true. At least not without heavy performance loss in which case "just be memory mapped" is doing a lot of work here.
By that logic GPUs could run models much larger than their VRAM would otherwise allow, which doesn't seem to be the case unless heavy quantization is involved.
You'd need to be in a weirdly compute-limited situation before you can replace significant amounts of RAM with SSD, unless I'm missing something big.
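A back-of-envelope version of this argument: decode is roughly memory-bandwidth-bound, so tokens/s ≈ bandwidth ÷ bytes touched per token (about the size of the active weights). All numbers below are illustrative assumptions, not measurements:

```python
# Illustrative figures, not benchmarks.
weights_gb = 40    # e.g. a ~70B model at 4-bit quantization (assumption)
ram_bw_gbps = 800  # M3 Ultra unified memory bandwidth
ssd_bw_gbps = 7    # a fast NVMe SSD (assumption)

ram_tps = ram_bw_gbps / weights_gb  # ~20 t/s if weights sit in RAM
ssd_tps = ssd_bw_gbps / weights_gb  # ~0.18 t/s if streamed from SSD each token
print(f"RAM-resident: ~{ram_tps:.0f} t/s, SSD-streamed: ~{ssd_tps:.2f} t/s")
```

A ~100x gap in sustained bandwidth is why memory-mapping from SSD can't stand in for RAM unless you're compute-limited or only a small fraction of weights is touched per token.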
> MoE architecture should help quite a bit here.
In that you're actually using a smaller model and swapping between them less frequently, sure.
I definitely would not be buying an M3 Ultra right now on my own dime.
Makes one wonder what Apple uses for their own servers. Maybe they have some internal M-series server product they just haven't bothered to release to the public, and features like this are downstream of that?
I guess they prefer that third parties deal with that. There’s rack mount shelves for Mac Minis and Studios.
- Why is the tooling so lame?
- What do they, themselves, use internally?
Stringing together Mac minis (or a "Studio", whatever) with Thunderbolt cables ... Christ.
Or do they have some real server-grade product coming down the line, and are releasing this ahead of it so that 3rd party software supports it on launch day?
behnamoh•1h ago
- Something like a DGX QSFP link (200Gb/s, 400Gb/s) instead of TB5. Otherwise, the economics of this RDMA setup, while impressive, don't make sense.
- Neural accelerators to get prompt prefill time down. I don't expect RTX 6000 Pro speeds, but something like 3090/4090 would be nice.
- 1TB of unified memory in the maxed out version of Mac Studio. I'd rather invest in more RAM than more devices (centralized will always be faster than distributed).
- 1TB/s+ of bandwidth. For the past three generations, the speed has been stuck at 800GB/s...
- The ability to overclock the system? I know it probably will never happen, but my expectations for a Mac Studio are not the same as for a laptop, and I'm TOTALLY okay with it consuming 600W+ of power. Currently it's capped at ~250W.
Also, as the OP noted, this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!! All the more reason for Apple to invest in something like QSFP.
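The 4-device cap falls out of simple mesh arithmetic: a fully connected mesh of n nodes needs n-1 ports on every node, while a switched fabric (the QSFP route) needs just one uplink per node regardless of n. Assuming roughly 3 TB5 ports can be dedicated to clustering per Mac (port count is an assumption), n=4 is the ceiling:

```python
def mesh_ports_per_node(n: int) -> int:
    """Ports each node must dedicate in a fully connected mesh of n nodes."""
    return n - 1


def mesh_links(n: int) -> int:
    """Total point-to-point cables in the mesh: n choose 2."""
    return n * (n - 1) // 2


for n in (2, 3, 4, 8):
    print(f"{n} nodes: {mesh_ports_per_node(n)} ports/node, "
          f"{mesh_links(n)} cables (vs. 1 uplink/node with a switch)")
```

At n=8 the mesh would need 7 ports per node and 28 cables, which is why full-mesh TB5 doesn't scale past a handful of machines.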
tylerflick•1h ago
The 2019 i9 MacBook Pro has entered the chat.
angoragoats•52m ago
This isn’t any different with QSFP unless you’re suggesting that one adds a 200GbE switch to the mix, which:
* Adds thousands of dollars of cost,
* Adds 150W or more of power usage and the accompanying loud fan noise that comes with that,
* And perhaps most importantly adds measurable latency to a networking stack that is already higher latency than the RDMA approach used by the TB5 setup in the OP.
fenced_load•33m ago
https://www.bhphotovideo.com/c/product/1926851-REG/mikrotik_...
zozbot234•52m ago
Apple Neural Engine is a thing already, with support for multiply-accumulate on INT8 and FP16. AI inference frameworks need to add support for it.
> this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!!
Do you really need a fully connected mesh? Doesn't Thunderbolt just show up as a network connection that RDMA is run on top of?
csdreamer7•16m ago
Or, Apple could pay for the engineers to add it.
Dylan16807•51m ago
M4 already hit the necessary speed per channel, and M5 is well above it. If they actually release an Ultra, that much bandwidth is guaranteed on the full version. Even the smaller version with 25% fewer memory channels will be pretty close.
We already know Max won't get anywhere near 1TB/s since Max is half of an Ultra.
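The arithmetic behind this, using Apple's published M4 Max figure (546 GB/s, consistent with a 512-bit LPDDR5X-8533 bus) and the usual Ultra-is-two-Max-dies layout; treat the exact channel counts as assumptions:

```python
# Bandwidth = bus width (bytes) x data rate (GT/s). Channel/bus figures
# below match Apple's published 546 GB/s for the full M4 Max, but are
# assumptions as far as any future Ultra goes.
bus_bytes = 512 // 8       # 64-byte-wide bus on the full M4 Max
data_rate = 8.533          # GT/s, LPDDR5X-8533

max_bw = bus_bytes * data_rate   # ~546 GB/s (full M4 Max)
ultra_bw = 2 * max_bw            # ~1092 GB/s if an Ultra doubles the Max
binned_bw = ultra_bw * 0.75      # ~819 GB/s with 25% fewer channels
print(f"Max: {max_bw:.0f} GB/s, Ultra: {ultra_bw:.0f} GB/s, "
      f"binned Ultra: {binned_bw:.0f} GB/s")
```

So a full M4-generation Ultra would clear 1TB/s by construction, and even the binned version would beat today's 800GB/s.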