frontpage.

Beginning January 2026, all ACM publications will be made open access

https://dl.acm.org/openaccess
1290•Kerrick•9h ago•142 comments

1.5 TB of VRAM on Mac Studio – RDMA over Thunderbolt 5

https://www.jeffgeerling.com/blog/2025/15-tb-vram-on-mac-studio-rdma-over-thunderbolt-5
143•rbanffy•2h ago•46 comments

We pwned X, Vercel, Cursor, and Discord through a supply-chain attack

https://gist.github.com/hackermondev/5e2cdc32849405fff6b46957747a2d28
565•hackermondev•5h ago•218 comments

Trained LLMs exclusively on pre-1913 texts

https://github.com/DGoettlich/history-llms
136•iamwil•2h ago•46 comments

Texas is suing all of the big TV makers for spying on what you watch

https://www.theverge.com/news/845400/texas-tv-makers-lawsuit-samsung-sony-lg-hisense-tcl-spying
470•tortilla•2d ago•253 comments

GPT-5.2-Codex

https://openai.com/index/introducing-gpt-5-2-codex/
353•meetpateltech•6h ago•207 comments

How China built its ‘Manhattan Project’ to rival the West in AI chips

https://www.japantimes.co.jp/business/2025/12/18/tech/china-west-ai-chips/
188•artninja1988•6h ago•188 comments

Skills for organizations, partners, the ecosystem

https://claude.com/blog/organization-skills-and-directory
226•adocomplete•8h ago•138 comments

Classical statues were not painted horribly

https://worksinprogress.co/issue/were-classical-statues-painted-horribly/
550•bensouthwood•12h ago•263 comments

Great ideas in theoretical computer science

https://www.cs251.com/
29•sebg•2h ago•1 comment

T5Gemma 2: The next generation of encoder-decoder models

https://blog.google/technology/developers/t5gemma-2/
91•milomg•5h ago•16 comments

AI vending machine was tricked into giving away everything

https://kottke.org/25/12/this-ai-vending-machine-was-tricked-into-giving-away-everything
60•duggan•3h ago•2 comments

Show HN: Picknplace.js, an alternative to drag-and-drop

https://jgthms.com/picknplace.js/
110•bbx•2d ago•61 comments

Delty (YC X25) Is Hiring an ML Engineer

https://www.ycombinator.com/companies/delty/jobs/MDeC49o-machine-learning-engineer
1•lalitkundu•4h ago

Firefox will have an option to disable all AI features

https://mastodon.social/@firefoxwebdevs/115740500373677782
259•twapi•6h ago•226 comments

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn)

https://github.com/vivienhenz24/fuzzy-canary
117•misterchocolat•2d ago•87 comments

How did IRC ping timeouts end up in a lawsuit?

https://mjg59.dreamwidth.org/73777.html
122•dvaun•1d ago•13 comments

FunctionGemma 270M Model

https://blog.google/technology/developers/functiongemma/
148•mariobm•6h ago•39 comments

The Scottish Highlands, the Appalachians, Atlas are the same mountain range

https://vividmaps.com/central-pangean-mountains/
82•lifeisstillgood•5h ago•21 comments

I've been writing ring buffers wrong all these years (2016)

https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/
58•flaghacker•2d ago•22 comments

Meta Segment Anything Model Audio

https://ai.meta.com/samaudio/
137•megaman821•2d ago•19 comments

How to hack Discord, Vercel and more with one easy trick

https://kibty.town/blog/mintlify/
106•todsacerdoti•5h ago•18 comments

Your job is to deliver code you have proven to work

https://simonwillison.net/2025/Dec/18/code-proven-to-work/
619•simonw•10h ago•525 comments

The Code That Revolutionized Orbital Simulation [video]

https://www.youtube.com/watch?v=nCg3aXn5F3M
5•surprisetalk•4d ago•0 comments

Show HN: Learning a Language Using Only Words You Know

https://simedw.com/2025/12/15/langseed/
29•simedw•3d ago•8 comments

TRELLIS.2: state-of-the-art large 3D generative model (4B)

https://github.com/microsoft/TRELLIS.2
57•dvrp•2d ago•12 comments

Using TypeScript to obtain one of the rarest license plates

https://www.jack.bio/blog/licenseplate
140•lafond•10h ago•145 comments

The Legacy of Nicaea

https://hedgehogreview.com/web-features/thr/posts/the-legacy-of-nicaea
33•diodorus•5d ago•23 comments

Local WYSIWYG Markdown, mockup, data model editor powered by Claude Code

https://nimbalyst.com
14•wek•4h ago•5 comments

Please just try HTMX

http://pleasejusttryhtmx.com/
427•iNic•10h ago•361 comments

1.5 TB of VRAM on Mac Studio – RDMA over Thunderbolt 5

https://www.jeffgeerling.com/blog/2025/15-tb-vram-on-mac-studio-rdma-over-thunderbolt-5
142•rbanffy•2h ago

Comments

behnamoh•1h ago
My expectations from M5 Max/Ultra devices:

- Something like a DGX QSFP link (200Gb/s, 400Gb/s) instead of TB5. Otherwise, the economics of this RDMA setup, while impressive, don't make sense.

- Neural accelerators to get prompt prefill time down. I don't expect RTX 6000 Pro speeds, but something like 3090/4090 would be nice.

- 1TB of unified memory in the maxed out version of Mac Studio. I'd rather invest in more RAM than more devices (centralized will always be faster than distributed).

- +1TB/s bandwidth. For the past 3 generations, the speed has been 800GB/s...

- The ability to overclock the system? I know it probably will never happen, but my expectation of Mac Studio is not the same as a laptop, and I'm TOTALLY okay with it consuming +600W energy. Currently it's capped at ~250W.

Also, as the OP noted, this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!! All the more reason for Apple to invest in something like QSFP.

tylerflick•1h ago
> TOTALLY okay with it consuming +600W energy

The 2019 i9 Macbook Pro has entered the chat.

burnt-resistor•1h ago
Apple has always sucked at embracing properly robust tech in high-end gear for markets beyond individual prosumers or creatives. When Xserves existed, they used commodity IDE drives without HA or replaceable PSUs and couldn't compete with contemporary enterprise servers (HP-Compaq/Dell/IBM/Fujitsu). The Xserve RAID half-heartedly used Fibre Channel but couldn't touch a NetApp or EMC SAN/filer. I'm disappointed Apple has a persistent blind spot preventing them from succeeding in the data-center-quality gear category, when they could've had virtualized servers, networking, and storage, the kinds of things that would eventually find their way into my home lab after 5-7 years.
angoragoats•1h ago
> Also, as the OP noted, this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!! All the more reason for Apple to invest in something like QSFP.

This isn’t any different with QSFP unless you’re suggesting that one adds a 200GbE switch to the mix, which:

* Adds thousands of dollars of cost,

* Adds 150W or more of power usage and the accompanying loud fan noise that comes with that,

* And perhaps most importantly adds measurable latency to a networking stack that is already higher latency than the RDMA approach used by the TB5 setup in the OP.

fenced_load•1h ago
Mikrotik has a switch that can do 6x200g for ~$1300 and <150W.

https://www.bhphotovideo.com/c/product/1926851-REG/mikrotik_...

angoragoats•21m ago
Cool! So for marginally less in cost and power usage than the numbers I quoted, you can get 2 more machines than with the RDMA setup. And you’ve still not solved the thing that I called out as the most important drawback.
nicky_nickell•4m ago
how significant is the latency hit?
wtallis•11m ago
That switch appears to have 2x 400G ports, 2x 200G ports, 8x 50G ports, and a pair of 10G ports. So unless it allows bonding together the 50G ports (which the switch silicon probably supports at some level), it's not going to get you more than four machines connected at 200+ Gbps.
angoragoats•53s ago
As with most 40+GbE ports, the 400Gbit ports can be split into 2x200Gbit ports with the use of special cables. So you can connect a total of 6 machines at 200Gbit.
zozbot234•1h ago
> Neural accelerators to get prompt prefill time down.

Apple Neural Engine is a thing already, with support for multiply-accumulate on INT8 and FP16. AI inference frameworks need to add support for it.

> this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!!

Do you really need a fully connected mesh? Doesn't Thunderbolt just show up as a network connection that RDMA is run on top of?

fooblaster•1h ago
Might be helpful if they actually provided a programming model for the ANE that isn't ONNX. The ANE not having a native development model just means software support will not be great.
liuliu•51m ago
They were talking about neural accelerators (a silicon piece on GPU): https://releases.drawthings.ai/p/metal-flashattention-v25-w-...
csdreamer7•44m ago
> Apple Neural Engine is a thing already, with support for multiply-accumulate on INT8 and FP16. AI inference frameworks need to add support for it.

Or, Apple could pay for the engineers to add it.

pdpi•13m ago
> Do you really need a fully connected mesh? Doesn't Thunderbolt just show up as a network connection that RDMA is ran on top of?

If you daisy-chain four nodes, then traffic between nodes #1 and #4 eats up all of nodes #2 and #3's bandwidth, and you pay a big latency penalty. So, absent a switch, the fully connected mesh is the only way to have fast access to all the memory.
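
A minimal sketch of the topology math, assuming (as the thread describes) that each pair of Macs needs its own dedicated TB5 link in the full mesh:

# Hypothetical illustration of why a fully connected TB5 mesh caps out quickly.
# Assumes every node needs a direct link to every other node.
def mesh_requirements(nodes: int) -> tuple[int, int]:
    """Return (TB5 ports needed per node, total cables) for a full mesh."""
    ports_per_node = nodes - 1
    total_links = nodes * (nodes - 1) // 2
    return ports_per_node, total_links

for n in range(2, 7):
    ports, links = mesh_requirements(n)
    print(f"{n} nodes: {ports} ports per node, {links} cables")
# With only a few usable TB5 ports per machine (and the 4-device limit noted in the
# thread), a full mesh stops scaling quickly, which is why a switched fabric
# (QSFP/Ethernet) becomes attractive beyond that.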

Dylan16807•1h ago
> +1TB/s bandwidth. For the past 3 generations, the speed has been 800GB/s...

M4 already hit the necessary speed per channel, and M5 is well above it. If they actually release an Ultra, that much bandwidth is guaranteed on the full version. Even the smaller version with 25% fewer memory channels will be pretty close.

We already know Max won't get anywhere near 1TB/s since Max is half of an Ultra.
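
A back-of-the-envelope sketch of where those figures come from; the bus widths and LPDDR5X data rates below are illustrative assumptions, not confirmed Apple specs:

# Rough memory-bandwidth arithmetic: GB/s = (bus width in bits / 8) * transfer rate (GT/s).
# All figures are assumptions for illustration, not official specifications.
def bandwidth_gb_s(bus_bits: int, giga_transfers_per_s: float) -> float:
    return bus_bits / 8 * giga_transfers_per_s

print(bandwidth_gb_s(1024, 6.4))  # ~819 GB/s, roughly the current Ultra-class figure
print(bandwidth_gb_s(1024, 9.6))  # ~1229 GB/s, a hypothetical full M5 Ultra
print(bandwidth_gb_s(768, 9.6))   # ~922 GB/s, a hypothetical variant with 25% fewer channels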

delaminator•1h ago
> Working with some of these huge models, I can see how AI has some use, especially if it's under my own local control. But it'll be a long time before I put much trust in what I get out of it—I treat it like I do Wikipedia. Maybe good for a jumping-off point, but don't ever let AI replace your ability to think critically!

It is a little sad that they gave someone an uber machine and this was the best he could come up with.

Question answering is interesting but not the most interesting thing one can do, especially with a home rig.

The realm of the possible:

- Video generation: CogVideoX at full resolution, longer clips; Mochi or Hunyuan Video with extended duration

- Image generation at scale: FLUX batch generation — 50 images simultaneously

- Fine-tuning: actually train something — show LoRA on a 400B model, or full fine-tuning on a 70B

but I suppose "You have it for the weekend" means chatbot go brrrrr and snark

theshrike79•1h ago
Yea, I don't understand why people use LLMs for "facts". You can get them from Wikipedia or a book.

Use them for something creative, write a short story on spec, generate images.

Or, the best option: give it tools and let it actually DO something, like "read my message history with my wife, find the top 5 gift ideas she might have hinted at, and search for options to purchase them." That's perfect for a local model; there's no way in hell I'd feed my messages to a public LLM, but the one sitting next to me that I can turn off the second it twitches the wrong way? Sure.

benjismith•53m ago
> show LoRA on a 400B model, or full fine-tuning on a 70B

Yeah, that's what I wanted to see too.

newsclues•1h ago
https://m.youtube.com/watch?v=4l4UWZGxvoc

Seems like the ecosystem is rapidly evolving

mmorse1217•1h ago
Hey Jeff, wherever you are: this is awesome work! I’ve wanted to try something like this for a while and was very excited for the RDMA over thunderbolt news.

But I mostly want to say thanks for everything you do. Your good vibes are deeply appreciated and you are an inspiration.

rahimnathwani•1h ago
The largest nodes in his cluster each have 512GB RAM. DeepSeek V3.1 is a 671B parameter model whose weights take up 700GB RAM: https://huggingface.co/deepseek-ai/DeepSeek-V3.1

I would have expected that going from one node (which can't hold the weights in RAM) to two nodes would have increased inference speed by more than the measured 32% (21.1t/s -> 27.8t/s).

With no constraint on RAM (4 nodes) the inference speed is less than 50% faster than with only 512GB.

Am I missing something?

zeusk•1h ago
the TB5 link (RDMA) is much slower than direct access to system memory
elorant•1h ago
You only get 80Gbps of network bandwidth. There's your bottleneck right there. InfiniBand, in comparison, can give you up to 10x that.
zozbot234•1h ago
Weights are read-only data, so they can just be memory-mapped and reside on SSD (only a small fraction will be needed in VRAM at any given time); the real constraint is activations. The MoE architecture should help quite a bit here.
hu3•1h ago
> only a small fraction will be needed in VRAM at any given time

I don't think that's true. At least not without heavy performance loss in which case "just be memory mapped" is doing a lot of work here.

By that logic GPUs could run models much larger than their VRAM would otherwise allow, which doesn't seem to be the case unless heavy quantization is involved.

zozbot234•34m ago
Existing GPU APIs are sadly not conducive to this kind of memory mapping with automated swap-in. The closest thing you get, AIUI, is "sparse" allocations in VRAM, such that only a small fraction of your "virtual address space" equivalent is mapped to real data, and the mapping can be dynamic.
Dylan16807•50m ago
You need all the weights every token, so even with optimal splitting the fraction of the weights you can farm out to an SSD is proportional to how fast your SSD is compared to your RAM.

You'd need to be in a weirdly compute-limited situation before you can replace significant amounts of RAM with SSD, unless I'm missing something big.

> MoE architecture should help quite a bit here.

In that you're actually using a smaller model and swapping between them less frequently, sure.
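
A rough sketch of that proportionality, with purely illustrative bandwidth figures (real numbers depend on the machine and the SSD):

# Illustrative only: if every weight is read once per token (dense-model assumption),
# per-token time is bytes_in_ram / ram_bw + bytes_on_ssd / ssd_bw, so the slow tier
# dominates almost immediately.
ram_bw = 800e9   # ~800 GB/s unified memory (assumed)
ssd_bw = 6e9     # ~6 GB/s NVMe sequential read (assumed)
total_weights = 700e9  # ~700 GB of weights, the DeepSeek V3.1 figure from above

for ssd_fraction in (0.00, 0.05, 0.27):
    ssd_bytes = total_weights * ssd_fraction
    ram_bytes = total_weights - ssd_bytes
    seconds_per_token = ram_bytes / ram_bw + ssd_bytes / ssd_bw
    print(f"{ssd_fraction:.0%} on SSD -> {1 / seconds_per_token:.2f} tokens/s upper bound")
# MoE sparsity changes the absolute numbers (only active experts are read per token),
# but the proportionality argument stays the same.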

rahimnathwani•31m ago
Even with MoE you still need enough memory to load all experts. For each token, only 8 experts (out of 256) are activated, but which experts are chosen changes dynamically based on the input. This means you'll be constantly loading and unloading experts from disk.

MoE is great for distributed deployments, because you can maintain a distribution of experts that matches your workload, and you can try to saturate each expert and thereby saturate each node.

zozbot234•21m ago
Loading and unloading data from disk is highly preferable to sending the same amount of data over a bottlenecked Thunderbolt 5 connection.
rahimnathwani•5m ago
No it's not.

With a cluster of two 512GB nodes, you have to send half the weights (350GB) over a TB5 connection. But you have to do this exactly once on startup.

With a single 512GB node, you'll be loading weights from disk each time you need a different expert, potentially for each token. Depending on how many experts you're loading, you might be loading 2GB to 20GB from disk each time.

Unless you're going to shut down your computer after generating a couple of hundred tokens, the cluster wins.
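
A rough break-even comparison under those assumptions (the link speed, SSD speed, and per-token expert volume below are illustrative):

# Illustrative comparison: one-time weight transfer over TB5 vs. per-token expert loads from disk.
tb5_bw = 10e9   # TB5 at ~80 Gbps of usable data, i.e. ~10 GB/s (assumed)
ssd_bw = 6e9    # ~6 GB/s NVMe read (assumed)

one_time_transfer_s = 350e9 / tb5_bw   # half the weights, sent once at startup
per_token_disk_s = 10e9 / ssd_bw       # ~10 GB of non-resident experts per token (assumed)

print(f"startup cost: {one_time_transfer_s:.0f} s")
print(f"per-token disk cost: {per_token_disk_s:.1f} s")
print(f"cluster breaks even after ~{one_time_transfer_s / per_token_disk_s:.0f} tokens")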

lvl155•1h ago
Seriously, Jeff has the best job. Him and STH Patrick.
geerlingguy•1h ago
I got to spend a day with Patrick this week, and try out his massive CyPerf testing rig with multiple 800 Gbps ConnectX-8 cards!
andy99•1h ago
Very cool. I'm probably thinking too much, but why are they seemingly hyping this now (I've seen a bunch of it recently) with no M5 Max/Ultra machines in sight? Is it because their release is imminent (I have heard Q1 2026), or is it to try and stretch out demand for the M4 Max / M3 Ultra? I plan to buy one (not four) but would feel like I'm buying something that's going to be immediately out of date if I don't wait for the M5.
GeekyBear•1h ago
I imagine that they want to give developers time to get their RDMA support stabilized, so third party software will be ready to take advantage of RDMA when the M5 Ultra lands.

I definitely would not be buying an M3 Ultra right now on my own dime.

fooblaster•1h ago
Does it actually create a unified memory pool? It looks more like an accelerated backend for a collective communications library like NCCL, which is very much not unified memory.
chis•1h ago
I wonder what motivates Apple to release features like RDMA, which are purely useful for server clusters, while ignoring basic QoL stuff like remote management or rack-mount hardware. It's difficult to see it as a cohesive strategy.

Makes one wonder what Apple uses for their own servers. I guess maybe they have some internal M-series server product they just haven't bothered to release to the public, and features like this are downstream of that?

vsgherzi•1h ago
Last I heard, for the private compute features they were racking and stacking M2 Mac Pros.
xienze•1h ago
> rack mount hardware

I guess they prefer that third parties deal with that. There are rack-mount shelves for Mac Minis and Studios.

mschuster91•14m ago
There's still a lot - particularly remote management, aka iLO in HP lingo - missing for an actual hands-off environment usable for hosters.
jeffbee•1h ago
Thunderbolt RDMA is quite clearly the nuclear option for remote management.
rsync•1h ago
These are my own questions, asked since the first Mac mini was introduced:

- Why is the tooling so lame?

- What do they, themselves, use internally?

Stringing together Mac minis (or a "Studio", whatever) with Thunderbolt cables ... Christ.

hamdingers•1h ago
> I guess maybe they have some internal M-series server product they just haven’t bothered to release to the public, and features like this are downstream of that?

Or do they have some real server-grade product coming down the line, and are releasing this ahead of it so that 3rd party software supports it on launch day?

Retr0id•44m ago
I wonder if there's any possibility that an RDMA expansion device could exist in the future - i.e. a box full of RAM on the other end of a thunderbolt cable. Although I guess such a device would cost almost as much as a mac mini in any case...
xmddmx•32m ago
I was impressed by the lack of dominance of Thunderbolt:

"Next I tested llama.cpp running AI models over 2.5 gigabit Ethernet versus Thunderbolt 5"

Results from that graph showed only a ~10% benefit from TB5 vs. Ethernet.

Note: the M3 Studios support 10Gbps Ethernet, but that wasn't tested; the comparison used 2.5Gbps Ethernet instead.

If 2.5G Ethernet was only 10% slower than TB5, how would 10G Ethernet have fared?

Also, TB5 has to be wired so that every CPU is connected to every other over TB, limiting you to 4 Macs.

By comparison, with Ethernet you could use a hub-and-spoke configuration with an Ethernet switch, theoretically letting you use more than 4 CPUs.
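
For context, the nominal line rates behind that comparison (link speeds on paper, not measured throughput):

# Nominal link speeds for the interconnects discussed in this thread.
links_gbps = {"2.5 GbE": 2.5, "10 GbE": 10.0, "Thunderbolt 5 (data)": 80.0}
for name, gbps in links_gbps.items():
    print(f"{name:>22}: {gbps:5.1f} Gb/s ~= {gbps / 8:5.1f} GB/s")
# If a link with ~1/32 the bandwidth is only ~10% slower end to end, the benchmark is
# mostly latency- or compute-bound, so 10 GbE would likely close much of the remaining gap.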

polynomial•21m ago
BUILD AI has a post about this, in particular sharding the KV cache across GPUs, and how the network is the new memory hierarchy:

https://buildai.substack.com/p/kv-cache-sharding-and-distrib...

gloyoyo•12m ago
Wow. $40k for a friendly chat(bot)...

Hey, at least this post allows us to feel as though we spent the money ourselves.

Bravo!