frontpage.

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
264•theblazehen•2d ago•88 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
27•AlexeyBrin•1h ago•5 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
708•klaussilveira•15h ago•208 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
11•onurkanbkrc•54m ago•1 comment

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
973•xnx•21h ago•559 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
75•jesperordrup•6h ago•32 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
135•matheusalmeida•2d ago•35 comments

Ga68, a GNU Algol 68 Compiler

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
14•matt_d•3d ago•3 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
68•videotopia•4d ago•7 comments

Welcome to the Room – A lesson in leadership by Satya Nadella

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
40•kaonwarb•3d ago•30 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
241•isitcontent•16h ago•26 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
45•helloplanets•4d ago•46 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
239•dmpetrov•16h ago•128 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
341•vecti•18h ago•152 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
506•todsacerdoti•23h ago•248 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
390•ostacke•22h ago•99 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
306•eljojo•18h ago•189 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•186 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
430•lstoll•22h ago•284 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
3•andmarios•4d ago•1 comment

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
25•bikenaga•3d ago•12 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
71•kmm•5d ago•10 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
96•quibono•4d ago•22 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
26•1vuio0pswjnm7•2h ago•18 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
271•i5heu•18h ago•221 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
35•romes•4d ago•3 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1080•cdrnsf•1d ago•464 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
309•surprisetalk•3d ago•45 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
65•gfortaine•13h ago•30 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
154•vmatsiiako•21h ago•73 comments

Benchmark Framework Desktop Mainboard and 4-node cluster

https://github.com/geerlingguy/ollama-benchmark/issues/21
203•geerlingguy•6mo ago
https://www.jeffgeerling.com/blog/2025/i-clustered-four-fram...

Comments

jeffbee•6mo ago
I had been hoping that these would be a bit faster than the 9950X because of the different memory architecture, but it appears that due to the lower power design point the AI Max+ 395 loses across the board, by large margins. So I guess these really are niche products for ML users only, and people with generic workloads that want more than the 9950X offers are shopping for a Threadripper.
dijit•6mo ago
Sounds about right.

I’m struggling to justify the cost of a Threadripper (let alone a Pro!) for an AAA game studio though.

I wonder who can justify these machines. High-frequency trading? Data science? Shouldn’t that be done on servers?

jeffbee•6mo ago
Yeah I don't get it either. To get marginally more resources than the 9950X you have to make a significant leap in price to a $1500+ CPU on a $1000 motherboard.
wpm•6mo ago
No, you get it.

This is pure market segmentation. If you need that little bit extra, you’re forced to compromise, or to open your wallet big time, and AMD is betting on people who “really” need that slightly extra oomph to pay.

toast0•6mo ago
The premium is about getting more I/O: more memory channels, a lot more lanes. Maybe also a very high power limit.

If you need that in a single system, you gotta pay up. Lower tier SP6 processors are actually pretty reasonably priced, boards are still spendy though.

sliken•6mo ago
Doubly so when a $650 Epyc motherboard and a $650 Epyc CPU get you 3x the Threadripper's bandwidth and 1.5x the Threadripper Pro's bandwidth.

Seems like Threadripper is too low-volume to be price-competitive with Epyc, and there's a relatively price-insensitive workstation market out there.

kadoban•6mo ago
Threadripper very rarely seems to make any sense. The only times it seems like you want it are for huge memory support/bandwidth and/or a huge number of PCIe slots. But it's not cheap or supported enough compared to Epyc to really make sense to me any time I've been speccing out a system along those lines.
StrangeDoctor•6mo ago
I bought a threadripper pro system out of desperation, trying to get secondhand PCIe 80G A100s to run locally. The huge rebar allocations confused/crashed every Intel/AMD system I had access to.

I think the Xeon systems should have worked and that it was actually a motherboard BIOS issue, but I had seen a photo of it running in a Threadripper and prayed I wasn’t digging an even deeper hole.

kadoban•6mo ago
Yeah, that makes sense if you just have ~proof that some configuration works and want to just be done with it.
jeffbee•6mo ago
This is why a business like Puget Systems, or a line like HP Z Workstations, persists. You know in advance that your rig will work.
sliken•6mo ago
I've been tempted, but had a hard time finding a case where I needed more than the 9950X but less than a single-socket Epyc. Especially since the Epyc motherboards are cheaper, the CPUs are cheaper, and the Epycs have 3x the memory bandwidth of the Threadripper and 1.5x the memory bandwidth of the Threadripper Pro.
rtkwe•6mo ago
It also seems like the tools aren't there to fully utilize them. Unless I misunderstood, he was running off the CPU only for all the tests, so there's still the iGPU and NPU performance that hasn't been utilized in these tests.
geerlingguy•6mo ago
No, only a couple initial tests with Ollama used CPU. I ran most tests on Vulkan / iGPU, and some on ROCm (read further down the thread).

I found it difficult to install ROCm on Fedora 42 but after upgrading to Rawhide it was easy, so I re-tested everything with ROCm vs Vulkan.

Ollama, for some silly reason, doesn't support Vulkan even though I've used a fork many times to get full GPU acceleration with it on Pi, Ampere, and even this AMD system... (moral of the story: just stick with llama.cpp).

edwinjones•6mo ago
Sadly, the reason they give is subjectively terrible:

https://x.com/ollama/status/1952783981000446029

No experimental flag option, no "you can use the fork that works fine but we don't have capacity to support this," just a hard "no, we think it's unreliable." I guess they just want you to drop them and use llama.cpp.

geerlingguy•6mo ago
Yeah, my conspiracy theory is Nvidia is somehow influencing the decision. If you can do Vulkan with Ollama, it opens up people to using Intel/AMD/other iGPUs and you might not be incentivized to buy an Nvidia GPU.

ROCm support is not wonderful. It's certainly worse for an end user to deal with than Vulkan, which usually 'just works'.

edwinjones•6mo ago
I agree. AMD should just go all in on Vulkan, I think. The ROCm compatibility list is terrible compared to Vulkan's: every modern device, and probably some ancient GPUs as well, can be made to work with Vulkan.

Considering they created Mantle, you would think it would be the obvious move too.

MindSpunk•6mo ago
Vulkan is Mantle. Vulkan was developed out of the original Mantle API that AMD brought to Khronos. What do you mean "AMD should just go all in on Vulkan"? They've been "all in" on Vulkan from the beginning because they were one of the lead authors of the API.
dagmx•6mo ago
Vulkan is a derivative of Mantle, sure, but it is quite different from what Mantle was.

There was a period in between where AMD had basically EOL’d Mantle and Vulkan wasn’t even in the works yet.

edwinjones•6mo ago
I would say Vulkan derives from Mantle; Mantle development stopped some time ago.
zozbot234•6mo ago
iGPUs (and NPUs) are not very useful for LLM inference, they only help somewhat in the prompt pre-processing phase. The CPU has worse bulk compute but far better access to system memory bandwidth, so it wins in token generation where that's the main factor.

My conspiracy theory is that it would help if contributors kept the proposed Vulkan Compute support up to date with new Ollama versions; no maintainer wants to deal with out-of-date pull requests.
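
To make that prefill/decode split concrete, here's a rough roofline-style sketch in Python. All the numbers (bandwidth, FLOPs, model size, bytes per weight) are illustrative assumptions for a Strix Halo-class box, not measurements, and the helper function is hypothetical:

    # Why prompt processing (prefill) leans on compute while token generation
    # (decode) leans on memory bandwidth. All numbers are assumptions.
    def tokens_per_sec(params_b, bytes_per_param, batch_tokens, flops, bandwidth):
        """Throughput is the slower of the compute limit and the memory limit."""
        weight_bytes = params_b * 1e9 * bytes_per_param
        compute_limit = flops / (2 * params_b * 1e9)            # ~2 FLOPs per param per token
        memory_limit = bandwidth / weight_bytes * batch_tokens  # weights stream once per batch
        return min(compute_limit, memory_limit)

    MODEL_B, Q4_BYTES = 27, 0.55           # ~27B params at ~4.4 effective bits/weight
    BANDWIDTH = 256e9                      # shared LPDDR5X, ~256 GB/s (assumed)
    IGPU_FLOPS, CPU_FLOPS = 30e12, 2e12    # rough FP16 throughput guesses

    # Decode (1 token at a time): memory-bound either way, so the iGPU barely helps.
    print(tokens_per_sec(MODEL_B, Q4_BYTES, 1, CPU_FLOPS, BANDWIDTH))    # ~17 tok/s
    print(tokens_per_sec(MODEL_B, Q4_BYTES, 1, IGPU_FLOPS, BANDWIDTH))   # ~17 tok/s
    # Prefill (512 prompt tokens share one pass over the weights): compute-bound,
    # which is exactly where the iGPU/NPU pays off.
    print(tokens_per_sec(MODEL_B, Q4_BYTES, 512, CPU_FLOPS, BANDWIDTH))  # ~37 tok/s
    print(tokens_per_sec(MODEL_B, Q4_BYTES, 512, IGPU_FLOPS, BANDWIDTH)) # ~555 tok/s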

jcastro•6mo ago
Hi Jeff, I'm a Linux ambassador for Framework and I have one of these units. It'd be interesting if you would install ramalama on Fedora and test that. I've been using it as a drop-in replacement for ollama and everything was GPU accelerated out of the box. It pulls ROCm from a container and just figures it out, etc. Would love to see actual numbers though.

Great work on this!

nicolaslem•6mo ago
I gave it a try on my Strix Halo laptop on Arch. ROCm and everything else worked out of the box with two commands:

    uv tool install ramalama
    ramalama serve <model>
sliken•6mo ago
Across the board, by a large margin? Phoronix ran 200 benchmarks on the 9950X vs. the AI Max+ 395 and found a difference of less than 5%. Not bad considering the average power use was 91 watts vs. 154 watts.

If you need the memory bandwidth, the Strix Halo looks good; if you are cache-friendly and don't care about using almost double the power, then the 9950X is a better deal.

jeffbee•6mo ago
The way Phoronix weights an average score for a machine is ridiculous, because there aren't any users who do all of machine learning, fluid dynamics, video compression, database hosting, software development, and gaming on the same machine. I looked at the applications that matter to me, and the 9950X wins by 40% in those.
mhitza•6mo ago
I've run a comparison benchmark for the smaller models: https://gist.github.com/mhitza/f5a8eeb298feb239de10f9f60f841...

I'm comparing it against the RTX 4000 SFF Ada (20GB), which is around $1.2k (if you believe the original price on the Nvidia website https://marketplace.nvidia.com/en-us/enterprise/laptops-work...) and which I have access to on a Hetzner GEX44.

I'm going to ballpark it at 2.5-3x faster than the desktop, except for the tg128 test, where the difference is "minimal" (but I didn't do the math).

yencabulator•6mo ago
The whole point of these integrated memory designs is to go beyond that 20 GB VRAM.
throwdbaaway•6mo ago
Actually, you can combine them. Compared to a Mac Studio, the main advantage of these Strix Halo boxes is that you can still add a bunch of eGPUs over USB4/OCuLink, for better PP (prompt processing).
Tsiklon•6mo ago
I see Wendell of Level1Techs combines the two in his video on this system.

Theoretically you can have the best of both worlds if you don’t mind running an OCuLink eGPU enclosure.

https://youtu.be/L-xgMQ-7lW0

reissbaker•6mo ago
Thanks for the excellent writeup. I'm pleasantly surprised that ROCm worked as well as it did — for the price these aren't bad for LLM workloads and some moderate gaming. (Apple is probably still the king of affordable at-home inference, but for games... Apple gaming is amazing these days, but Linux is so much better.)
mulmen•6mo ago
I switched to Fedora Sway as my daily driver nearly two years ago. A Windows title wasn’t working on my brand new PC. I switched to Steam+Proton+Fedora and it worked immediately. Valve now offers a more stable and complete Windows API through Proton than Microsoft does through Windows itself.
xemdetia•6mo ago
I was about to be annoyed until you said you got preprod units. I guess I'll have to build on this when my desktop shows up.
iamtheworstdev•6mo ago
For those who are already in the field and doing these things: if I wanted to start running my own local LLM, should I find an Nvidia 5080 GPU for my current desktop, or is it worth trying one of these Framework AMD desktops?
wmf•6mo ago
If you think the future is small models (27B) get Nvidia; if you think larger models (70-120B) are worth it then you need AMD or Apple.
yencabulator•6mo ago
I wonder how much MoE will disrupt this. qwen3:30b-a3b is pretty good even on pure CPU, but a lot smarter than a 3B parameter model. If the CPU-GPU bottleneck isn't too tight, a large model might be able to sustainably cache the currently active experts in GPU RAM.
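
To put rough numbers behind that: MoE footprint is set by total parameters, but per-token memory traffic is set by active parameters. The figures below (quantization overhead, CPU bandwidth) are assumptions for illustration:

    TOTAL_B, ACTIVE_B = 30.5e9, 3.3e9   # qwen3-30b-a3b: total vs. active parameters
    BYTES_PER_PARAM = 0.55              # ~Q4 quantization plus overhead (assumed)
    CPU_BW = 100e9                      # ~100 GB/s usable system bandwidth (assumed)

    print(TOTAL_B * BYTES_PER_PARAM / 1e9)         # ~17 GB RAM footprint
    print(CPU_BW / (ACTIVE_B * BYTES_PER_PARAM))   # ~55 tok/s decode ceiling
    # A dense 30B model would stream ~10x more bytes per token, so ~5-6 tok/s on
    # the same CPU; that gap is what caching the active experts in GPU RAM would
    # be trying to preserve.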
hengheng•6mo ago
The recent qwen3 models run fine on CPU + GPU, and so does gpt-oss. LM Studio and Ollama are turnkey solutions where the user has to know nothing about memory management. But finding benchmarks for these hybrid setups is astonishingly difficult.

I keep thinking that the bottleneck has to be CPU RAM, and for a large model the difference would be minor. For example, with a 100 GByte model such as quantised gpt-oss-120B, I imagine that going from 10 GB to 24 GB of VRAM would scale up my tk/s like 1/90 -> 1/76, so a 20% advantage? But I can't find much on the high-level scaling math. People seem to either create calculators that oversimplify, or they go too deep into the weeds.

I'd like a new anandtech please.
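
For what it's worth, the high-level scaling math sketches out roughly like the snippet below, assuming decode is purely memory-bandwidth-bound and ignoring KV cache; the bandwidth figures are placeholders, not measurements:

    # Hybrid CPU+GPU offload: per-token time = GPU-resident bytes / GPU bandwidth
    #                                        + CPU-resident bytes / system bandwidth.
    def tok_per_sec(model_gb, vram_gb, gpu_bw=900.0, cpu_bw=80.0):  # GB/s, assumed
        gpu_part = min(model_gb, vram_gb)
        cpu_part = max(model_gb - vram_gb, 0.0)
        return 1.0 / (gpu_part / gpu_bw + cpu_part / cpu_bw)

    small = tok_per_sec(100, 10)   # ~0.9 tok/s with 10 GB of VRAM
    big = tok_per_sec(100, 24)     # ~1.0 tok/s with 24 GB of VRAM
    print(f"{(big / small - 1) * 100:.0f}% faster")  # ~16%, close to the 1/90 vs 1/76 (~18%) guess above

So the intuition is about right: for a 100 GB model, going from 10 GB to 24 GB of VRAM only buys on the order of 15-20%, because the CPU-resident remainder still dominates the per-token time.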

whizzter•6mo ago
Doesn't matter, people will always find ways to eat RAM despite finding more clever ways to do things.
yencabulator•6mo ago
MoE eats the same amount of RAM but accesses less of it.
sliken•6mo ago
More accurately, it's the same amount of RAM, but accessed in a cache-friendly manner with greater locality.
loudmax•6mo ago
The short answer is that the best value is a used RTX 3090 (the long answer being, naturally, it depends). Most of the time, the bottleneck for running LLMs on consumer grade equipment is memory and memory bandwidth. A 3090 has 24GB of VRAM, while a 5080 only has 16GB of VRAM. For models that can fit inside 16GB of VRAM, the 5080 will certainly be faster than the 3090, but the 3090 can run models that simply won't fit on a 5080. You can offload part of the model onto the CPU and system RAM, but running a model on a desktop CPU is an enormous drag, even when only partially offloaded.

Obviously an RTX 5090 with 32GB of VRAM is even better, but they cost around $2000, if you can find one.

What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.
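
To put rough numbers on "what fits where" (simple arithmetic with assumed bytes-per-weight for each quantization; real model files vary a bit, and KV cache needs extra room on top):

    models = {"27B": 27e9, "70B": 70e9, "120B": 120e9}
    quants = {"Q4": 0.56, "Q8": 1.06, "FP16": 2.0}   # approx. bytes per parameter
    budgets = {"5080 16GB": 16, "3090 24GB": 24, "5090 32GB": 32, "Strix Halo ~108GB": 108}

    for mname, params in models.items():
        for qname, bpp in quants.items():
            gb = params * bpp / 1e9
            fits = [name for name, cap in budgets.items() if gb <= cap]
            print(f"{mname} {qname}: ~{gb:.0f} GB -> {', '.join(fits) or 'none of these'}")

A 70B model at Q4 (~39 GB) already needs the Strix Halo or multiple GPUs, and a 100GB+ model doesn't fit on any single consumer card at a useful quantization, which is the whole pitch for the 128 GB unified-memory box.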

cpburns2009•6mo ago
Just a point of clarification. I believe the 128GB Strix Halo can only allocate up to 96GB of RAM to the GPU.
geerlingguy•6mo ago
108 GB or so under Linux.

The BIOS allows pre-allocating 96 GB max, and I'm not sure if that's the maximum for Windows, but under Linux, you can use `amdttm.pages_limit` and `amdttm.page_pool_size` [1]

[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...

amstan•6mo ago
I have been doing a couple of tests with PyTorch allocations; it let me go as high as 120GB [1] (assuming the allocations were small enough) without crashing. The main limitation was mostly the remaining system memory:

    htpc@htpc:~% free -h
                   total        used        free      shared  buff/cache   available
    Mem:           125Gi       123Gi       920Mi        66Mi       1.6Gi       1.4Gi
    Swap:           19Gi       4.0Ki        19Gi
[1] https://bpa.st/LZZQ
cpburns2009•6mo ago
Thanks for the correction. I was under the impression the GPU memory had to be preallocated in the BIOS, and 96 GB was the maximum number I read about.
sliken•6mo ago
Some older software stacks require static allocation in the BIOS, but things are moving pretty quickly and now allow dynamic allocation: newer versions of (or patches to) PyTorch, Ollama, and related tools, which I think might depend on a newer kernel (6.13 or so). It does seem like there's been quite a bit of progress in the last month.
lhl•6mo ago
In Linux, you can allocate as much as you want with `ttm`:

In 4K pages for example:

    options ttm pages_limit=31457280
    options ttm page_pool_size=15728640
This will allow up to 120GB to be allocated and pre-allocate 60GB (you could preallocate none or all depending on your needs and fragmentation size). I believe `amdgpu.vm_fragment_size=9` (2MiB) is optimal.
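
Those magic numbers are just page counts; converting them back to bytes as a sanity check (assuming the 4 KiB page size the limits are counted in):

    PAGE = 4096                          # ttm limits are counted in 4 KiB pages
    print(31457280 * PAGE / 2**30)       # 120.0 GiB allocatable ceiling
    print(15728640 * PAGE / 2**30)       # 60.0 GiB pre-allocated pool
    print(PAGE * 2**9 / 2**20)           # vm_fragment_size=9 -> 4 KiB * 2^9 = 2.0 MiB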
behohippy•6mo ago
Used 3090s have been getting expensive in some markets. Another option is dual 5060 Ti 16 GB cards. Mine are lower-powered, single 8-pin power, so they max out around 180W. With that I'm getting 80 t/s on the new Qwen 3 30B A3B models, and around 21 t/s on Gemma 27B with vision. Cheap and cheerful setup if you can find the cards at MSRP.
KronisLV•6mo ago
For comparison, at work we got a pair of Nvidia L4 GPUs: https://www.techpowerup.com/gpu-specs/l4.c4091

That gives us a total TDP of around 150W, 48 GB of VRAM and we can run Qwen 3 Coder 30B A3B at 4bit quantization with up to 32k context at around 60-70 t/s with Ollama. I also tried out vLLM, but the performance surprisingly wasn't much better (maybe under bigger concurrent load). Felt like sharing the data point, because of similarity.

Honestly it's a really good model, even good enough for some basic agentic use (e.g. with Aider, RooCode and so on), MoE seems the way to go for somewhat limited hardware setups.

Of course I'm not recommending L4 cards, since they have a pretty steep price tag. Most consumer cards feel a bit power-hungry and you'll probably need more than one to fit decent models, though being able to game with the same hardware sounds pretty nice. But speaking of getting more VRAM, the Intel Arc Pro B60 can't come soon enough (if they don't insanely overprice it), especially the 48 GB variety: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t...

behohippy•5mo ago
Yeah, 48 GB at sub-200W seems like a sweet spot for a single-card setup. Then you can stack as deep as you want to get the size of model you want, for whatever you want to pay for the power bill.
codazoda•5mo ago
I've hatched a plan to run a lightweight AI model on a $149 mini-PC and host it from my bedroom.

I wonder if I could follow that up by buying a 3090 (jumping the price by $1000 plus whatever I plug it into) and contrasting the difference. Could be an eye opening experiment for me.

Here's the write up of my plan for the cheap machine if anyone is interested.

https://joeldare.com/my_plan_to_build_an_ai_chat_bot_in_my_b...

Havoc•6mo ago
Jeff - check out the distributed-llama project... you should be able to distribute over the entire cluster.
burnte•6mo ago
He mentioned that in the video.
yjftsjthsd-h•6mo ago
https://github.com/b4rtaz/distributed-llama ?
geerlingguy•6mo ago
I've been testing Exo (seems dead), llama.cpp RPC (has a lot of performance limitations) and distributed-llama (faster but has some Vulkan quirks and only works with a few models).

See my AI cluster automation setup here: https://github.com/geerlingguy/beowulf-ai-cluster

I was building that through the course of making this video, because it's insane how much manual labor people put into building home AI clusters :D

jvanderbot•6mo ago
So, TL;DR?

I saw mixed results but comments suggest very good performance relative to other at-home setups. Can someone summarize?

geerlingguy•6mo ago
I put most of the top-line numbers and some graphs on my blog: https://www.jeffgeerling.com/blog/2025/i-clustered-four-fram...
jvanderbot•6mo ago
Great! As always fantastic writeup
mixmastamyk•6mo ago
Impressive, yet disappointing due to the software. A better use for now is as a compact render farm.
adolph•6mo ago
The Framework Desktop has at least two M.2 connectors for NVMe. I wonder if an interconnect with higher performance than Ethernet or Thunderbolt could be established by using one of the M.2 slots to connect to PCIe via OCuLink?
nrp•6mo ago
There is also a PCIe x4 slot that you can use for other high throughput network options.
adolph•6mo ago
I missed that. Too bad it is under the power cables. It’d be hard to fit something in there using the stock case.
wpm•6mo ago
The stock case doesn’t have a PCIe slot cut out anyways
sliken•6mo ago
Sure, so $100 for an ITX case.
syntaxing•6mo ago
Kinda bummed. I get why he used Ollama, but I feel like using llama.cpp directly would provide better and more consistent results.
mkl•6mo ago
As the article describes, most of this was done with llama.cpp, not Ollama.
syntaxing•6mo ago
Ahh, good catch. I didn't notice that if you scroll lower, he has the llama.cpp results. The ollama-benchmark repo name is a misnomer.
geerlingguy•6mo ago
I'm slowly migrating all my testing to https://github.com/geerlingguy/beowulf-ai-cluster
RossBencina•6mo ago
I heard that ik_llama.cpp performs better for CPU use: https://github.com/ikawrakow/ik_llama.cpp/
nektro•6mo ago
no compilation tests?
geerlingguy•6mo ago
Those are in my SBC-reviews repo: https://github.com/geerlingguy/sbc-reviews/issues/80
nektro•6mo ago
thanks!
_joel•6mo ago
> usually resulting in one word repeating ad infinitum

I've had that happen using Gemini (via Windsurf). It doesn't seem to happen with other models. No idea if there's any correlation, but it's an interesting failure mode.

mattnewton•6mo ago
This is usually a symptom of greedy sampling (always picking the most probable token) on smaller models. It's possible that configuration had different sampling defaults, i.e. it was not using top-p or temperature. I'm not familiar with distributed-llama, but from searching the git repo it looks like it at least takes a --temperature flag and probably has one for top-p.

I'd recommend rerunning the benchmarks with the sampling methods explicitly configured the same in each tool. It's tempting to benchmark with all the nondeterminism turned off, but I think it's less useful, since in practice, for any model you're self-hosting for real work, you're probably going to want top-p sampling or something like it, and you want to benchmark the implementation of that too.

I've never seen Gemini do this though; it'd be kinda wild if they shipped something that samples that way. I wonder if Windsurf was sending a different config over the API or if this was a different bug.
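
For anyone who hasn't looked at this layer, the difference being described is just the last step of decoding. Below is a minimal standalone sketch of greedy vs. temperature + top-p (nucleus) sampling over a vector of logits; it is illustrative Python, not distributed-llama's actual code:

    import numpy as np

    def sample(logits, temperature=0.0, top_p=1.0, rng=np.random.default_rng(0)):
        """Greedy when temperature == 0, otherwise temperature + top-p sampling."""
        logits = np.asarray(logits, dtype=np.float64)
        if temperature == 0.0:
            return int(np.argmax(logits))       # greedy: always the most probable token
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]         # most probable first
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set covering top_p
        kept = probs[keep] / probs[keep].sum()
        return int(rng.choice(keep, p=kept))

    logits = [2.0, 1.9, 0.1, -1.0]
    print(sample(logits))                               # always token 0; loops are easy to hit
    print(sample(logits, temperature=0.8, top_p=0.9))   # tokens 0 and 1 both plausible

With greedy decoding, a model that assigns slightly higher probability to repeating itself will repeat forever; temperature plus top-p keeps some randomness in play while still discarding the low-probability tail, which is why matching these settings across tools matters when comparing outputs.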

mrbungie•6mo ago
Yep, sometimes Gemini for some reason ends up in what I call "ergodic self-flagellation".

Here are some examples: https://www.reddit.com/r/GeminiAI/comments/1lxqbxa/i_am_actu...

justinclift•6mo ago
I've seen that occasionally with one of the deepseek models when using the default Ollama context size of 4096, rather than whatever the model's preferred context size was.

After having that happen, I switched my stuff to check the model's preferred context size, then set the context size to match, before using any given model.

oblio•6mo ago
Those numbers are better than I was expecting.
hinkley•6mo ago
Jeff! Someone needs to make Framework MBs work in a blade arrangement, and you seem to be the likely person to get it done.
chickensong•6mo ago
Related: https://www.printables.com/model/1075458-framework-mainboard...
lifeinthevoid•6mo ago
Setup looks very sexy.
sliken•6mo ago
Apparently the Framework Desktop's 5 Gbit network isn't fast enough to scale well with LLM inference workloads, even for a modest GPU. Anyone know what kind of network is required to scale well for a single modest GPU?
geerlingguy•6mo ago
In the case of llama.cpp's RPC mode, the network isn't the limiting factor for inference, but for distributing layers to nodes.

I was monitoring the network while running various models, and for all models, the first step was to copy over layers (a few gigabytes to 100 or so GB for the huge models), and that would max out the 5 Gbps connection.

But then while warming up and processing, there were only 5-10 Mbps of traffic, so you could do it over a string and tin cans, almost.

But that's a limitation of the current RPC architecture: it can't really parallelize processing, so as I noted in the post and in my video, it kinda uses resources round-robin style, and you can only get worse performance across the entire cluster than on a single node, for any model you can fit on a single node.
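
The arithmetic bears that out; here's the rough version, with the model size and link speeds as assumed examples:

    def copy_seconds(gigabytes, link_gbps):
        return gigabytes * 8 / link_gbps      # GB -> Gb, divided by link speed

    print(copy_seconds(100, 5))    # ~160 s to shard 100 GB of layers over 5 Gbps
    print(copy_seconds(100, 25))   # ~32 s over 25 Gbps
    # Steady-state RPC traffic was only ~5-10 Mbps: per token, just a small
    # activation tensor crosses the wire, so after the initial copy the link
    # speed barely matters for this architecture.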

rtkwe•6mo ago
No network interconnect is going to scale well until you get into the expensive enterprise realm where InfiniBand and other direct-connect copper/fiber reign. The issue is less raw bandwidth than latency. A network is inherently 100x+ slower than memory access, so when you start sharing a memory-intensive workload like an LLM across a normal network it's going to crater your performance unless the work can be chunked to keep communication between nodes to a minimum.
sliken•6mo ago
Really? Seems like scaling is pretty tolerant of latency, but very bandwidth-intensive. Thus the move from IB to various flavors of Ethernet (for AMD's GPUs, Tenstorrent, and various others). Not to mention Broadcom pushing various 50 and 100 Tbit Ethernet switching chips for AI.

Even 25 Gbit these days is pretty affordable for home; if it scaled 5x better than 5 Gbit, that might be enough to make larger models MUCH more practical.

rtkwe•6mo ago
It heavily depends on the workload. If one node needs to frequently interact with the memory on another node, like calculating the output of the weights stored on the other node for the LLM, it's going to be dog slow, because it has to wait 100x as long as it does for local access. If you can batch the work into chunks that mostly get processed on one node and then get passed to another, it can be parallelized easily.

E.g., if the individual layers of your model can fit on one node and the output can be pipelined so work keeps cascading through the various nodes, it'd do well. But because each generated word depends heavily on the previous one, LLMs can't be pipelined that way. You can see it in this [0] image from the attached blog post when he was testing llama.cpp: each node processes a batch of work, passes it off to the next node, then goes idle.

[0] https://www.jeffgeerling.com/sites/default/files/images/fram...
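
A toy model of that round-robin pattern shows why the cluster can't beat a single node for a model that fits on one node (the per-node times and hop latency below are made up for illustration):

    # Autoregressive decoding: token t+1 can't start until token t has passed
    # through every pipeline stage, so only one node is busy at a time.
    def cluster_tok_per_sec(nodes, node_time_s, hop_latency_s):
        per_token = nodes * node_time_s + nodes * hop_latency_s
        return 1.0 / per_token

    print(cluster_tok_per_sec(1, 0.100, 0.000))   # 10.0 tok/s on one node
    print(cluster_tok_per_sec(4, 0.025, 0.001))   # ~9.6 tok/s on four nodes, each idle ~75% of the time

The win from clustering is capacity (running models too big for any single node), not speed, unless the software can batch or otherwise overlap work across nodes.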

jauntywundrkind•6mo ago
Good news: USB4 mandates direct host-to-host connectivity! Something it brought in from Thunderbolt. Hypothetically that should mean 40 Gbit connections, readily available.

There are some folks who use this for clustering. Here's a Reddit thread around Mac systems. The top link is to a really-not-great hub-and-spoke setup (not everyone has BGP skills, alas); I've linked to that, 19m35s in. https://www.reddit.com/r/MacStudio/comments/1mc1z0s/anyone_c... https://youtu.be/Ju0ndy2kwlw?t=19m35s

I do hope that CXL 3.1, with its host-to-host capability, makes glueless scale-out easier. It's hyped as being for accelerators and attached memory, but having a much lower-overhead RDMA-capable fabric in every PCIe+CXL port is very, very alluring. Can't come soon enough! Servers at first, and maybe I'm hopelessly naive here, but I do sort of expect it to show up on consumer hardware too.

bawana•6mo ago
Memory bandwidth sucks compared to the Mac Studio Ultra 3. And you can't add GPUs easily, although as an APU it is impressive and way better than Nvidia's gold box. Wendell said it better. I'm waiting for the Mac Studio Ultra 5.
jauntywundrkind•6mo ago
Yes. And this is so crucial! It's still a huge leap forward for x86. Quad-channel (4x) DDR5-8000 is both double the (client/non-server) channel count and at a blisteringly high clock rate. That's very impressive.

The upcoming Zen 6 Epyc was just confirmed to go from 12 to 16 channels. That'll be very good to see. The Strix Halo successor, Medusa Halo, is supposed to be 6-channel. (Most of these rumors/leaks are via Moore's Law Is Dead, fwiw.) It's absolutely needed to scale to more cores, but it still seems to fall short of what AI demands.

I really can't congratulate Apple enough for being deadly serious about memory bandwidth. What is just gobsmacking to me is that no one else has responded, half a decade later. Put the RAM on the package! DDR, not crazy-expensive HBM. The practice of building superchips out of 4x chips, getting scalability that way, feels so obvious too, and is so commendable!

Different end of the spectrum, but Intel's tablet-size Lakefield had Package-on-Package (PoP) RAM, and it was pretty fast for its day (4266 MHz). But it didn't scale up the width, like Apple has.

It's hard to see x86 seem so stuck, so unable to make what feels like such a necessary push toward more memory bandwidth.

jauntywundrkind•6mo ago
> For networking, I expected more out of the Thunderbolt / USB4 ports, but could only get 10 Gbps.

I really wish we saw more testing of USB subsystems! With PCIe being so limited, there's such allure to having two USB4 ports! But will they work?

IIRC we saw similar very low bandwidth on Apple's ARM chips too. This was during M1 or so; dunno if things got better with that chip or future ones! Presumably so, or I feel like we'd be hearing about it, but these things can also just stay hidden!

It was really cool back in the Ryzen 1 era seeing AMD put some USB on the CPU itself, so it doesn't have to go through the I/O hub (southbridge?) with its limited connection to the CPU. There's a great breakout chart here, showing both the 1800X and the various chipsets available: relishable data. https://www.techpowerup.com/cpu-specs/ryzen-7-1800x.c1879

I feel like there have been some recent improvements to USB4/Thunderbolt in the kernel, to really ensure all lanes get used, but I'm struggling to find a reference/link. What kernel was this tested against? If nothing else, it'd be great to poke around in debugfs to make sure all the lanes are getting configured. https://www.phoronix.com/news/Linux-6.13-USB-Changes

OrangeMusic•5mo ago
Can you imagine a Beowulf cluster of these?