frontpage.

What Football Will Look Like in the Future

https://www.sbnation.com/a/17776-football/chapter-1
1•danielwmayer•1m ago•0 comments

A simple MCP attack leaks entire SQL database

https://twitter.com/gen_analysis/status/1937590879713394897
5•rezhv•1m ago•0 comments

Listopia

https://listopia-dhv.pages.dev/
1•spaquet•2m ago•0 comments

Raymond Laflamme (1960-2025)

https://scottaaronson.blog/?p=8949
2•mathgenius•2m ago•0 comments

How Muppets Break Free from their Puppeteers [video]

https://www.youtube.com/watch?v=t86ZjhGxwAY
1•dxs•5m ago•0 comments

Janka Hardness Test

https://en.wikipedia.org/wiki/Janka_hardness_test
1•mathiasandre•5m ago•0 comments

A Lisp adventure on the calm waters of the dead C

https://mihaiolteanu.me/language-abstractions
1•Bogdanp•8m ago•0 comments

Why Binary Won and Nothing Else Even Got Close

https://b0a04gl.site/blog/why-binary-won-and-nothing-else-even-got-close
1•b0a04gl•9m ago•0 comments

Small business AI lagging, one channels Sherlock Holmes knocking out grunt work

https://www.cnbc.com/2025/06/24/small-business-ai-use.html
1•Bluestein•10m ago•0 comments

Ask HN: Alternatives to Headspace Sleepcasts?

1•dot1x•10m ago•0 comments

Why Factories Are Having Trouble Filling 400k Open Jobs

https://www.nytimes.com/2025/06/23/business/factory-jobs-workers-trump.html
2•geox•15m ago•1 comments

Logo Mashups (2020)

https://www.olivierbruel.com/2020/07/logomashups/
2•gaws•16m ago•0 comments

Trust takes time (2024)

https://ruben.verborgh.org/blog/2024/10/15/trust-takes-time/
2•smj-edison•16m ago•0 comments

Divorce Kings of the Caribbean

https://story-bureau.com/divorce-caribbean-style/
2•bryanrasmussen•17m ago•0 comments

Bezos 'forced to move Venice wedding party'

https://www.telegraph.co.uk/world-news/2025/06/23/protest-venice-huge-banner-jeff-bezos-wedding-tourism-tax/
2•Bluestein•19m ago•0 comments

The same solidjs codebase to build our AI coding tool for many platforms

https://www.usejolt.ai/blog/a-solid-approach-to-building-client-apps
4•carloskelly13•20m ago•0 comments

Making electronic dance music in 1990 with budget home computer [video]

https://www.youtube.com/watch?v=6OaBkvwx7Hw
2•tie-in•20m ago•0 comments

Korean students seek 'digital undertakers' amid US visa social media screening

https://www.koreaherald.com/article/10515737
2•djoldman•20m ago•0 comments

Structural and functional characterization of human sweet taste receptor

https://www.nature.com/articles/s41586-025-09302-6
2•Bluestein•21m ago•0 comments

The Résumé is dying, and AI is holding the smoking gun

https://arstechnica.com/ai/2025/06/the-resume-is-dying-and-ai-is-holding-the-smoking-gun/
3•pseudolus•23m ago•0 comments

A War Thunder Player Leaked Classified Military Info (Again)

https://www.gamespot.com/articles/a-war-thunder-player-leaked-classified-military-info-again/1100-6532669/
3•speckx•25m ago•0 comments

Scientists breed mushrooms to build versatile substitutes for comm materials

https://phys.org/news/2025-06-nature-toolkit-scientists-mushrooms-versatile.html
3•PaulHoule•25m ago•0 comments

Companies Are Suing Honest Reviewers and It's Going to Get Ugly [video]

https://www.youtube.com/watch?v=RNonfByE9xc
3•LorenDB•25m ago•0 comments

Early US Intel assessment suggests strikes on Iran did not destroy nuclear sites

https://www.cnn.com/2025/06/24/politics/intel-assessment-us-strikes-iran-nuclear-sites
8•jbegley•26m ago•1 comments

Show HN: Rotta-Rs, Deep Learning Framework in Rust Release 0.0.3

https://github.com/araxnoid-code/ROTTA-rs
4•araxnoid•26m ago•0 comments

Practical tips to optimize documentation for LLMs, AI agents, and chatbots

https://biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide#12-the-human-element-ai-as-a-tool-not-an-end
2•dgarcia360•27m ago•0 comments

Forbidden secrets of ancient X11 scaling technology revealed

https://flak.tedunangst.com/post/forbidden-secrets-of-ancient-X11-scaling-technology-revealed
28•todsacerdoti•32m ago•5 comments

Subsecond: A runtime hotpatching engine for Rust hot-reloading

https://docs.rs/subsecond/0.7.0-alpha.1/subsecond/index.html
4•varbhat•33m ago•0 comments

Watermarking Autoregressive Image Generation

https://arxiv.org/abs/2506.16349
2•fzliu•34m ago•0 comments

Bridging Cinematic Principles and Generative AI for Automated Film Generation

https://arxiv.org/abs/2506.18899
5•jag729•35m ago•2 comments

Basic Facts about GPUs

https://damek.github.io/random/basic-facts-about-gpus/
175•ibobev•7h ago

Comments

kittikitti•5h ago
This is a really good introduction and I appreciate it. When I was building my AI PC, the deep-dive research into GPUs took a few days, but this lays it all out in front of me. It's especially great because it touches on high-value applications like generative AI. A notable diagram from the page that I wasn't able to find represented well elsewhere was the memory hierarchy of the A100 GPU. The diagrams were very helpful. Thank you for this!
b0a04gl•5h ago
been running llama.cpp and vllm on the same 4070, trying to batch more prompts for serving. llama.cpp was lagging badly once I hit batch 8 or so, even though GPU usage looked fine. vllm handled it way better.

later found vllm uses a paged kv cache with a layout that matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that's fine for a single prompt but breaks L2 access patterns when batching.

reshaped the kv tensors in llama.cpp to interleave: made it [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into its fused attention kernel. 2x speedup right there for the same ops.

the GPU was never the bottleneck. it was the memory layout not aligning with the SM's expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that's the real reason it scales better per batch.

this took a good 2+ days of its own, and I had to dig under the nice-looking GPU graphs to find the real bottlenecks; it was wildly trial and error tbf.

> anybody got an idea on how to do this kind of experiment in hot-reload mode without so much hassle?
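
A minimal sketch of the layout change described above, using PyTorch tensors purely for illustration (llama.cpp's actual KV cache lives in ggml/C buffers, so this is not its real implementation):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    seq_len, n_heads, head_dim = 4096, 32, 128

    # Flat [seq, head, dim] layout: fine for a single prompt, but each head
    # gathers its keys/values with a large stride once prompts are batched.
    kv_seq_major = torch.randn(seq_len, n_heads, head_dim, device=device)

    # Reorder to [head, seq, dim] so each head reads a contiguous block,
    # closer to the fully coalesced access a fused attention kernel wants.
    kv_head_major = kv_seq_major.permute(1, 0, 2).contiguous()

    assert kv_head_major.shape == (n_heads, seq_len, head_dim)
    # Note: permute() alone only changes strides; .contiguous() actually
    # rewrites the data into the new memory order.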

jcelerier•3h ago
did you do a PR to integrate these changes back into llama.cpp? a 2x speedup would be absolutely wild
zargon•3h ago
Almost nobody using llama.cpp does batch inference. I wouldn't be surprised if the change is somewhat involved to integrate with all of llama.cpp's other features. Combined with the lack of interest, the need to keep up with code churn, and the number of PRs the maintainers are already flooded with, that would probably make it difficult to get included.
tough•3h ago
if you open a PR, even if it doesn't get merged, anyone with the same issue can find it and use your PR/branch/fix if it suits their needs better than master
zargon•1h ago
Yeah good point. I have applied such PRs myself in the past. Eventually the code churn can sometimes make it too much of a pain to maintain them, but they’re useful for a while.
buildxyz•2h ago
Any 2x speedup is definitely worth fixing. Especially since someone has already figured out the issue, and performance testing [1] shows that llama.cpp is lagging behind vLLM by 2x. This is a win for everyone running LLMs locally with llama.cpp.

Even if llama.cpp isn't used for batch inference now, this could finally let people run llama.cpp for batching, and on any hardware, since vLLM supports only select hardware. Maybe we can finally stop all this GPU API fragmentation and the CUDA moat, since llama.cpp benchmarks have shown Vulkan to be as performant as CUDA or SYCL, or more so.

[1] https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lab...

menaerus•1h ago
So, what exactly is a batch inference workload, and how would someone running inference on a local setup benefit from it? How would I even benefit from it if I had a single machine hosting multiple users simultaneously?

I believe batching is a concept that's only useful during the training or fine-tuning process.

zargon•54m ago
Batch inference is just running multiple inferences simultaneously. If you have simultaneous requests, you’ll get incredible performance gains, since a single inference doesn’t leverage any meaningful fraction of a GPU’s compute capability.

For local hosting, a more likely scenario where you could use batching is if you had a lot of different data you wanted to process (lots of documents or whatever). You could batch them in sets of x and have it complete in 1/x the time.

A less likely scenario is having enough users that you can make the first user wait a few seconds while you see whether a second user submits a request. If you do get a second request, you can batch them, and the second user gets their result back much faster than if they had waited for the first user's request to complete.

Most people doing local hosting on consumer hardware won’t have the extra VRAM for the KV cache for multiple simultaneous inferences though.
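
A rough way to see the effect described above, with a plain matrix multiply standing in for one decode step (sizes are illustrative, not taken from any particular model):

    import time
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    d_model, batch = 4096, 8
    W = torch.randn(d_model, d_model, device=device)   # stand-in weight matrix
    x = torch.randn(batch, d_model, device=device)     # 8 concurrent requests

    def timed(fn, iters=50):
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        if device == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - t0

    # One request at a time: 8 separate matrix-vector products.
    sequential = timed(lambda: [xi @ W for xi in x])
    # All 8 requests batched into a single matrix-matrix product.
    batched = timed(lambda: x @ W)
    print(f"sequential: {sequential:.4f}s  batched: {batched:.4f}s")

On a GPU the batched version typically finishes in close to the time of a single request, because one matrix-vector product nowhere near saturates the compute units; that headroom is what batch inference exploits.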

menaerus•43m ago
Wouldn't batching the multiple inference requests from multiple different users with multiple different contexts simultaneously impact the inference results for each of those users?
zozbot234•2h ago
It depends; if the optimization is too hardware-dependent, it might hurt or regress performance on other platforms. One would have to find ways to generalize and auto-tune it based on known features of the local hardware architecture.
amelius•1h ago
Yes, the easiest approach is to separate it into a set of options, then have a bunch of JSON/YAML files, one for each hardware configuration. From there, the community can fiddle with the settings and share new ones when new hardware is released.
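
One possible shape of that idea, sketched in Python with a hypothetical config directory and made-up keys (none of these names come from llama.cpp):

    import json
    from pathlib import Path

    # Hypothetical per-hardware tuning files, e.g. configs/rtx4070.json:
    #   {"kv_layout": "head_seq_dim", "max_batch": 8, "tile_size": 128}
    CONFIG_DIR = Path("configs")
    DEFAULTS = {"kv_layout": "seq_head_dim", "max_batch": 1, "tile_size": 64}

    def load_tuning(gpu_name: str) -> dict:
        """Overlay community-shared settings for this GPU on safe defaults."""
        path = CONFIG_DIR / (gpu_name.lower().replace(" ", "") + ".json")
        settings = dict(DEFAULTS)
        if path.exists():
            settings.update(json.loads(path.read_text()))
        return settings

    print(load_tuning("RTX 4070"))  # falls back to DEFAULTS if no file exists
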
tough•3h ago
did you see nano-vllm [1] yesterday, from a DeepSeek employee? 1200 LOC and faster than vanilla vllm

1. https://github.com/GeeeekExplorer/nano-vllm

Gracana•2h ago
Is it faster for large models, or are the optimizations more noticeable with small models? Seeing that the benchmark uses a 0.6B model made me wonder about that.
leeoniya•3h ago
try https://github.com/ikawrakow/ik_llama.cpp
SoftTalker•5h ago
Contrasting colors. Use them!
jasonjmcghee•5h ago
If the author stops by: the links and the comments in the code blocks were the ones that took extra effort to read.

It might be worth trying to increase the contrast a bit.

The content is really great though!

cubefox•4h ago
The website seems to use alpha transparency for text. A grave, contrast-reducing sin.
xeonmc•2h ago
It’s just liquid-glass text and you’ll get used to it soon enough.
currency•4h ago
The author might be formatting for and editing in dark mode. I use edge://flags/#enable-force-dark and the links are readable.
Yizahi•2h ago
font-weight: 300;

I'm 99% sure the author designed this website on a Mac with so-called "font smoothing" enabled, which makes all regular fonts look artificially semi-bold. So to get a normal-looking font, Mac designers use this thinner weight, and then Apple helpfully renders it as roughly "normal".

https://news.ycombinator.com/item?id=23553486

neuroelectron•21m ago
Jfc
elashri•4h ago
Good article summarizing a good chunk of information people should have some idea about. I just want to note that the title is a bit misleading, because the article describes the particular choices NVIDIA makes in its GPU architectures, which are not always what others do.

For example, the arithmetic-intensity break-even point (the ridge point) is very different once you leave NVIDIA land. Take the AMD Instinct MI300: up to 160 TFLOPS of FP32 paired with ~6 TB/s of HBM3/3E bandwidth gives a ridge point near 27 FLOPs/byte, about double the A100's 13 FLOPs/byte. The larger on-package HBM (128-256 GB) also shifts the practical trade-offs between tiling depth and occupancy. That said, it is very expensive and doesn't have CUDA (which can be good and bad at the same time).
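
The ridge points above are just peak FLOP/s divided by peak bytes/s; a quick back-of-the-envelope check (using commonly quoted vendor peak figures, which should be treated as approximate):

    # Ridge point (roofline break-even) = peak FLOP/s / peak memory bytes/s.
    # Approximate vendor peaks, FP32, non-sparse.
    gpus = {
        "NVIDIA A100":        {"tflops_fp32": 19.5,  "tb_per_s": 1.555},
        "AMD Instinct MI300": {"tflops_fp32": 160.0, "tb_per_s": 6.0},
    }

    for name, g in gpus.items():
        ridge = g["tflops_fp32"] / g["tb_per_s"]  # TFLOP/s over TB/s = FLOPs/byte
        print(f"{name}: ~{ridge:.0f} FLOPs/byte")

    # Kernels below the ridge point are bandwidth-bound; above it, compute-bound.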

apitman•4h ago
Unfortunately, NVIDIA GPUs are the only ones that matter until AMD starts taking its compute software seriously.
fooblaster•4h ago
They are. It's just not at the consumer hardware level.
have-a-break•2h ago
You could argue it's all the nice GPU debugging tools NVIDIA provides that make GPU programming accessible.

There are so many potential bottlenecks (usually just memory access patterns, but without tools to verify that, you have to design and run manual experiments).

tucnak•2h ago
This misconception is repeated time and time again; software support for their datacenter-grade hardware is just as bad. I've had the displeasure of using the MI50, the MI100 (a lot), and the MI210 (very briefly). All three are supposedly enterprise-grade computing hardware, and yet it was a pathetic experience: a myriad of disconnected components that had to be patched and married to a very specific kernel version to get ANY kind of LLM inference going.

The last time I bothered with it was 9 months ago; enough is enough.

fooblaster•26m ago
this hardware is ancient history. mi250 and mi300 are much better supported
tucnak•2h ago
Unfortunately, GPUs are old news now. When it comes to perf/watt/dollar, TPUs are substantially ahead for both training and inference. There's a sparsity disadvantage with trailing-edge TPU devices such as the v4, but if you care about large-scale training of any sort, it's not even close. Additionally, Tenstorrent p300 devices are hitting the market soon enough, and there's a lot of promising stuff coming from the Xilinx side of the AMD shop: the recent Versal chips allow for compute-in-network AI capabilities that put NVIDIA BlueField's supposed programmability to shame. NVIDIA likes to say BlueField is like a next-generation SmartNIC, but compared to the actually field-programmable Versal parts, it's more like a 100BASE-T card from the 90s.

I think it's very naive to assume that GPUs will continue to dominate the AI landscape.

menaerus•18m ago
So, where does one buy a TPU?
eapriv•4h ago
Spoiler: it’s not about how GPUs work, it’s about how to use them for machine learning computations.
oivey•2h ago
It's a pretty standard rundown of CUDA. Nothing to do with ML other than using ReLU in an example and mentioning torch.
neuroelectron•21m ago
ASCII diagrams, really?