frontpage.

Show HN: AISlop, a CLI for catching AI generated code smells

https://github.com/scanaislop/aislop
47•Heavykenny•56m ago•35 comments

Tulip mania: when a single flower was worth more than a house (2025)

https://dutchreview.com/culture/tulip-mania-netherlands/
70•dotcoma•2h ago•62 comments

The UK Government's Low Value Purchase System Is a Waste of Time

https://shkspr.mobi/blog/2026/05/the-uk-governments-low-value-purchase-system-is-a-waste-of-time/
78•ColinWright•2h ago•40 comments

Please Use AI

https://shawnsmucker.substack.com/p/please-use-ai
86•garycomtois•43m ago•12 comments

Claude Opus 4.8

https://www.anthropic.com/news/claude-opus-4-8
1652•craigmart•21h ago•1286 comments

Bricks and Minifigs Stole a Man's $200k Lego Collection

https://mybricklog.com/blog/bricks-minifigs-corporate-stole-old-mans-200000-lego-collection
1152•philips•19h ago•508 comments

Local Git Remotes

https://cblgh.org/posts/local-git-remotes/
28•surprisetalk•1h ago•21 comments

High Density Living, 2000 Years Ago: Inside the Roman Apartment Building

https://commonedge.org/high-density-living-2000-years-ago-inside-the-roman-apartment-building/
25•surprisetalk•2h ago•5 comments

Is This Sustainable?

https://jamiehurst.co.uk/2026-05-24_ai-sustainable
61•ColinEberhardt•4h ago•46 comments

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/
102•NicoConstant•4h ago•51 comments

Cedana (YC S23) Is Hiring

https://www.ycombinator.com/companies/cedana/jobs/d1vYocG-forward-deployed-engineer-ai-hpc
1•neelm•2h ago

Claude Code – Everything You Can Configure That the Docs Don't Tell You

https://buildingbetter.tech/p/i-read-the-claude-code-source-code
243•ankitg12•12h ago•50 comments

Orchestrating AI code review at scale

https://blog.cloudflare.com/ai-code-review/
70•pramodbiligiri•3d ago•22 comments

An Obsessive Focus on UX: Pilot's Pressure-Regulating Kire-Na Highlighter

https://www.core77.com/posts/143832/An-Obsessive-Focus-on-UX-Pilots-Pressure-Regulating-Kire-Na-H...
27•surprisetalk•3d ago•5 comments

I made a million dollar product from my dorm room (2025)

https://nick.winans.io/blog/nice-nano/
492•mattrighetti•18h ago•74 comments

We should be more tired than the model

https://vickiboykis.com/2026/05/28/we-should-be-more-tired-than-the-model/
67•tosh•2h ago•69 comments

Let's compile Quake like it's 1997

https://fabiensanglard.net/compile_like_1997/
113•goranmoomin•11h ago•41 comments

Poll: How often do you check "newest"?

7•ColinWright•2h ago•2 comments

Volkswagen blocks Home Assistant by requiring client assertion

https://github.com/robinostlund/homeassistant-volkswagencarnet/issues/967
291•Kwastie•8h ago•145 comments

Even (very) noisy LLM evaluators are useful for improving AI agents

https://www.tensorzero.com/blog/even-very-noisy-llm-evaluators-are-useful-for-improving-ai-agents/
10•GabrielBianconi•2d ago•0 comments

HeidiSQL – Lightweight MariaDB, MySQL, SQL Server, PostgreSQL and SQLite Manager

https://github.com/HeidiSQL/HeidiSQL
76•peter_d_sherman•11h ago•26 comments

Italians and Dutch share the same gestural instinct for teaching

https://www.mpi.nl/news/italians-and-dutch-share-same-gestural-instinct-teaching
95•vi_sextus_vi•12h ago•41 comments

Ten Basic Clouds

https://www.noaa.gov/jetstream/clouds/ten-basic-clouds
167•nopg•4d ago•44 comments

Is AI causing a repeat of Front end's Lost Decade?

https://mastrojs.github.io/blog/2026-05-23-is-AI-causing-a-repeat-of-frontends-lost-decade/
137•xyzal•3h ago•140 comments

Wterm – Terminal Emulator for the Web

https://wterm.dev/
21•m3h•5h ago•2 comments

Nitpicking the shell history scene in 'Tron: Legacy'

https://www.chiark.greenend.org.uk/~sgtatham/quasiblog/tron-legacy/
289•speckx•19h ago•99 comments

Headway Therapy Patients Forced to Scan Their Faces to Keep Getting Care

https://www.404media.co/headway-therapy-facial-scan-biometric-data-identity-verification/
6•pavel_lishin•18m ago•0 comments

Cars collect a startling amount of data about you

https://www.bbc.com/future/article/20260513-your-car-is-spying-on-you-its-about-to-get-worse
442•1vuio0pswjnm7•11h ago•231 comments

Show HN: Continue? Y/N: A 60-second game about AI agent permission fatigue

https://llmgame.scalex.dev
355•Wirbelwind•1d ago•144 comments

Digital Identity Management in Norway Is a Catastrophe

https://www.uio.no/english/research/research-news/articles/2026/digital-id-management-is-a-catast...
50•giuliomagnifico•3h ago•23 comments

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/
100•NicoConstant•4h ago

Comments

ilaksh•3h ago
Could be amazing, but it's hard to judge if it will really work with say a 27 B model or larger. We can already get pretty good speed with a 2B model.
gaeld•3h ago
Thanks! We explain how it scales to larger models in the last section of the OP blog post.
bcjdjsndon•20m ago
Shame you stopped short of actually benchmarking that scale though, eh?
mungoman2•3h ago
This looks very interesting. Possible to get those rates without exotic hardware.

But I have to say that the comparison is not really fair. Comparison is done with a 2 B model vs frontier models that are likely 100s of times larger. Also taalas with their 15000 tok/s inference are suspiciously missing from the comparison.

We need to see the comparison with this framework and useful models, which at present seems to mean ~30 B.

cyanydeez•3h ago
Likely the small model lets whatever fuzzer they designed to poke the GPUs find optimizations much faster.

They seem to think it scales up because they're shortening the stack.

kirtivr•3h ago
They got 1K tok/s with DeepSeek V4 Pro. That's kinda cool..
gaeld•3h ago
Thanks. To be fair, this number is what we expect to get once we port DeepSeek V4 to our engine on the upcoming generation of GPUs!
gaeld•3h ago
Great points.

We strove to be as fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization while we are on FP16 (to be fair in both directions) and they burn the model directly on the card.

Our tech preview is about the speed (hence the small dense model, it was easier to implement).

The math checks out, though, to allow support for large frontier MoE models at similar speeds:

- At batch size 1, GPT-OSS-120B has 5.1B active parameters; in FP8, it's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).
- DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say ballpark around 3x bigger than 4 GB. So in theory we could reach >1,000 tok/s on it with MI300X/H200 and up to 4k on next generation GPUs.

Check out the math at the end of our blog post:

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
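
The size comparison in this comment is plain arithmetic and can be sketched as follows. This is a rough illustration, not Kog's own calculation: the helper names are made up for this example, the 8 x 4,800 GB/s aggregate bandwidth is an assumption based on the H200's published peak figure, and the ceiling ignores KV-cache reads, activations, and every synchronization cost.

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8.0


def bandwidth_ceiling_tok_s(active_weight_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized batch-size-1 decode ceiling: each active weight is read once per token."""
    return bandwidth_gb_s / active_weight_gb


# Figures quoted in the comment above.
dense_2b_fp16_gb = weight_gb(2.0, 16)       # 2B dense model in FP16
gpt_oss_active_fp8_gb = weight_gb(5.1, 8)   # GPT-OSS-120B active parameters in FP8

# "Mixed FP4/FP8" is bracketed here by its all-FP4 and all-FP8 extremes.
v4_flash_low_gb = weight_gb(13.0, 4)
v4_flash_high_gb = weight_gb(13.0, 8)

# Assumption: 8 GPUs at 4,800 GB/s peak each, with perfect bandwidth scaling.
ASSUMED_AGGREGATE_GB_S = 8 * 4800.0

if __name__ == "__main__":
    rows = [
        ("2B dense, FP16", dense_2b_fp16_gb),
        ("GPT-OSS-120B active, FP8", gpt_oss_active_fp8_gb),
        ("DeepSeek V4 Flash, all FP4", v4_flash_low_gb),
        ("DeepSeek V4 Flash, all FP8", v4_flash_high_gb),
    ]
    for name, gb in rows:
        ceiling = bandwidth_ceiling_tok_s(gb, ASSUMED_AGGREGATE_GB_S)
        print(f"{name:<28} {gb:5.1f} GB  ceiling ~{ceiling:6.0f} tok/s")
```

Under these assumptions the 2B FP16 model has an idealized ceiling of 38,400 / 4 = 9,600 tok/s, so the reported 3k tok/s would sit at roughly a third of that bound. Real systems land below the ceiling for the reasons listed above.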

Imustaskforhelp
LoganDark•3h ago
I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.
gaeld•3h ago
Fair point - this tech preview is about the speed (hence the small dense model, it was easier to implement).

The math checks out though to allow support for large frontier MoE models at similar speeds.

At batch size 1, GPT-OSS-120B has 5.1B active parameters; in FP8, it's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

DeepSeek V4 Flash has 13B in mixed FP4/FP8.

Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...

867-5309•3h ago
> Standard GPUs

> 8× NVIDIA H200

imputation•3h ago
Everyone beholden to a data center or subject to the installation on the corner of your property of course. Keep up with the times... /s
Oras•3h ago
As in not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?
embedding-shape•3h ago
I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.

Edit: I just tried a 4B model on an RTX Pro 6000, getting ~500 tok/s with llama.cpp, not even trying to optimize or change anything, just default settings. I'm sure with vLLM it'd be a lot faster already, still before manually tuning configs. I wouldn't call that card a "Standard GPU" either FWIW, but it makes the claimed performance numbers feel not as exciting, especially given the hardware they were using.

ismailmaj•3h ago
I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.
gaeld•2h ago
Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).

All our work at Kog is about removing these bottlenecks.
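
One way to picture the "glass ceiling" is a toy latency model in which each token costs one pass over the weights plus a fixed overhead that more bandwidth cannot shrink. This is only a sketch under stated assumptions, not Kog's model: `decode_tok_s` is a hypothetical helper, the per-GPU bandwidth is an assumed peak figure, and the 500 µs overhead is an invented number standing in for launches, syncs, and sampling.

```python
def decode_tok_s(weight_gb: float, bandwidth_gb_s: float, overhead_us: float) -> float:
    """Tokens/s when each token costs one pass over the weights plus a fixed overhead."""
    stream_s = weight_gb / bandwidth_gb_s
    return 1.0 / (stream_s + overhead_us * 1e-6)


WEIGHT_GB = 4.0        # 2B parameters in FP16, as discussed in the thread
PER_GPU_GB_S = 4800.0  # assumed peak memory bandwidth per GPU
OVERHEAD_US = 500.0    # hypothetical fixed cost per token, in microseconds

if __name__ == "__main__":
    for gpus in (1, 2, 4, 8, 64):
        rate = decode_tok_s(WEIGHT_GB, gpus * PER_GPU_GB_S, OVERHEAD_US)
        print(f"{gpus:>3} GPUs: {rate:8.0f} tok/s")
    # With infinite bandwidth, only the fixed overhead remains.
    print(f"ceiling: {1e6 / OVERHEAD_US:.0f} tok/s")
```

With these made-up numbers, throughput can never exceed 1 / 500 µs = 2,000 tok/s however many GPUs are added. In this model only shrinking the fixed overhead raises the ceiling.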

kirtivr•3h ago
I can think of real-time video, shader generation, and real-time worldbuilding type problems that could require such a high token throughput.

For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.

Gomotono•3h ago
That sounds a little bit like the 64kb memory is enough, then someone invented electron ;P

But joke aside, I think we don't even know yet what is possible once you hit very high tokens/second numbers, provided your whole ecosystem behind it can handle it.

You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.

You could build and architect a whole stack in parallel.

You could do massive thinking token / chain of thought.

You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.

We could start doing some type of monte-carlo search with this.
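
The "implement it 100x and keep the best" idea is essentially best-of-N selection, which can be sketched as below. Everything here is illustrative: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer stand in for an LLM call and a benchmark run.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Generate n candidates and return the highest-scoring one with its score."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scored = [(score(candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(rng: random.Random) -> str:
    """Stand-in for a model call: returns a random-length 'solution'."""
    return "x" * rng.randint(1, 50)


def toy_score(candidate: str) -> float:
    """Stand-in for a benchmark: candidates closest to length 20 score best."""
    return -abs(len(candidate) - 20)


if __name__ == "__main__":
    for n in (1, 10, 100):
        _, best_score = best_of_n(toy_generate, toy_score, n)
        print(f"n={n:>3}: best score {best_score}")
```

Because the same seed yields the same first candidates, the best score here can only stay level or improve as `n` grows. The cost is `n` generations per answer, which is why this pattern gets more attractive as tokens get cheaper and faster.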

0-bad-sectors•3h ago
When I read "Standard GPUs" in the title I got excited for a second then I read the article itself..
Oras•3h ago
what did you have in mind when you read "Standard GPUs"?
gaeld•3h ago
I guessed you thought about consumer GPUs. We are about standard datacenter GPUs indeed.
deflator•23m ago
What a lot of us on here are salivating for is the ability to run these on prosumer hardware at home. So we tend to jump to the conclusion that "standard" means "consumer-grade" because that's what we want to see. Still, very cool work!
yjftsjthsd-h•1h ago
The GPU in my desktop. (A normal-ish decent gaming machine that runs LLMs and txt2img well enough.)

In contrast, not enterprise GPUs that cost as much as a car.

bcjdjsndon•20m ago
You know, Radeon 9800 pro ago
roosgit•
irishcoffee•3h ago
NVIDIA H200 is not a standard GPU. 8 of them in a box with a CPU and RAM cost close to the same as a house.

I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading.

I can get a 27b model to kick out 40 tok/s on 16 gb vram. This is the area ripe for development.

If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.

gaeld•3h ago
I guessed you thought about consumer GPUs. We are about standard datacenter GPUs indeed.

Sorry for the confusion

embedding-shape•3h ago
Do you think maybe changing your article's title from "Real-time LLM Inference on Standard GPUs" to "Real-time LLM Inference on Standard Datacenter GPUs" might make sense here? Given more people seem confused by the title than not, you could clear this up relatively easily, at least on your website, although it might be too late to fix the HN title.
gaeld•2h ago
YES - I just updated the title of our article according to your suggestion.
irishcoffee•2h ago
Oh, it isn't confusing, it is misleading. A standard GPU lets you connect a monitor. A datacenter GPU lets you do headless math.
gaeld•3h ago
Follow-up reading for the most technical and research-minded people here:

Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...

Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...

To try the speed on the playground: http://playground.kog.ai

CastFX•2h ago
Looks super promising! A couple of questions:

For new open weights models, will you need to adapt model code and optimization for your inference engine by hand?

It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?

Any plans to release it open source?

Congratz again for the release

gaeld•1h ago
Thanks a lot! Much appreciated.

To answer your questions:

- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get these kinds of results.

- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations vs parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.

We plan to support a small amount of batching, but it quickly becomes a trade-off vs speed. Pick one for your use case, I guess.

Also to consider: because we answer requests much faster, we are also able to process lots of them without needing high batches - and scaling on multiple nodes is possible.

- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and for us... we'll see

robmccoll•2h ago
Making these claims on a 2B parameter model seems a bit like seeing linear scalability from 1 to 4 cores and then assuming 256 cores will give you a 256x speedup. Or demonstrating massive improvement on datasets that fit in cache and then assuming the same improvements will be present on problem sizes that span the memory of multiple machines. Something tells me that scaling to larger models will be more difficult than assumed.
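
The core-count analogy can be made concrete with Amdahl's law, which the comment does not name but which is a standard way to show why near-linear scaling at small sizes says little about large ones. This is a sketch: `amdahl_speedup` is a name chosen for this example, and the 99% parallel fraction is a hypothetical workload.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup when only part of the work benefits from more workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    PARALLEL_FRACTION = 0.99  # hypothetical: 99% of the work parallelizes perfectly
    for workers in (1, 4, 16, 64, 256):
        speedup = amdahl_speedup(PARALLEL_FRACTION, workers)
        print(f"{workers:>3} workers: {speedup:6.1f}x")
```

For this hypothetical workload, 4 workers give about 3.9x, which looks almost linear, while 256 workers give only about 72x. A 1% serial share that is invisible at small scale dominates at large scale.
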
gaeld•2h ago
Yeah, I agree: I'm actually not expecting it to be easy, and there will certainly be several unknown unknowns we'll discover along the way.

Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no-one really tested before (I still have nightmares of the L3 cache cross-IOD bottlenecks on MI300X).

IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity. But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.)

bartkappenburg•2h ago
Is this the new gateway to a "Model On a Chip"? Is it possible to etch the weights on silicon and get a very efficient way to use an LLM?
ekianjo•1h ago
Title is pure bait. Where has "Datacenter GPU" gone?
Hfuffzehn•1h ago
That's really nice of them.

That means Jensen can add another 30 times faster when comparing Rubin to Blackwell without having to actually do anything.

Hopefully that means he won't have any problem making another 150 billion in profit in the next year.

Sorry for the sarcasm. Looks like interesting work.

frankensteins•46m ago
I have a naive question here. First, the token speed is very impressive. But why is this the highlight? I would prefer the actual performance.
bcjdjsndon•25m ago
H200 isn't a standard GPU at all
paul-rohan•17m ago
I had to test it myself to believe this unreal inference speed.

each time getting 3300+ tps.

cataflam•3m ago
Congrats gaeld and team

The demo is very impressive!

disclaimer: I've known the founder for a while, as legitimate as it gets in deep tech, real years of research and engineering behind this, not vaporware

Imustaskforhelp•1h ago
Your playground/write-up is very interesting, and I would be really interested when you can have something like the DeepSeek V4 Flash model (49B) running as you are suggesting.

I haven't read the articles yet, and I will hopefully get to them, but I wish to ask a question: can this approach be done for, say, trillion-parameter or larger models as well, or is there some wall that gets hit which makes it valuable only for smaller-parameter models?

That being said, it's still really incredible, because these small models are getting really good for many use cases and speed becomes their bottleneck. With greater speeds on consumer hardware, I think it's gonna be amazing work!

gaeld•56m ago
Thanks for the comment and the question!

The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.

Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.

hirako2000•2h ago
Fallacies look interesting? As if we aren't getting dubious claims every day?
bcjdjsndon•21m ago
That doesn't clarify anything lol. It's a bit click baity.
WithinReason•2h ago
So what would be the above-standard GPUs then that they are excluding? Cerebras is not a GPU.
bcjdjsndon•24m ago
> Did you expect a single GPU chip to reach 3k tps?

Did the article headline not say Standard GPU?

roosgit•3h ago
Yeah, it should have been "Datacenter GPUs" or "Nvidia and AMD GPUs".
gaeld•2h ago
I updated the article title accordingly
bcjdjsndon•18m ago
Standard != Datacentre