
Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference

https://ionrouter.io
24•vshah1016•2h ago
Hey HN — I’m Veer and my cofounder is Suryaa. We're building Cumulus Labs (YC W26), and we're releasing our latest product, IonRouter (https://ionrouter.io/), an inference API for open-source and fine-tuned models. You swap in our base URL, keep your existing OpenAI client code, and get access to any model (open source or fine-tuned for you) running on our own inference engine.
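To make the swap concrete, here's a minimal sketch of pointing OpenAI-style client code at a different base URL. The `/v1` path and the model slug below are assumptions for illustration, not confirmed endpoints; check the site for the actual values.

```python
# Illustrative sketch of an OpenAI-compatible request against a swapped
# base URL. The /v1 path and model name are assumptions, not confirmed.
import json
import urllib.request

IONROUTER_BASE_URL = "https://ionrouter.io/v1"  # assumed OpenAI-compatible base

def chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but don't send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{IONROUTER_BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("YOUR_API_KEY", "gpt-oss-120b", "Hello!")
# With the official OpenAI client, the swap really is one line:
#   client = OpenAI(base_url="https://ionrouter.io/v1", api_key="...")
```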

The problem we kept running into: every inference provider is either fast-but-expensive (Together, Fireworks — you pay for always-on GPUs) or cheap-but-DIY (Modal, RunPod — you configure vLLM yourself and deal with slow cold starts). Neither felt right for teams that just want to ship.

Suryaa spent years building GPU orchestration infrastructure at TensorDock and production systems at Palantir. I led ML infrastructure and Linux kernel development for Space Force and NASA contracts where the stack had to actually work under pressure. When we started building AI products ourselves, we kept hitting the same wall: GPU infrastructure was either too expensive or too much work.

So we built IonAttention — a C++ inference runtime designed specifically around the GH200's memory architecture. Most inference stacks treat GH200 as a compatibility target (make sure vLLM runs, use CPU memory as overflow). We took a different approach and built around what makes the hardware actually interesting: a 900 GB/s coherent CPU-GPU link, 452GB of LPDDR5X sitting right next to the accelerator, and 72 ARM cores you can actually use.

Three things came out of that work that we think are novel: (1) using hardware cache coherence to make CUDA graphs behave as if they have dynamic parameters at zero per-step cost — something that only works on GH200-class hardware; (2) eager KV block writeback driven by immutability rather than memory pressure, which drops eviction stalls from 10ms+ to under 0.25ms; (3) phantom-tile attention scheduling at small batch sizes that cuts attention time by over 60% in the worst-affected regimes. We wrote up the details at cumulus.blog/ionattention.
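To give a flavor of (2), here's a simplified toy model (not our actual runtime, which is C++, and the names here are made up): once a KV block fills, it is immutable, so it can be written back to host memory immediately; eviction later becomes a free instead of a synchronous copy on the critical path.

```python
# Toy model of eager, immutability-driven KV writeback. Illustrative only;
# class and function names are hypothetical, not IonAttention internals.

class KVBlock:
    def __init__(self, block_id: int, capacity: int):
        self.block_id = block_id
        self.capacity = capacity
        self.tokens: list = []
        self.in_host = False  # already written back to host memory?

    def append(self, kv_entry, host_store: dict):
        self.tokens.append(kv_entry)
        # Once full, the block never changes again (immutable), so write it
        # back to host memory now, off the critical path, instead of waiting
        # for memory pressure to force a synchronous copy at eviction time.
        if len(self.tokens) == self.capacity:
            host_store[self.block_id] = list(self.tokens)
            self.in_host = True

def evict(block: KVBlock, host_store: dict) -> bool:
    """Return True if eviction was free (block already resident in host)."""
    if block.in_host:
        return True  # fast path: just release the device copy
    host_store[block.block_id] = list(block.tokens)  # stall: synchronous copy
    block.in_host = True
    return False

host = {}
b = KVBlock(block_id=0, capacity=4)
for t in range(4):
    b.append(("k", "v", t), host)
free = evict(b, host)  # block was written back eagerly, so eviction is free
```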

On multimodal pipelines we get better performance than big players (588 tok/s vs. Together AI's 298 on the same VLM workload). We're honest that p50 latency is currently worse (~1.46s vs. 0.74s) — that's the tradeoff we're actively working on.

Pricing is per million tokens, with no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. Full model list and pricing at https://ionrouter.io.
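To make that concrete, here's a quick per-request cost calculation, assuming the listed prices are USD per million tokens (as the full pricing page shows):

```python
# Cost sketch assuming listed prices are USD per million tokens.
PRICES = {  # model: (input $/Mtok, output $/Mtok)
    "GPT-OSS-120B": (0.02, 0.095),
    "Qwen3.5-122B": (0.20, 1.60),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one request under per-million-token pricing."""
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# e.g. an 8k-token prompt with a 1k-token completion on GPT-OSS-120B
# costs a fraction of a cent:
cost = request_cost("GPT-OSS-120B", 8_000, 1_000)
```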

You can try the playground at https://ionrouter.io/playground right now, no signup required, or drop your API key in and swap the base URL — it's one line. We built IonRouter so teams can see what our engine can do and, when they're ready, host their fine-tuned models on the same infrastructure.

We're curious what you think, especially if you're running fine-tuned or custom models — that's the use case we've invested the most in. What's broken, and what would make this actually useful for you?

Comments

GodelNumbering•1h ago
As an inference hungry human, I am obviously hooked. Quick feedback:

1. The models/pricing page should be linked from the top perhaps as that is the most interesting part to most users. You have mentioned some impressive numbers (e.g. GLM5~220 tok/s $1.20 in · $3.50 out) but those are way down in the page and many would miss it

2. When looking for inference, I always look at 3 things: which models are supported, at which quantization and what is the cached input pricing (this is way more important than headline pricing for agentic loops). You have the info about the first on the site but not 2 and 3. Would definitely like to know!
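To illustrate why cached-input pricing matters more than headline pricing for agentic loops: each turn resends the whole growing history, so cumulative input tokens grow roughly quadratically with turn count. A back-of-the-envelope sketch (the 10x cache discount is a hypothetical figure, not any provider's actual rate):

```python
# Why cached-input pricing dominates in agentic loops: each turn resends the
# whole history, so input tokens grow ~quadratically with the number of turns.
# The 10x cached discount here is hypothetical, for illustration only.

def agent_loop_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn resends all prior context."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

def cost(total_tokens: int, price_per_mtok: float) -> float:
    return total_tokens * price_per_mtok / 1_000_000

turns, per_turn = 50, 2_000
total = agent_loop_input_tokens(turns, per_turn)  # 2,550,000 tokens
uncached = cost(total, 0.20)                      # headline input price

# If everything but the newest 2k tokens per turn hits a prompt cache at a
# hypothetical 10x discount, the bill shrinks dramatically:
fresh = turns * per_turn
cached = cost(fresh, 0.20) + cost(total - fresh, 0.02)
```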

2uryaa•7m ago
Thank you for the feedback! I think we will definitely redo the info on the frontpage to reorg and show quantizations better. For reference, Kimi, GLM, Minimax are NVFP4. The rest are FP8. But I will make this more obvious on the site itself.
Oras•1h ago
The problem is well articulated, and it's a nice story for both cofounders.

One thing I don’t get is why would anyone use a direct service that does the same thing as others when there are services such as openrouter where you can use the same model from different providers? I would understand if your landing page mentioned fine-tuning only and custom models, but just listing same open source models, tps and pricing wouldn’t tell me how you’re different from other providers.

I remember using banana.dev a few years ago and it was a very clear proposition at the time (serverless GPU with fast cold starts).

I suppose positioning will take multiple iterations before you land on the right one. Good luck!

2uryaa•8m ago
Hey Oras, thank you for the feedback! I think we definitely could list on OpenRouter but, as you point out, our end goal is to host fine-tuned models for individuals. The IonRouter product is mostly to showcase our engine. In the backend, we are multiplexing fine-tuned and open-source models on a homogeneous fleet of GPUs. So if you feel no performance difference on our cloud, we're already proving what we set out to show.

I do think we will lean harder into hosting fine-tuned models, though; this is a good insight.

reactordev•1h ago
“Pricing is per token, no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. Full model list and pricing at https://ionrouter.io.”

Man you had me panicking there for a second. Per token?!? Turns out, it’s per million according to their site.

Cool concept. I used to run a Fortune 500’s cloud and GPU instances hot and ready were the biggest ask. We weren’t ready for that, cost wise, so we would only spin them up when absolutely necessary.

nylonstrung•1h ago
Unless I misunderstood, it seems like this is trailing the Pareto frontier in cost and speed.

Compare to providers like Fireworks: even with the OpenRouter 5% charge, it's not competitive.

erichocean•51m ago
> what would make this actually useful for you?

A privacy policy that's at least as good as Google's Vertex AI.

Otherwise it's a non-starter at any price.

Oras•43m ago
What's unique about Vertex's privacy policy?
cmrdporcupine•46m ago
Very cool. I see that "Deploy your finetunes, custom LoRAs, or any open-source model on our fleet" is behind "Book a call" -- any sense of what pricing will actually look like here? This seems like where your approach wins out: the ability to swap in a custom model more easily and cheaply.

Just curious how close we are to a world where I can fine tune for my (low volume calls) domain and then get it hosted. Right now this is not practical anywhere I've seen, at the volumes I would be doing it at (which are really hobby level).

Malus – Clean Room as a Service

https://malus.sh
848•microflash•7h ago•330 comments

Bubble Sorted Amen Break

https://parametricavocado.itch.io/amen-sorting
180•eieio•3h ago•64 comments

Reversing memory loss via gut-brain communication

https://med.stanford.edu/news/all-news/2026/03/gut-brain-cognitive-decline.html
142•mustaphah•4h ago•37 comments

ATMs didn't kill bank teller jobs, but the iPhone did

https://davidoks.blog/p/why-the-atm-didnt-kill-bank-teller
230•colinprince•6h ago•280 comments

An old photo of a large BBS (2022)

https://rachelbythebay.com/w/2022/01/26/swcbbs/
108•xbryanx•1h ago•64 comments

The Met Releases High-Def 3D Scans of 140 Famous Art Objects

https://www.openculture.com/2026/03/the-met-releases-high-definition-3d-scans-of-140-famous-art-o...
152•coloneltcb•5h ago•31 comments

Runners Are Discovering It's Surprisingly Easy to Churn Butter on Their Runs

https://www.runnersworld.com/news/a70683169/how-to-make-butter-while-running/
31•randycupertino•1h ago•10 comments

Show HN: OneCLI – Vault for AI Agents in Rust

https://github.com/onecli/onecli
86•guyb3•4h ago•34 comments

Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference

https://ionrouter.io
24•vshah1016•2h ago•9 comments

Show HN: Understudy – Teach a desktop agent by demonstrating a task once

https://github.com/understudy-ai/understudy
56•bayes-song•4h ago•15 comments

Bringing Chrome to ARM64 Linux Devices

https://blog.chromium.org/2026/03/bringing-chrome-to-arm64-linux-devices.html
10•ingve•54m ago•9 comments

AI error jails innocent grandmother for months in North Dakota fraud case

https://www.grandforksherald.com/news/north-dakota/ai-error-jails-innocent-grandmother-for-months...
8•rectang•12m ago•1 comments

WolfIP: Lightweight TCP/IP stack with no dynamic memory allocations

https://github.com/wolfssl/wolfip
61•789c789c789c•5h ago•6 comments

Converge (YC S23) Is Hiring a Founding Platform Engineer (NYC, Onsite)

https://www.runconverge.com/careers/founding-platform-engineer
1•thomashlvt•4h ago

Dolphin Progress Release 2603

https://dolphin-emu.org/blog/2026/03/12/dolphin-progress-report-release-2603/
267•BitPirate•11h ago•44 comments

Big data on the cheapest MacBook

https://duckdb.org/2026/03/11/big-data-on-the-cheapest-macbook
253•bcye•9h ago•233 comments

Show HN: Axe – A 12MB binary that replaces your AI framework

https://github.com/jrswab/axe
108•jrswab•7h ago•71 comments

US private credit defaults hit record 9.2% in 2025, Fitch says

https://www.marketscreener.com/news/us-private-credit-defaults-hit-record-9-2-in-2025-fitch-says-...
140•JumpCrisscross•8h ago•287 comments

Are LLM merge rates not getting better?

https://entropicthoughts.com/no-swe-bench-improvement
81•4diii•9h ago•86 comments

The Road Not Taken: A World Where IPv4 Evolved

https://owl.billpg.com/ipv4x/
31•billpg•5h ago•53 comments

Full Spectrum and Infrared Photography

https://timstr.website/blog/fullspectrumphotography.html
36•alter_igel•4d ago•12 comments

The Cost of Indirection in Rust

https://blog.sebastiansastre.co/posts/cost-of-indirection-in-rust/
63•sebastianconcpt•3d ago•30 comments

Show HN: Rudel – Claude Code Session Analytics

https://github.com/obsessiondb/rudel
118•keks0r•7h ago•72 comments

NASA's DART spacecraft changed an asteroid's orbit around the sun

https://www.sciencenews.org/article/spacecraft-changed-asteroid-orbit-nasa
83•pseudolus•3d ago•45 comments

Kotlin creator's new language: talk to LLMs in specs, not English

https://codespeak.dev/
255•souvlakee•6h ago•214 comments

Italian prosecutors seek trial for Amazon, 4 execs in alleged $1.4B tax evasion

https://www.reuters.com/world/italian-prosecutors-seek-trial-amazon-four-execs-over-alleged-14-bl...
208•amarcheschi•5h ago•54 comments

DDR4 SDRAM – Initialization, Training and Calibration

https://www.systemverilog.io/design/ddr4-initialization-and-calibration/
39•todsacerdoti•2d ago•8 comments

The Emotional Labor Behind AI Intimacy (2025) [pdf]

https://data-workers.org/wp-content/uploads/2025/12/The-Emotional-Labor-Behind-AI-Intimacy-1.pdf
44•beepbooptheory•4h ago•14 comments

Claude now creates interactive charts, diagrams and visualizations

https://claude.com/blog/claude-builds-visuals
143•adocomplete•5h ago•87 comments

Apple's MacBook Neo makes repairs easier and cheaper than other MacBooks

https://arstechnica.com/gadgets/2026/03/more-modular-design-makes-macbook-neo-easier-to-fix-than-...
136•GeekyBear•4h ago•78 comments