Show HN: Sipp – Run small local LLMs in browser 3x faster

https://www.sipp.sh
3•jjhartmann•1h ago
Hi HN! Sipp is an open-source AI inference library for running local models in browsers with up to 3x faster decode speeds than alternative libraries.

My background is in HCI (human-computer interaction) and graphics programming. My co-founder and I have been experimenting with, and thinking a lot about, what the next user experience will look like when tokens are commodified to the point of being essentially “free.” One motivation for us was to move beyond the chat app and information retrieval use cases that are dominant now, and figure out how AI could instead act as a continuous and silent hand that helps the user indirectly, subtly monitoring their intent and dynamically generating or shifting the UI to meet their needs.

In our explorations, we ran into two pain points: 1) when running AI in the browser, performance wasn’t good enough for real-time applications, and model loading and caching were issues; 2) when trying to run locally on desktop, there weren’t any good solutions for embedding a model for use through platforms like Electron or Tauri.

So a few months ago, we started contributing to the llama.cpp project to explore browser inference. At the time, it didn’t have great WebGPU support. We helped get the backend working across most modern GPUs (16-bit floating point support is required) with fairly good ops coverage. At the same time, we started building Sipp from the ground up in Rust and C++ as a unified library for using local and cloud inference together in real applications.
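
For readers who want to check the 16-bit floating point requirement on their own machine: in WebGPU, half-precision shader support is exposed as the optional `shader-f16` adapter feature. The sketch below is not taken from Sipp or llama.cpp. `AdapterLike` and `supportsF16` are hypothetical names, and it assumes the requirement mentioned above corresponds to this feature.

```typescript
// Minimal structural type so the sketch compiles without WebGPU typings.
// A real GPUAdapter exposes `features` as a set-like object with `has()`.
interface AdapterLike {
  features: { has(name: string): boolean };
}

// Hypothetical helper (not part of Sipp or llama.cpp): reports whether an
// adapter advertises WGSL half-precision support.
function supportsF16(adapter: AdapterLike | null): boolean {
  // requestAdapter() resolves to null when no suitable GPU is available.
  return adapter !== null && adapter.features.has("shader-f16");
}

// In a browser you would obtain the adapter like this:
//   const adapter = await navigator.gpu.requestAdapter();
//   console.log("f16 available:", supportsF16(adapter));
```

Taking the adapter as a parameter keeps the check easy to run outside a browser with a stand-in object.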

The result: Sipp achieves up to a 3x speedup in overall token decoding compared to alternative libraries. But we believe there is more room for improvement here. Our goal is to investigate more bespoke inference pipelines per model architecture, which will allow us to optimize further for both compute and memory constraints.

Sipp uses a single, unified client API. A large concern for us was how to create a library that effectively bridges both local and cloud inference in a simple way. We wanted developers to start by running a small local model in the browser via WebGPU (or via other backends), and then scale the exact same code path to a self-hosted gateway (CUDA/Vulkan/Metal) or a cloud provider just by changing the endpoint. This enables the benefits of running locally for small tasks while letting you offload to a provider or cloud for tasks that require more intelligence.
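
To make the "same code path, different endpoint" idea concrete, here is a minimal sketch of the pattern. Every name in it (`createClient`, `Backends`, `suggestTitle`, the `"local"` endpoint string, the example URL) is hypothetical and chosen for illustration only; it does not show Sipp's actual client API.

```typescript
// Every name below is hypothetical and only illustrates the pattern;
// it is not Sipp's client API.
type Generate = (prompt: string) => Promise<string>;

interface Backends {
  local: Generate;                   // e.g. an in-browser WebGPU model
  remote: (url: string) => Generate; // e.g. a self-hosted gateway or cloud provider
}

interface Client {
  generate: Generate;
}

// The endpoint string is the only thing that decides where inference runs.
function createClient(endpoint: string, backends: Backends): Client {
  const generate =
    endpoint === "local" ? backends.local : backends.remote(endpoint);
  return { generate };
}

// Application code depends only on the Client interface.
async function suggestTitle(client: Client, text: string): Promise<string> {
  return client.generate(`Suggest a short title for: ${text}`);
}

// Stub backends so the sketch runs without a model or a network.
const stubBackends: Backends = {
  local: async (prompt) => `[local] ${prompt}`,
  remote: (url) => async (prompt) => `[${url}] ${prompt}`,
};
```

With this shape, `suggestTitle(createClient("local", stubBackends), text)` and `suggestTitle(createClient("https://gateway.example", stubBackends), text)` run the same application code; only the endpoint differs.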

So far, our primary focus has been optimizing the browser experience, but we are actively working on additional backends for running an LLM locally via our client API, which currently works across Node, Rust, and Python.

What we are most excited about next is pushing our existing backends even further. As anyone working with a high-performance system knows, raw algorithms and efficient matmul are only half the battle. A significant portion of bottlenecks in real-time systems comes from inefficient memory management and VRAM<>RAM transfer costs.

We believe there is a massive opportunity to extract even more performance through aggressive kernel fusion. By creating “bespoke” kernels tailored to specific model architectures, we can drastically minimize intermediate memory copies. Our next goal is to see exactly how close we can push local inference to the theoretically “perfect” decode and prefill speeds for consumer hardware.
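
As a rough illustration of what a “perfect” decode speed could mean, a common back-of-the-envelope ceiling assumes single-stream decode is limited by memory bandwidth and that every weight is read once per generated token. The sketch below encodes that assumption. The function names and the example numbers are illustrative, not measurements from Sipp, and the estimate ignores KV-cache reads, activations, and compute time.

```typescript
// Rough upper bound on single-stream decode speed, assuming decode is limited
// by memory bandwidth and every weight is read once per generated token.
// Ignores KV-cache reads, activations, and compute time.
function decodeCeilingTokensPerSec(
  paramCount: number,
  bitsPerWeight: number,
  bandwidthGBps: number,
): number {
  const weightBytes = (paramCount * bitsPerWeight) / 8;
  return (bandwidthGBps * 1e9) / weightBytes;
}

// Fraction of that ceiling a measured decode rate reaches.
function fractionOfCeiling(
  measuredTokPerSec: number,
  ceilingTokPerSec: number,
): number {
  return measuredTokPerSec / ceilingTokPerSec;
}

// Illustrative numbers only (not measurements): a 1B-parameter model at
// 4 bits per weight on hardware with 100 GB/s of memory bandwidth.
const ceiling = decodeCeilingTokensPerSec(1e9, 4, 100); // 200 tokens/s
const fraction = fractionOfCeiling(50, ceiling);        // 0.25
```

Prefill is typically compute-bound rather than bandwidth-bound, so it would need a different estimate based on available FLOPs.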

We have a live, simple chat demo running entirely on-device on our site, as well as a benchmarking tool if you want to test or verify the performance differences on your own hardware.

We’d love for you to tear into the code, run the benchmarks, and tell us where we can improve. I’ll be here all day to answer any questions about the architecture, kernel fusion, or HCI in the age of local LLMs.

More in-depth technical info is here: https://dev.to/constant_chen_/sipp-a-local-first-runtime-for...

Comments

jjhartmann•1h ago
For testing and experimentation, check out:

Benchmarking Tool: https://benchmark.sipp.sh/

Chat App: https://chat.sipp.sh/
