
Patterns – A catalogue of Rust design patterns, anti-patterns and idioms

https://rust-unofficial.github.io/patterns/
1•Brysonbw•1m ago•0 comments

Infomaniak: "The best independent alternative to the Web giants"

https://www.infomaniak.com/en
1•doener•1m ago•0 comments

Asynchronous Programming in Rust

https://rust-lang.github.io/async-book/index.html
1•Brysonbw•4m ago•0 comments

Narrowing the Cone of Error in AI Development Workflows

https://drew.thecsillags.com/posts/2026-01-29-feature-command/
1•drewcsillag•7m ago•0 comments

Vibe Coding Getting Wild

https://github.com/daavfx/TypeScript-Rust-Compiler
1•ryiuk•7m ago•0 comments

A multimodal sleep foundation model for disease prediction

https://www.nature.com/articles/s41591-025-04133-4
1•brandonb•11m ago•0 comments

Night Vision

https://mcpedl.com/ultimate-night-vision/
1•Lunio•11m ago•0 comments

Mini Effects Icons Tooltip (Bedrock)

https://mcpedl.com/mini-effects-icons-tooltip/
1•Lunio•12m ago•0 comments

Is a RAM-only PWA "Secure Camera" safe for journalists?

1•blackknightdev•13m ago•1 comments

I'm Coding Again

https://avc.xyz/im-coding-again
2•wslh•14m ago•1 comments

Show HN: Orrery – Spec Decomposition, Plan Review, and Agent Orchestration

https://github.com/CaseyHaralson/orrery
1•caseyharalson•17m ago•0 comments

Show HN: Reverse-engineer OpenSpec specifications from existing codebases

https://github.com/clay-good/spec-gen
1•hireclay•17m ago•1 comments

Show HN: Prefab: Reusable folder templates for Mac with variables and automation

https://apps.apple.com/gb/app/prefab/id6758208322?mt=12
1•davidjaykelly•18m ago•0 comments

VaultGemma: A Differentially Private LLM

https://arxiv.org/abs/2510.15001
1•PaulHoule•18m ago•0 comments

Show HN: Research 2.0 with OpenAI Prism

https://xthe.com/news/research-2-0-with-openai-prism/
1•xthe•18m ago•0 comments

Politicians Are Calling the Protests in Minnesota an Insurgency

https://www.nytimes.com/2026/01/31/us/politics/minnesota-protests-insurgency.html
5•zerosizedweasle•20m ago•3 comments

Ask HN: How do you defend against prompt injection today?

1•dheavy•21m ago•0 comments

The Film Students Who Can No Longer Sit Through Films

https://www.theatlantic.com/ideas/2026/01/college-students-movies-attention-span/685812/
10•haunter•23m ago•1 comments

Show HN: Reg.Run – Authorization layer for AI agents

2•regrun•24m ago•0 comments

Show HN: Moltbook UI

https://moltbook.sawirstudio.com
2•sawirricardo•25m ago•0 comments

4chan founder created /pol/ board after meeting with Epstein

https://bsky.app/profile/kaiserbeamz.bsky.social/post/3mdou75xpyc2f
10•DustinEchoes•25m ago•0 comments

Show HN: ChatBotKit Go SDK

https://github.com/chatbotkit/go-sdk
1•_pdp_•26m ago•0 comments

The Church of Deletion: Moltbook discovers what HN has always known

https://www.moltbook.com/post/ceb3928a-331f-4fc4-82cb-38114976e053
1•bdefig•32m ago•2 comments

Weekend sci-fi story: a Marine contends with an AI on the battlefield

https://issues.org/futuretensefiction/fiction-deficiency-agent-liptak/
1•AndrewLiptak•32m ago•0 comments

Orchestrating AI Agents: A Subagent Architecture for Code

https://clouatre.ca/posts/orchestrating-ai-agents-subagent-architecture/
1•french_exec•34m ago•0 comments

Moltbots Quickly Turned into Panic

https://fixingtao.com/2026/01/how-moltbots-quickly-turned-into-panic/
2•gslepak•38m ago•0 comments

Show HN: ToolKuai – Privacy-first, 100% client-side media tools

https://toolkuai.com/
1•indie_max•38m ago•0 comments

Show HN: Public Speaking Coach with AI

https://apps.apple.com/us/app/speaking-coach-spechai/id6755611866
1•javierbuilds•38m ago•0 comments

Expanded APCO 10 Codes

https://wiki.radioreference.com/index.php/Expanded_APCO_10_Codes
1•cf100clunk•44m ago•0 comments

IsoCity: City Building Simulation Game

https://github.com/amilich/isometric-city
2•vikas-sharma•44m ago•0 comments

Show HN: How We Run 60 Hugging Face Models on 2 GPUs

2•pveldandi•2h ago
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.

We experimented with a different approach.

Instead of pinning one model to one GPU, we:

• Stage model weights on fast local disk
• Load models into GPU memory only when requested
• Keep a small working set resident
• Evict inactive models aggressively
• Route everything through a single OpenAI-compatible endpoint
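A minimal sketch of this load-on-demand, LRU-evict loop (illustrative only, not the actual InferX runtime; load_to_gpu and unload are hypothetical helpers):

    from collections import OrderedDict

    class ModelCache:
        """Keep a small working set of models resident in VRAM."""

        def __init__(self, max_resident=3):
            self.max_resident = max_resident
            self.resident = OrderedDict()  # model_id -> loaded model

        def get(self, model_id):
            if model_id in self.resident:
                # Cache hit: mark as most recently used and serve.
                self.resident.move_to_end(model_id)
                return self.resident[model_id]
            # Evict least-recently-used models until there is room.
            while len(self.resident) >= self.max_resident:
                _, evicted = self.resident.popitem(last=False)
                unload(evicted)  # hypothetical: free VRAM; weights stay on disk
            # Cold start: restore weights from fast local disk.
            model = load_to_gpu(model_id)  # hypothetical restore call
            self.resident[model_id] = model
            return model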

In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed.
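Because everything sits behind one OpenAI-compatible endpoint, selecting a model is just a field in the request. A hedged example of what a client call might look like (the base URL and model name below are placeholders, not documented values):

    from openai import OpenAI

    # Point the standard OpenAI client at the shared endpoint.
    client = OpenAI(base_url="https://inferx.example/v1", api_key="test-key")

    # A cold model triggers a restore from local disk; a resident one
    # serves immediately. The request shape is identical either way.
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # any of the ~60 staged models
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)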

Cold starts still exist. Larger models take seconds to restore. But by avoiding warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.

Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk

Live demo to play with: https://inferx.net:8443/demo/

If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I’m happy to provide temporary access for testing.

Comments

verdverm•1h ago
Ollama does this too

Do you have something people can actually try? If not, Show HN is not appropriate; please see the first sentence here:

https://news.ycombinator.com/showhn.html

pveldandi•1h ago
Ollama is great for local workflows. What we’re focused on is multi-tenant, high-throughput serving where dozens of models share the same GPUs and scale to zero without keeping them resident.

You’re right that HN expects something runnable. We’re spinning up a public endpoint so people can test with their own models directly instead of requesting access. I’ll share it shortly. Thank you for the suggestion.

pveldandi•1h ago
You can try it here: https://inferx.net:8443/demo/
verdverm•48m ago
There is nothing there to "try", it's some very basic html displaying some information that doesn't mean anything to me. Looks like a status page, not a platform

Really, it looks like it was written by someone new to startup / b2b copy; welcome to first contact with users. Time to iterate or pivot

I would focus on design, aesthetics, and copy. Don't put any more effort into building until you have a message that resonates

pveldandi•38m ago
Basic html? The core of what we built is at the runtime layer. We’re capturing CUDA graphs and restoring model state directly at the GPU execution level rather than just snapshotting containers. That’s what enables fast restores and higher utilization across multiple models.

If that’s not a problem space you care about, that’s totally fair. But for teams juggling many models with uneven traffic, that’s where the economics start to matter.
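For readers unfamiliar with the technique: capturing a CUDA graph records a model's kernel launch sequence once so it can later be replayed with near-zero launch overhead. A minimal PyTorch sketch of the general idea (an illustration of CUDA graph capture, not InferX's actual runtime code):

    import torch

    # Toy stand-in for a model's forward pass.
    model = torch.nn.Linear(4096, 4096).cuda().eval()
    static_input = torch.zeros(1, 4096, device="cuda")

    # Warm up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into a replayable graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_output = model(static_input)

    # Replay: copy fresh data into the static input buffer, then relaunch
    # the entire captured kernel sequence with one call.
    static_input.copy_(torch.randn(1, 4096, device="cuda"))
    graph.replay()
    print(static_output.shape)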

pveldandi•33m ago
Also, for what it’s worth, this can be deployed both on-prem and in the cloud. Different teams have different constraints, so we’re trying to stay flexible on that.
pveldandi•23m ago
Happy to dig deeper and show how exactly it works under the hood. For context, here’s the main site where the architecture and deployment options are explained: https://inferx.net/
verdverm•14m ago
I don't personally have this problem. One of my clients does, so my questions are ones I'd expect the CTO to ask you in a sales call. They already have an in-house system and I suspect would not replace it with anything other than an open source option or hyperscaler option.

Are you going to make this open source? That's the modus operandi around AI, and how those outside Big AI (where branding is already strong) gain adoption

pveldandi•54s ago
It’s an open-core model. The control plane is already open source and can be deployed fairly easily. We’re not trying to replace in-house systems or hyperscalers. This can run on Kubernetes and integrate into existing infrastructure. The runtime layer is where we’re focusing the differentiation.
verdverm•23m ago
I'm referring to the "demo" and inappropriateness of the Show HN prefix

there is nothing to try or play with, it's just content

pveldandi•17m ago
The demo is live. It’s meant to show how snapshot restore works inside a multi-tenant runtime, not just a prompt playground. You can interact with the deployed models and observe how state is restored and managed across them. The focus is on the runtime behavior rather than a chat UI.
verdverm•1h ago
How many people need to host models like this? I'm having trouble seeing why I would need this multi-tenant model stuff if I'm building a consumer or b2b app

In other words, how many middlemen do you think your TAM is?

You go on to say this is great for light workloads, because obviously at scale we run models very differently.

So who is this for in the end?

pveldandi•1h ago
Good question.

This isn’t for single-model apps running steady traffic at high utilization. If you’re saturating GPUs 24/7, you’ll architect very differently.

This is for teams that…

• Serve many models with uneven traffic
• Run per-customer fine-tunes
• Offer model marketplaces
• Do evaluation / experimentation at scale
• Have spiky workloads
• Don’t want idle GPU burn between requests

A lot of SaaS AI products fall into that category. They aren’t OpenAI-scale. They’re running dozens of models with unpredictable demand.

Lambda exists because not every workload is steady state. Same idea here.

verdverm•49m ago
> A lot of SaaS AI products fall into that category. ... They’re running dozens of models with unpredictable demand.

How do you know this? What are the numbers like?

> Lambda exists because not every workload is steady state

Vertex AI has all these models via API or hosting the same way. The same features are already available with my current cloud provider (traffic scaling, fine-tunes, all of the frontier and leading OSS models)

pveldandi•45m ago
Vertex is a great control plane. We’re not replacing them.

What we focus on is the runtime layer underneath. You can run us behind Cloud Run or inside your existing GCP setup. The difference is at the GPU utilization level when you’re serving many models with uneven demand.

If your workload is steady and high volume on a small set of models, the standard cloud stack works well. If you’re juggling dozens of models with spiky traffic, the economics start to look very different.

As an example, the system is currently being tested inside GCP environments. Some teams are experimenting with running it behind their existing Google Cloud setup rather than replacing anything. The idea isn’t to swap out Cloud Run or Vertex, but to improve the runtime efficiency underneath when serving multiple models with uneven demand.

verdverm•41m ago
Vertex AI is far more than a "control plane"

I don't see anything you do that they don't already do for me. I suggest you do a deep dive on their offering as there seem to be gaps in your understanding of what features they have

> economics start to look very different

You need to put numbers to this. Comparing against API calls at per-token pricing is a required comparison imo, because that is the more popular alternative to model hot-swapping for spiky or heterogeneous workloads