Show HN: sllm – Split a GPU node with other developers, unlimited tokens

32•jrandolf•2h ago

Running DeepSeek V3 (685B) requires 8×H100 GPUs which is about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until the cohort fills. Prices start at $5/mo for smaller models.

The LLMs are completely private (we don't log any traffic).

The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.

Comments

mmargenot•59m ago

This is a great idea! I saw a similar (inverse) idea the other day for pooling compute (https://github.com/michaelneale/mesh-llm). What are you doing for compute in the backend? Are you locked into a cohort from month to month?

vova_hn2•54m ago

1. Is the given tok/s estimate for the total node throughput, or is it what you can realistically expect to get? Or is it the worst case scenario throughput if everyone starts to use it simultaneously?

2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?

Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.

jrandolf•43m ago

1. It's an average. 2. We have sophisticated rate limiter.

esafak•53m ago

Like vast.ai and TensorDock, and presumably others.

spuz•51m ago

It seems crazy to me that the "Join" button does not have a price on it and yet clicking it simply forwards you to a Stripe page again with no price information on it. How am I supposed to know how much I'm about to be charged?

jrandolf•46m ago

That was an error on our part lol. We'll update with the price.

peter_d_sherman•50m ago

What a brilliant idea!

Split a "it needs to run in a datacenter because its hardware requirements are so large" AI/LLM across multiple people who each want shared access to that particular model.

Sort of like the Real Estate equivalent of subletting, or splitting a larger space into smaller spaces and subletting each one...

Or, like the Web Host equivalent of splitting a single server into multiple virtual machines for shared hosting by multiple other parties, or what-have-you...

I could definitely see marketplaces similar to this, popping up in the future!

It seems like it should make AI cheaper for everyone... that is, "democratize AI"... in a "more/better/faster/cheaper" way than AI has been democratized to date...

Anyway, it's a brilliant idea!

Wishing you a lot of luck with this endeavor!

kaoD•46m ago

How is the time sharing handled? I assume if I submit a unit of work it will load to VRAM and then run (sharing time? how many work units can run in parallel?)

How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?

ninjha•32m ago

> how many work units can run in parallel

not original author but batching is one very important trick to make inference efficient, you can reasonably do tens to low hundreds in parallel (depending on model size and gpu size) with very little performance overhead

jrandolf•28m ago

vLLM handles GPU scheduling, not sllm. The model weights stay resident in VRAM permanently so there's no loading/unloading per request. vLLM uses continuous batching, so incoming requests are dynamically added to the running batch every decode step and the GPU is always working on multiple requests simultaneously. There is no "load to VRAM and run" per request; it's more like joining an already-running batch.

TTFT is under 2 seconds average. Worst case is 10-30s.

spuz•46m ago

Is this not a more restricted version of OpenRouter? With OpenRouter you pay for credits that can be used to run any commercial or open-source model and you only pay for what you use.

jrandolf•40m ago

OpenRouter is a little different. We are trying to experiment with maximizing a single GPU cluster.

singpolyma3•44m ago

25 t/s is barely usable. Maybe for a background runner

lelanthran•21m ago

> 25 t/s is barely usable. Maybe for a background runner

That's over a 1000 words/s if you were typing. If 1000 words/s is too slow for your use-case, then perhaps $5/m is just not for you.

I kinda like the idea of paying $5/m for unlimited usage at the specified speed.

It beats a 10x higher speed that hits daily restrictions in about 2 hours, and weekly restrictions in 3 days.

freedomben•40m ago

This is an excellent idea, but I worry about fairness during resource contention. I don't often need queries, but when I do it's often big and long. I wouldn't want to eat up the whole system when other users need it, but I also would want to have the cluster when I need it. How do you address a case like this?

jrandolf•19m ago

We implement rate-limiting and queuing to ensure fairness, but if there are a massive amount of people with huge and long queries, then there will be waits. The question is whether people will do this and more often than not users will be idle.

freedomben•7m ago

Is there any way to buy into a pool of people with similar usage patterns? Maybe I'm overthinking it, but just wondering

varunr89•30m ago

$40/mo for deepseek r1 seems steep compared to a pro sub on open ai /claude unless you run 24x7. im not sure how sharing is making this affirdable.

lelanthran•16m ago

> $40/mo for deepseek r1 seems steep compared to a pro sub on open ai /claude unless you run 24x7.

"Running 24x7" is what people want to do with openclaw.

Lalabadie•29m ago

This is the most "Prompted ourselves a Shadcn UI" page I've seen in a while lol

I dig the idea! I'm curious where the costs will land with actual use.

jrandolf•27m ago

Thanks lol. I actually like Shadcn's style. It's sad that people view it as AI now.

Show HN: Running local OpenClaw together with remote agents in an open network

Chat Control: The Technical and Legal Case Against Mass Scanning

Floating point from scratch: Hard Mode

Scientists capture how cells trigger inflammation

Ask HN: Best build in public/regular updates blogs?

Batteries-included terminal UI framework for Go

37,000 AI-generated podcasts on Kaggle

Aspire Docs in Your Terminal (and Your AI's Brain)

Bazaarly – A Thought Exercise

AI Agents to Organise My Secret Society's Dinners

Deafness reversed: One injection restores hearing in just weeks – ScienceDaily

Beyond the Verdict: Holding Big Tech Accountable Isn't as Simple as It Seems

Plague Ships

Mapping AI into Production: A Field Experiment on Firm Performance

Artemis II crew snaps portrait of Earth on their way to the moon

Across the social sciences, half of research doesn't replicate

Polymarket apologizes for allowing wagers on fate of U.S. pilots downed in Iran

Malaysia's age verification rules for social media could be strictest

IBM 3270 Information Display System: Color and Programmed Symbols (1979) [pdf]

Not all of this is new

Artificial Intelligence Will Die – and What Comes After

Astronomers Find a Third Galaxy Missing Its Dark Matter

Token Price Discovery in the AI Diffusion Debate

What does Open Source mean?

Laid Off from Oracle(OCI). Looking for Software Roles (USA)

Iran's Network of Cameras Bolsters Air Defenses, Expert Says

Detecting Defects in Software Systems

Ask HN: Regarding app rejection on 3.1.1 Guidelines

The AI-Native Fork

Sonos Play Review: Performance Meets Convenience