Show HN: sllm – Split a GPU node with other developers, unlimited tokens

59•jrandolf•4h ago

Running DeepSeek V3 (685B) requires 8×H100 GPUs which is about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until the cohort fills. Prices start at $5/mo for smaller models.

The LLMs are completely private (we don't log any traffic).

The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.

Comments

mmargenot•2h ago

This is a great idea! I saw a similar (inverse) idea the other day for pooling compute (https://github.com/michaelneale/mesh-llm). What are you doing for compute in the backend? Are you locked into a cohort from month to month?

vova_hn2•2h ago

1. Is the given tok/s estimate for the total node throughput, or is it what you can realistically expect to get? Or is it the worst case scenario throughput if everyone starts to use it simultaneously?

2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?

Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.

jrandolf•2h ago

1. It's an average. 2. We have sophisticated rate limiter.

poly2it•1h ago

Does it take user time zones into account?

jrandolf•1h ago

Yes

esafak•2h ago

Like vast.ai and TensorDock, and presumably others.

spuz•2h ago

It seems crazy to me that the "Join" button does not have a price on it and yet clicking it simply forwards you to a Stripe page again with no price information on it. How am I supposed to know how much I'm about to be charged?

jrandolf•2h ago

That was an error on our part lol. We'll update with the price.

peter_d_sherman•2h ago

What a brilliant idea!

Split a "it needs to run in a datacenter because its hardware requirements are so large" AI/LLM across multiple people who each want shared access to that particular model.

Sort of like the Real Estate equivalent of subletting, or splitting a larger space into smaller spaces and subletting each one...

Or, like the Web Host equivalent of splitting a single server into multiple virtual machines for shared hosting by multiple other parties, or what-have-you...

I could definitely see marketplaces similar to this, popping up in the future!

It seems like it should make AI cheaper for everyone... that is, "democratize AI"... in a "more/better/faster/cheaper" way than AI has been democratized to date...

Anyway, it's a brilliant idea!

Wishing you a lot of luck with this endeavor!

kaoD•2h ago

How is the time sharing handled? I assume if I submit a unit of work it will load to VRAM and then run (sharing time? how many work units can run in parallel?)

How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?

ninjha•2h ago

> how many work units can run in parallel

not original author but batching is one very important trick to make inference efficient, you can reasonably do tens to low hundreds in parallel (depending on model size and gpu size) with very little performance overhead

jrandolf•2h ago

vLLM handles GPU scheduling, not sllm. The model weights stay resident in VRAM permanently so there's no loading/unloading per request. vLLM uses continuous batching, so incoming requests are dynamically added to the running batch every decode step and the GPU is always working on multiple requests simultaneously. There is no "load to VRAM and run" per request; it's more like joining an already-running batch.

TTFT is under 2 seconds average. Worst case is 10-30s.

kaoD•31m ago

> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about context buffers, which I assume are not small in large models. That has to be loaded into VRAM, right?

If I keep sending large context buffers, will that hog the batches?

kgeist•3m ago

>If I keep sending large context buffers, will that hog the batches?

Technically it should, if your large context (= KV cache) fills the entirety of VRAM, I don't see how vLLM would be able to process other requests without more VRAM.

spuz•2h ago

Is this not a more restricted version of OpenRouter? With OpenRouter you pay for credits that can be used to run any commercial or open-source model and you only pay for what you use.

jrandolf•2h ago

OpenRouter is a little different. We are trying to experiment with maximizing a single GPU cluster.

singpolyma3•2h ago

25 t/s is barely usable. Maybe for a background runner

lelanthran•2h ago

> 25 t/s is barely usable. Maybe for a background runner

That's over a 1000 words/s if you were typing. If 1000 words/s is too slow for your use-case, then perhaps $5/m is just not for you.

I kinda like the idea of paying $5/m for unlimited usage at the specified speed.

It beats a 10x higher speed that hits daily restrictions in about 2 hours, and weekly restrictions in 3 days.

singpolyma3•59m ago

Sure if it was just a matter of typing. But in practise it means sitting and staring for minutes at nothing happening with a "thinking" until something finally happens.

I mean my local 122b is only 20t/s so for background stuff it can be used for that. But not for anything interactive IME.

freedomben•2h ago

This is an excellent idea, but I worry about fairness during resource contention. I don't often need queries, but when I do it's often big and long. I wouldn't want to eat up the whole system when other users need it, but I also would want to have the cluster when I need it. How do you address a case like this?

jrandolf•2h ago

We implement rate-limiting and queuing to ensure fairness, but if there are a massive amount of people with huge and long queries, then there will be waits. The question is whether people will do this and more often than not users will be idle.

freedomben•1h ago

Is there any way to buy into a pool of people with similar usage patterns? Maybe I'm overthinking it, but just wondering

ssl-3•12m ago

I think it'd be best to pool with people with different patterns, not the same patterns. Perhaps it would be best to pool with people in different timezones, and/or with different work/sleep schedules.

If everyone in a pool uses it during the ~same periods and sleeps during the ~same periods, then the node would oscillate between contention and idle -- every day. This seems largely avoidable.

(Or, darker: Maybe the contention/idle dichotomy is a feature, not a bug. After all, when one has control of $14k/month of hardware that is sitting idle reliably-enough for significant periods every day, then one becomes incentivized to devise a way to sell that idle time for other purposes.)

mogili1•1h ago

Rate limit essentially is a token limit

ibejoeb•19m ago

It depends on how it's implemented. If it's a fixed window, then your absolute ceiling is tokens/windows in a month. If it's a function of other usage, like a timeshare, you're still paying for some price for a month and you get what you get without paying more per token. There's an intrinsic limit based on how many tokens the model can process on that gpu in a month anyway, even if it's only you.

petterroea•1h ago

To be fair this is the price you pay for sharing a GPU. Probably good for stuff that doesn't need to be done "now" but that you can just launch and run in the background. I bet some graphs that show when the gpu is most busy could be useful as well

pokstad•51m ago

This problem sounds like an excellent opportunity. We need a race to the bottom for hosting LLMs to democratize the tech and lower costs. I cheer on anyone who figures this out.

varunr89•2h ago

$40/mo for deepseek r1 seems steep compared to a pro sub on open ai /claude unless you run 24x7. im not sure how sharing is making this affirdable.

lelanthran•2h ago

> $40/mo for deepseek r1 seems steep compared to a pro sub on open ai /claude unless you run 24x7.

"Running 24x7" is what people want to do with openclaw.

Lalabadie•2h ago

This is the most "Prompted ourselves a Shadcn UI" page I've seen in a while lol

I dig the idea! I'm curious where the costs will land with actual use.

jrandolf•2h ago

Thanks lol. I actually like Shadcn's style. It's sad that people view it as AI now.

mogili1•1h ago

Can you show a comparison of cost of we went per token pricing.

QuantumNomad_•1h ago

> How does billing work?

> When you join a cohort, your card is saved but not charged until the cohort fills. Stripe holds your card information — we never store it. Once the cohort fills, you are charged and receive an API key for the duration of the cohort.

Have any cohorts filled yet?

I’m interested in joining one, but only if it’s reasonable to assume that the cohort will be full within the next 7 days or so. (Especially because in a little over a week I’m attending an LLM-centered hackathon where we can either use AWS LLM credits provided by the organizer, or we can use providers of our own choosing, and I’d rather use either yours or my own hardware running vLLM than the LLM offerings and APIs from AWS.)

I’d be pretty annoyed if I join a cohort and then it takes like 3 months before the cohort has filled and I can begin to use it. By then I will probably have forgotten all about it and not have time to make use of the API key I am paying you for.

p_m_c•54m ago

Do you own the GPUs or are you multiplexing on a 3rd party GPU cloud?

RIMR•45m ago

I read the FAQ, and I can't imagine this is going to work the way you want it to. It fundamentally doesn't make sense as a business model.

I can sign up for a cohort today, but there's not even a hint of how long it will take the cohort to fill up. The most subscribed cohort is only at 42% (and dropping), so maybe days to weeks? That's a long time to wait if you have a use case to satisfy.

And then the cohort expires, and I have to sign up for another one and play the waiting game again? Nobody wants that level of unreliability.

Also, don't say "15-25 tok/s". That is a min-max figure, but your FAQ says that this is actually a maximum. It makes no sense to measure a maximum as a range, and you state no minimum so I can only assume that it is 0 tok/s. If all users in the cohort use it simultaneously, the best they're getting is something like 1.5 tok/s (probably less), which is abyssmal.

You mention "optimization", but I have no idea what that means. It certainly doesn't mean imposing token limits, because your FAQ says that won't happen. If more than 25 users are using the cohort simultaneously, it is a physical impossibility to improve performance to the levels you advertise without sacrificing something else, like switching to a smaller model, which would essentially be fraud, or adding more GPUs which will bankrupt you at these margins. With 465 users per cohort, a large chunk of whom will be using tools like OpenClaw, nobody will ever see the performance you are offering.

The issue here is you are trying to offer affordable AI GPU nodes without operating at a loss. The entire AI industry is operating at a loss right now because of how expensive this all is. This strategy literally won't work right now unless you start courting VCs to invest tens to hundreds of millions of dollars so you can get this off the ground by operating at a loss until hopefully you turn a profit at some point in the future, but at that point developers will probably be able to run these models at home without your help.

tensor-fusion•33m ago

Interesting direction. One adjacent pattern we've been working on is a bit less about partitioning a shared node for more tokens, and more about letting developers keep a local workflow while attaching to an existing remote GPU via a share link / CLI / VS Code path. In labs and small teams we've found the pain is often not just allocation, but getting access into the everyday workflow without moving code + environment into a full remote VM flow. Curious whether your users mostly want higher GPU utilization, or whether they also want workflow portability from laptops and homelabs. I'm involved with GPUGo / TensorFusion, so that's the lens I'm looking through.

scottcha•30m ago

Pretty cool idea, but whats the stack behind this? As 15-25 tok/s seems a bit low as expected SoA for most providers is around 60 tok/s and quality of life dramatically improves above that.

IanCal•14m ago

Can you explain the benefits over something like openrouter?

jrandolf•6m ago

24/7 LLM for $10/month.

Show HN: A game where you build a GPU

Simple self-distillation improves code generation

Sopwith

Show HN: TurboQuant-WASM – Google's vector quantization in the browser

Apple approves driver that lets Nvidia eGPUs work with Arm Macs

Show HN: sllm – Split a GPU node with other developers, unlimited tokens

Scientists observe an immune signaling complex forming inside cells

Author of "Careless People" banned from saying anything negative about Meta

Tell HN: Anthropic no longer allowing Claude Code subscriptions to use OpenClaw

Some Unusual Trees

Artemis II crew take “spectacular” image of Earth

Plague Ships

Components of a Coding Agent

Training mRNA Language Models Across 25 Species for $165

Emotion concepts and their function in a large language model

The CMS is dead, long live the CMS

Electrical Transformer Manufacturing Is Throttling the Electrified Future

The Indie Internet Index – submit your favorite sites

The Cathedral, the Bazaar, and the Winchester Mystery House

Claude Code Found a Linux Vulnerability Hidden for 23 Years

Mbodi AI (YC P25) Is Hiring

Why the most valuable things you know are things you cannot say

Notes from from Butterick's Practical Typography

A new gene therapy is giving people born deaf the chance to hear

When legal sports betting surges, so do Americans' financial problems

IBM 3270 Information Display System: Color and Programmed Symbols (1979) [pdf]

German men 18-45 need military permit for extended stays abroad

The most-disliked people in the publishing industry

OpenClaw privilege escalation vulnerability

iNaturalist

Show HN: sllm – Split a GPU node with other developers, unlimited tokens

Comments

Show HN: A game where you build a GPU

Simple self-distillation improves code generation

Sopwith

Show HN: TurboQuant-WASM – Google's vector quantization in the browser

Apple approves driver that lets Nvidia eGPUs work with Arm Macs

Show HN: sllm – Split a GPU node with other developers, unlimited tokens

Scientists observe an immune signaling complex forming inside cells

Author of "Careless People" banned from saying anything negative about Meta

Tell HN: Anthropic no longer allowing Claude Code subscriptions to use OpenClaw

Some Unusual Trees

Artemis II crew take “spectacular” image of Earth

Plague Ships

Components of a Coding Agent

Training mRNA Language Models Across 25 Species for $165

Emotion concepts and their function in a large language model

The CMS is dead, long live the CMS

Electrical Transformer Manufacturing Is Throttling the Electrified Future

The Indie Internet Index – submit your favorite sites

The Cathedral, the Bazaar, and the Winchester Mystery House

Claude Code Found a Linux Vulnerability Hidden for 23 Years

Mbodi AI (YC P25) Is Hiring

Why the most valuable things you know are things you cannot say

Notes from from Butterick's Practical Typography

A new gene therapy is giving people born deaf the chance to hear

When legal sports betting surges, so do Americans' financial problems

IBM 3270 Information Display System: Color and Programmed Symbols (1979) [pdf]

German men 18-45 need military permit for extended stays abroad

The most-disliked people in the publishing industry

OpenClaw privilege escalation vulnerability

iNaturalist