Show HN: sllm – Split a GPU node with other developers, unlimited tokens

https://sllm.cloud
51•jrandolf•3h ago
Running DeepSeek V3 (685B) requires 8×H100 GPUs, which costs about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until the cohort fills. Prices start at $5/mo for smaller models.
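The cohort economics are simple division. The $14k/month node cost is from the post; the cohort sizes below are hypothetical illustrations, not sllm's actual tiers:

```python
# Cost of a dedicated 8xH100 node (figure from the post).
NODE_COST_PER_MONTH = 14_000  # USD

def price_per_member(cohort_size: int) -> float:
    """Monthly price per developer if the node cost is split evenly."""
    return NODE_COST_PER_MONTH / cohort_size

# Hypothetical cohort sizes -- smaller cohorts pay more per seat.
for size in (70, 350, 2_800):
    print(f"{size:>5} members -> ${price_per_member(size):.2f}/mo each")
```

Under this (assumed) even-split model, a $40/mo seat implies a cohort of roughly 350 developers sharing the node.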

The LLMs are completely private (we don't log any traffic).

The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
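"Swap the base URL" looks roughly like this against any OpenAI-compatible endpoint. The base URL path, model name, and key below are placeholders (check sllm's dashboard for the real values); the snippet only builds the request, it doesn't send it:

```python
import json
import urllib.request

# Hypothetical values -- substitute your real base URL, model, and API key.
BASE_URL = "https://sllm.cloud/v1"
API_KEY = "sk-..."

def build_chat_request(prompt: str, model: str = "deepseek-v3"):
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    url = f"{BASE_URL}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Hello!")
print(req.full_url)
```

With the official `openai` client the same swap is just `OpenAI(base_url=BASE_URL, api_key=API_KEY)`; nothing else in your code changes.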

Comments

mmargenot•2h ago
This is a great idea! I saw a similar (inverse) idea the other day for pooling compute (https://github.com/michaelneale/mesh-llm). What are you doing for compute in the backend? Are you locked into a cohort from month to month?
vova_hn2•1h ago
1. Is the given tok/s estimate for the total node throughput, or is it what you can realistically expect to get? Or is it the worst case scenario throughput if everyone starts to use it simultaneously?

2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?

Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.

jrandolf•1h ago
1. It's an average. 2. We have a sophisticated rate limiter.
poly2it•24m ago
Does it take user time zones into account?
jrandolf•15m ago
Yes
esafak•1h ago
Like vast.ai and TensorDock, and presumably others.
spuz•1h ago
It seems crazy to me that the "Join" button does not have a price on it and yet clicking it simply forwards you to a Stripe page again with no price information on it. How am I supposed to know how much I'm about to be charged?
jrandolf•1h ago
That was an error on our part lol. We'll update with the price.
peter_d_sherman•1h ago
What a brilliant idea!

Split a "it needs to run in a datacenter because its hardware requirements are so large" AI/LLM across multiple people who each want shared access to that particular model.

Sort of like the Real Estate equivalent of subletting, or splitting a larger space into smaller spaces and subletting each one...

Or, like the Web Host equivalent of splitting a single server into multiple virtual machines for shared hosting by multiple other parties, or what-have-you...

I could definitely see marketplaces similar to this, popping up in the future!

It seems like it should make AI cheaper for everyone... that is, "democratize AI"... in a "more/better/faster/cheaper" way than AI has been democratized to date...

Anyway, it's a brilliant idea!

Wishing you a lot of luck with this endeavor!

kaoD•1h ago
How is the time sharing handled? I assume if I submit a unit of work it will load to VRAM and then run (sharing time? how many work units can run in parallel?)

How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?
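The "context window in MiB" part of the question can be estimated with the standard KV-cache formula. The dimensions below are for a hypothetical 80-layer model with grouped-query attention, not any specific sllm deployment:

```python
def kv_cache_mib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size for one sequence in MiB: K and V tensors (factor of 2)
    across all layers, fp16 (2 bytes) by default."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total / (1024 ** 2)

# Hypothetical dims: 80 layers, 8 KV heads (GQA), head_dim 128, 32k context.
print(kv_cache_mib(80, 8, 128, 32_768))  # 10240.0 MiB, i.e. 10 GiB per sequence
```

So a single full-context sequence can occupy on the order of gigabytes of VRAM, which is why vLLM's paged KV-cache management matters for multi-tenant serving.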

ninjha•1h ago
> how many work units can run in parallel

Not the original author, but batching is one very important trick for making inference efficient: you can reasonably run tens to low hundreds of requests in parallel (depending on model size and GPU size) with very little performance overhead.

jrandolf•1h ago
vLLM handles GPU scheduling, not sllm. The model weights stay resident in VRAM permanently so there's no loading/unloading per request. vLLM uses continuous batching, so incoming requests are dynamically added to the running batch every decode step and the GPU is always working on multiple requests simultaneously. There is no "load to VRAM and run" per request; it's more like joining an already-running batch.

TTFT is under 2 seconds average. Worst case is 10-30s.
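The "joining an already-running batch" idea can be sketched as a toy scheduler loop (all names here are illustrative; this is a simplification of what vLLM's continuous batching does, not its API):

```python
from collections import deque

def continuous_batching(requests, max_batch: int = 4):
    """Toy continuous-batching scheduler. Each decode step admits waiting
    requests into the running batch and retires any that have produced all
    their tokens. `requests` is a list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}   # request_id -> tokens still to generate
    finished = []
    steps = 0
    while waiting or running:
        # Admit new requests mid-flight -- no waiting for the batch to drain.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step emits one token for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
        steps += 1
    return finished, steps

# Five requests of mixed lengths finish in 3 decode steps instead of 9 serial ones.
done, steps = continuous_batching([("a", 3), ("b", 1), ("c", 2), ("d", 2), ("e", 1)])
```

The key property is that short requests drain out and free batch slots immediately, which is why TTFT stays low even while long generations are in flight.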

spuz•1h ago
Is this not a more restricted version of OpenRouter? With OpenRouter you pay for credits that can be used to run any commercial or open-source model and you only pay for what you use.
jrandolf•1h ago
OpenRouter is a little different. We are trying to experiment with maximizing a single GPU cluster.
singpolyma3•1h ago
25 t/s is barely usable. Maybe for a background runner
lelanthran•1h ago
> 25 t/s is barely usable. Maybe for a background runner

That's over 1,000 words a minute if you were typing. If 1,000 words a minute is too slow for your use case, then perhaps $5/m is just not for you.

I kinda like the idea of paying $5/m for unlimited usage at the specified speed.

It beats a 10x higher speed that hits daily restrictions in about 2 hours, and weekly restrictions in 3 days.

singpolyma3•11m ago
Sure, if it were just a matter of typing. But in practice it means sitting and staring for minutes at a "thinking" indicator with nothing happening until something finally appears.

I mean, my local 122B is only 20 t/s, so it can be used for background stuff. But not for anything interactive IME.

freedomben•1h ago
This is an excellent idea, but I worry about fairness during resource contention. I don't run queries often, but when I do they tend to be big and long-running. I wouldn't want to eat up the whole system when other users need it, but I also want the cluster available when I need it. How do you address a case like this?
jrandolf•1h ago
We implement rate-limiting and queuing to ensure fairness, but if a massive number of people run huge, long queries, then there will be waits. The question is whether people will actually do this; more often than not, users are idle.
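One common way to enforce this kind of fairness is a per-user token bucket: a steady refill rate with a bounded burst, so no single user can monopolize the node. This is a generic sketch, not sllm's actual limiter:

```python
class TokenBucket:
    """Per-user token bucket: `rate` tokens refill per second, bursts are
    capped at `capacity`. Requests whose cost exceeds the current budget
    are rejected (or, in a real server, queued)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=10, capacity=20)  # 10 req/s sustained, bursts of 20
burst = [bucket.allow(now=0.0) for _ in range(25)]  # 20 allowed, then 5 denied
```

A production limiter would budget in model tokens rather than requests and add a fairness queue, but the shape is the same: idle users accumulate headroom, heavy users hit the cap.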
freedomben•1h ago
Is there any way to buy into a pool of people with similar usage patterns? Maybe I'm overthinking it, but just wondering
mogili1•41m ago
A rate limit is essentially a token limit.
petterroea•13m ago
To be fair, this is the price you pay for sharing a GPU. It's probably good for work that doesn't need to be done "now" but that you can launch and run in the background. I bet some graphs showing when the GPU is busiest would be useful as well.
pokstad•4m ago
This problem sounds like an excellent opportunity. We need a race to the bottom for hosting LLMs to democratize the tech and lower costs. I cheer on anyone who figures this out.
varunr89•1h ago
$40/mo for DeepSeek R1 seems steep compared to a Pro sub on OpenAI/Claude unless you run 24x7. I'm not sure how sharing is making this affordable.
lelanthran•1h ago
> $40/mo for DeepSeek R1 seems steep compared to a Pro sub on OpenAI/Claude unless you run 24x7.

"Running 24x7" is what people want to do with openclaw.

Lalabadie•1h ago
This is the most "Prompted ourselves a Shadcn UI" page I've seen in a while lol

I dig the idea! I'm curious where the costs will land with actual use.

jrandolf•1h ago
Thanks lol. I actually like Shadcn's style. It's sad that people view it as AI now.
mogili1•39m ago
Can you show a cost comparison if we went with per-token pricing?
QuantumNomad_•27m ago
> How does billing work?

> When you join a cohort, your card is saved but not charged until the cohort fills. Stripe holds your card information — we never store it. Once the cohort fills, you are charged and receive an API key for the duration of the cohort.

Have any cohorts filled yet?

I’m interested in joining one, but only if it’s reasonable to assume that the cohort will be full within the next 7 days or so. (Especially because in a little over a week I’m attending an LLM-centered hackathon where we can either use AWS LLM credits provided by the organizer, or we can use providers of our own choosing, and I’d rather use either yours or my own hardware than AWS.)

I’d be pretty annoyed if I join a cohort and then it takes like 3 months before the cohort has filled and I can begin to use it. By then I will probably have forgotten all about it and not have time to make use of the API key I am paying for.

p_m_c•7m ago
Do you own the GPUs or are you multiplexing on a 3rd party GPU cloud?

Show HN: A game where you build a GPU

https://jaso1024.com/mvidia/
184•Jaso1024•2h ago•41 comments

Show HN: TurboQuant-WASM – Google's vector quantization in the browser

https://github.com/teamchong/turboquant-wasm
68•teamchong•4h ago•2 comments

Show HN: Running local OpenClaw together with remote agents in an open network

https://github.com/hybroai/hybro-hub
7•kevinlu•1h ago•2 comments

Show HN: Kaoslabs – High-intensity AI video and visual experiments

https://kaoslabs.org
2•wilhart•2h ago•0 comments

Show HN: DocMason – Agent Knowledge Base for local complex office files

https://github.com/jetxu-llm/docmason
4•Jet_Xu•2h ago•0 comments

Show HN: Pluck – Copy any UI from any website, paste it into AI coding tools

https://www.pluck.so/
7•bring-shrubbery•7h ago•10 comments

Show HN: I built a frontpage for personal blogs

https://text.blogosphere.app/
734•ramkarthikk•1d ago•188 comments

Show HN: React hooks that predict text height before render, using font metrics

5•ahmadparizaad•3h ago•0 comments

Show HN: Apfel – The free AI already on your Mac

https://apfel.franzai.com
699•franze•1d ago•144 comments

Show HN: A simple iOS app that helps you give yourself some time

https://apps.apple.com/tr/app/alnuo/id6761344069
4•sezginozgur•4h ago•0 comments

Show HN: I made open source, zero power PCB hackathon badges

https://github.com/KaiPereira/Overglade-Badges
4•kaipereira•4h ago•0 comments

Show HN: Semsei — AI SEO for clicks, not impressions

https://www.semsei.io/en
6•andresdvelez•4h ago•3 comments

Show HN: Tokencap – Token budget enforcement across your AI agents

https://github.com/pykul/tokencap
4•pykul•4h ago•0 comments

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI

https://github.com/borski/travel-hacking-toolkit
81•borski•16h ago•36 comments

Show HN: ctx – an Agentic Development Environment (ADE)

https://ctx.rs
46•luca-ctx•1d ago•51 comments

Show HN: Ownscribe – local meeting transcription, summarization and search

https://github.com/paberr/ownscribe
3•paberr•6h ago•0 comments

Show HN: AdaShape-3D modeler for intuitive 3D printing parts / Windows 11

https://adashape.com
3•fsloth•6h ago•2 comments

Show HN: Ismcpdead.com – Live dashboard tracking MCP adoption and sentiment

https://ismcpdead.com
35•sagirodin•23h ago•21 comments

Show HN: Mtproto.zig – High-performance Telegram proxy with DPI evasion

https://github.com/sleep3r/mtproto.zig
18•slp3r•21h ago•13 comments

Show HN: Tusk for macOS and Gnome

https://shapemachine.xyz/tusk/
3•factorialboy•8h ago•0 comments

Show HN: Made a little Artemis II tracker

https://artemis-ii-tracker.com/
148•codingmoh•1d ago•55 comments

Show HN: Deeplink – Go library for short links, click tracking, and OG previews

https://github.com/yinebebt/deeplink
3•yinebeb_sc•9h ago•2 comments

Show HN: ZipSee – explore remote ZIP archives using HTTP range requests

https://zipsee.pages.dev/
4•vsekar•9h ago•1 comments

Show HN: Dull – Instagram Without Reels, YouTube Without Shorts (iOS)

https://getdull.app
151•kasparnoor•2d ago•117 comments

Show HN: Docking – extensible Linux dock in Python

https://docking.cc
3•edumucelli•10h ago•0 comments

Show HN: DotReader – connects ideas across your books automatically

https://dotreader.info
5•efecerre•18h ago•1 comments

Show HN: Web Push Notifications for Hacker News

https://hn-push.val.run
3•kinlan•11h ago•2 comments

Show HN: GraphReFly – Reactive graph protocol for human and LLM co-operation

https://graphrefly.dev/
5•clfhhc•11h ago•2 comments

Show HN: TinyOS – A minimalist RTOS for Cortex-M written in C

https://github.com/cmc-labo/tinyos-rtos
99•hpscript•21h ago•42 comments