The LLMs are completely private (we don't log any traffic).
The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
The LLMs are completely private (we don't log any traffic).
The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?
Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.
Split a "it needs to run in a datacenter because its hardware requirements are so large" AI/LLM across multiple people who each want shared access to that particular model.
Sort of like the Real Estate equivalent of subletting, or splitting a larger space into smaller spaces and subletting each one...
Or, like the Web Host equivalent of splitting a single server into multiple virtual machines for shared hosting by multiple other parties, or what-have-you...
I could definitely see marketplaces similar to this, popping up in the future!
It seems like it should make AI cheaper for everyone... that is, "democratize AI"... in a "more/better/faster/cheaper" way than AI has been democratized to date...
Anyway, it's a brilliant idea!
Wishing you a lot of luck with this endeavor!
How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?
not original author but batching is one very important trick to make inference efficient, you can reasonably do tens to low hundreds in parallel (depending on model size and gpu size) with very little performance overhead
TTFT is under 2 seconds average. Worst case is 10-30s.
That's over a 1000 words/s if you were typing. If 1000 words/s is too slow for your use-case, then perhaps $5/m is just not for you.
I kinda like the idea of paying $5/m for unlimited usage at the specified speed.
It beats a 10x higher speed that hits daily restrictions in about 2 hours, and weekly restrictions in 3 days.
I mean my local 122b is only 20t/s so for background stuff it can be used for that. But not for anything interactive IME.
"Running 24x7" is what people want to do with openclaw.
I dig the idea! I'm curious where the costs will land with actual use.
> When you join a cohort, your card is saved but not charged until the cohort fills. Stripe holds your card information — we never store it. Once the cohort fills, you are charged and receive an API key for the duration of the cohort.
Have any cohorts filled yet?
I’m interested in joining one, but only if it’s reasonable to assume that the cohort will be full within the next 7 days or so. (Especially because in a little over a week I’m attending an LLM-centered hackathon where we can either use AWS LLM credits provided by the organizer, or we can use providers of our own choosing, and I’d rather use either yours or my own hardware than AWS.)
I’d be pretty annoyed if I join a cohort and then it takes like 3 months before the cohort has filled and I can begin to use it. By then I will probably have forgotten all about it and not have time to make use of the API key I am paying for.
mmargenot•2h ago