The LLMs are completely private (we don't log any traffic).
The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
The LLMs are completely private (we don't log any traffic).
The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?
Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.
Split a "it needs to run in a datacenter because its hardware requirements are so large" AI/LLM across multiple people who each want shared access to that particular model.
Sort of like the Real Estate equivalent of subletting, or splitting a larger space into smaller spaces and subletting each one...
Or, like the Web Host equivalent of splitting a single server into multiple virtual machines for shared hosting by multiple other parties, or what-have-you...
I could definitely see marketplaces similar to this, popping up in the future!
It seems like it should make AI cheaper for everyone... that is, "democratize AI"... in a "more/better/faster/cheaper" way than AI has been democratized to date...
Anyway, it's a brilliant idea!
Wishing you a lot of luck with this endeavor!
How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?
not original author but batching is one very important trick to make inference efficient, you can reasonably do tens to low hundreds in parallel (depending on model size and gpu size) with very little performance overhead
TTFT is under 2 seconds average. Worst case is 10-30s.
That's over a 1000 words/s if you were typing. If 1000 words/s is too slow for your use-case, then perhaps $5/m is just not for you.
I kinda like the idea of paying $5/m for unlimited usage at the specified speed.
It beats a 10x higher speed that hits daily restrictions in about 2 hours, and weekly restrictions in 3 days.
"Running 24x7" is what people want to do with openclaw.
I dig the idea! I'm curious where the costs will land with actual use.
mmargenot•59m ago