Today we are launching Exosphere Flex Inference APIs. The premise: inference APIs should adapt to your constraints, not the other way around.
Usually, when you need to run inference at scale, you are forced into rigid boxes:
1. "Real-time" APIs (Expensive, optimized for <1s latency, prone to 429s).
2. "Batch" APIs (Cheaper, but often force 24-hour windows and rigid file formats).
3. "Self-hosted" (Total control, but high ops overhead).
We built a flexible inference engine that sits in the middle. You define the constraints (SLA/time, cost, and quality), and the system handles the execution.
Here is how it works under the hood:
1. Flexible SLAs (The "Time" Constraint): Instead of just "now" or "tomorrow," you pass an `sla` parameter (e.g., 60 minutes, 4 hours). Our scheduler bins these requests to optimize GPU saturation across our provider mesh. You trade strict immediacy for up to ~70% lower cost.
2. Reliability Layer (The "Ops" Constraint): We abstract away the error handling. If a provider throws a 429 or 503, you shouldn't have to write a retry loop with exponential backoff and jitter. Our infrastructure absorbs these failures and retries internally. We guarantee the request eventually succeeds (within your SLA), or we don't charge you.
3. Built-in Quality Gates (The "Accuracy" Constraint): This is the feature I’m most excited about. You can define an "eval" config in the request (using LLM-as-a-Judge or Python scripts). If the output doesn't meet your criteria, our system automatically feeds the failure back into the model and retries it. This moves the "validation loop" from your client code into the infrastructure. (The sketches right after this list show what a request could look like, and the retry boilerplate it replaces.)
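To make the three constraints concrete, here is a minimal sketch of a Flex request. Treat the endpoint URL, the model id, and every field name other than `sla` as illustrative placeholders rather than the exact wire format; the docs linked below have the real interface.

```python
import requests

# Illustrative only: the endpoint and every field name except `sla` are
# placeholders, not the exact wire format. See the docs linked below.
FLEX_ENDPOINT = "https://api.exosphere.example/v1/flex-inference"

payload = {
    "model": "llama-3.1-70b",  # placeholder model id
    "input": "Summarize this incident report in exactly 5 bullet points: ...",
    "sla": "4h",  # time constraint: complete any time within the next 4 hours
    "eval": {  # quality gate, checked on our side before you ever see the output
        "type": "llm_judge",  # or a Python script
        "criteria": "Exactly 5 bullets, each under 25 words, no invented dates.",
        "max_retries": 3,  # failed evals are fed back to the model and retried
    },
}

# Note what is missing: no retry loop, no backoff/jitter handling, no 429 branch.
# Upstream 429s/503s are absorbed by the reliability layer; the request either
# completes within the SLA or you are not charged.
response = requests.post(FLEX_ENDPOINT, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. a job id to poll or a webhook acknowledgement
```

The shift is that time, reliability, and quality become declarative fields on the request instead of imperative loops in a worker.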
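For contrast, this is the kind of per-provider reliability boilerplate that typically lives in every caller today: generic exponential backoff with jitter, nothing Exosphere-specific. With Flex, this loop runs inside the platform instead of in your codebase.

```python
import random
import time

import requests


def call_provider_with_retries(url: str, payload: dict, max_attempts: int = 6) -> dict:
    """Plain client-side retry loop: exponential backoff with full jitter."""
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code not in (429, 500, 502, 503, 504):
            resp.raise_for_status()  # surface non-retryable errors (e.g. 400/401)
            return resp.json()
        if attempt == max_attempts:
            break
        # Full jitter: sleep a random amount up to 2^attempt seconds, capped at 60s.
        time.sleep(random.uniform(0, min(60, 2 ** attempt)))
    raise RuntimeError(f"provider still failing after {max_attempts} attempts")
```

Multiply that by every provider and model you route to, and it adds up to a meaningful amount of ops code to own.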
I’d love to hear your thoughts on this approach—specifically, does moving the "retry/eval" loop into the API layer simplify your backend, or do you prefer keeping that logic client-side?
Playground: https://models.exosphere.host/
More Details: https://exosphere.host/flex-inference