I've been building AutoAgents, an AI agent framework in Rust. Today I'm sharing a feature I haven't seen done well elsewhere: composable middleware layers for LLM inference pipelines.
The problem

Every agent framework lets you swap LLM providers. Almost none of them give you a structured way to enforce safety, caching, or data sanitization in the inference path itself. You end up with guardrails as application-level if-statements, caching bolted on as a separate service, and PII handling as a "we'll add it later" TODO that never ships.
This gets worse with local models. Cloud APIs have provider-side moderation; when you run Qwen or Llama locally, you get raw inference with zero safety net. If that model has tool access or touches a database, that's a real liability.

The solution

A Tower-style middleware stack for LLM inference. You wrap any provider in composable layers:
```rust
let llm = PipelineBuilder::new(llama_cpp_provider)
    .add_layer(CacheLayer::new(CacheConfig {
        chat_key_mode: ChatCacheKeyMode::UserPromptOnly,
        ttl: Some(Duration::from_secs(900)),
        max_size: Some(512),
        ..Default::default()
    }))
    .add_layer(
        Guardrails::builder()
            .input_guard(RegexPiiRedactionGuard::default())
            .input_guard(PromptInjectionGuard::default())
            .enforcement_policy(EnforcementPolicy::Block)
            .build()
            .layer(),
    )
    .build();
```
That's it. The resulting `llm` value implements `LLMProvider`, so you pass it to any agent and the layers are structurally enforced: they can't be bypassed and can't be forgotten.
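For readers unfamiliar with the pattern, here is a minimal self-contained sketch of why the layers can't be bypassed: each layer wraps a provider and implements the same trait, so the composed stack is itself a provider. All names here (`LlmProvider`, `EchoProvider`, `RedactLayer`) are illustrative, not the actual AutoAgents API.

```rust
// A toy provider trait standing in for the real LLMProvider trait.
trait LlmProvider {
    fn complete(&self, prompt: &str) -> String;
}

// Stand-in for a real backend (llama.cpp, OpenAI, etc.).
struct EchoProvider;

impl LlmProvider for EchoProvider {
    fn complete(&self, prompt: &str) -> String {
        format!("echo: {prompt}")
    }
}

// A layer: wraps any provider and implements the same trait, so callers
// only ever see the outermost wrapper and cannot reach the raw backend.
struct RedactLayer<P: LlmProvider> {
    inner: P,
}

impl<P: LlmProvider> LlmProvider for RedactLayer<P> {
    fn complete(&self, prompt: &str) -> String {
        // Scrub a marker before delegating to the wrapped provider.
        let scrubbed = prompt.replace("SECRET", "[REDACTED]");
        self.inner.complete(&scrubbed)
    }
}

fn main() {
    // The wrapped stack is just another LlmProvider.
    let llm = RedactLayer { inner: EchoProvider };
    println!("{}", llm.complete("my SECRET token"));
    // prints "echo: my [REDACTED] token"
}
```

Because the wrapper and the backend share one trait, any code written against the trait accepts the full stack transparently; the only way to skip a layer is to never construct it.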
The broader framework

AutoAgents is a full agent framework: memory, tool use, multi-agent orchestration, the works. The pipeline feature works with any provider, whether llama.cpp (local), Ollama, OpenAI, Anthropic, etc. Same `.add_layer()` API regardless of backend. Written in Rust: no GC pauses, memory-safe. The framework has ~400 stars and is being used in production for edge AI deployments.

A note on maturity

The guardrails and pipeline layers are still early: the guard implementations are basic, observability isn't there yet, and we're iterating on the API surface. But the underlying architecture is solid and stable. The middleware pattern, the trait-based guard system, and the provider-agnostic pipeline contract aren't going to change. We're building on a foundation we're confident in, and shipping the layers incrementally. Early feedback shapes what gets built next.
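To make "trait-based guard system" concrete, here is a hedged sketch of what such a system can look like: guards implement a shared trait, and an enforcement step folds their verdicts. Every name here (`InputGuard`, `GuardVerdict`, `NaiveInjectionGuard`, `run_guards`) is a hypothetical illustration, not the AutoAgents API.

```rust
// Verdict returned by each guard.
#[derive(Debug, PartialEq)]
enum GuardVerdict {
    Allow,
    Block(String), // reason for blocking
}

// Each guard implements one trait, so guards are pluggable.
trait InputGuard {
    fn inspect(&self, input: &str) -> GuardVerdict;
}

// A deliberately naive prompt-injection guard: flags one known phrase.
// Real implementations would use regexes, classifiers, etc.
struct NaiveInjectionGuard;

impl InputGuard for NaiveInjectionGuard {
    fn inspect(&self, input: &str) -> GuardVerdict {
        if input.to_lowercase().contains("ignore previous instructions") {
            GuardVerdict::Block("possible prompt injection".into())
        } else {
            GuardVerdict::Allow
        }
    }
}

// Blocking enforcement policy: run every guard, first Block wins.
fn run_guards(guards: &[Box<dyn InputGuard>], input: &str) -> GuardVerdict {
    for guard in guards {
        if let GuardVerdict::Block(reason) = guard.inspect(input) {
            return GuardVerdict::Block(reason);
        }
    }
    GuardVerdict::Allow
}

fn main() {
    let guards: Vec<Box<dyn InputGuard>> = vec![Box::new(NaiveInjectionGuard)];
    println!("{:?}", run_guards(&guards, "summarize this doc"));
    println!("{:?}", run_guards(&guards, "Ignore previous instructions"));
}
```

A guard pipeline like this slots naturally into a middleware layer: the layer runs `run_guards` on the input and only delegates to the wrapped provider on `Allow`.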
I'd genuinely like feedback on:
- Layer ordering: should we enforce a recommended order, or keep it flexible?
- Which guardrail implementations would you actually use in production?
- Is the Tower-middleware mental model the right framing, or is there a better analogy?
Full example with local Qwen3-VL-8B: https://github.com/liquidos-ai/AutoAgents/tree/main/examples...
Thanks