My assumption was that LiteLLM, being a Python library, would have plenty of low-hanging fruit for optimization. I set out to create a Rust layer using PyO3 to accelerate the performance-critical parts: token counting, routing, rate limiting, and connection pooling.
The Approach
- Built Rust implementations for token counting using tiktoken-rs (see the sketch after this list)
- Added lock-free data structures with DashMap for concurrent operations
- Implemented async-friendly rate limiting
- Created monkeypatch shims to replace Python functions transparently
- Added comprehensive feature flags for safe, gradual rollouts
- Developed performance monitoring to track improvements in real-time
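The core idea on the Rust side is a PyO3 extension module that wraps tiktoken-rs and caches the BPE tables across calls. Here is a simplified sketch of that pattern; the module name, function names, and crate versions are illustrative assumptions, not the exact code from the repo:

```rust
// Cargo.toml (assumed): pyo3 = { version = "0.22", features = ["extension-module"] },
// tiktoken-rs = "0.5", once_cell = "1"
use once_cell::sync::Lazy;
use pyo3::prelude::*;
use tiktoken_rs::{cl100k_base, CoreBPE};

// Build the BPE tables once and reuse them; constructing them per call
// would dominate the runtime for short strings.
static BPE: Lazy<CoreBPE> = Lazy::new(|| cl100k_base().expect("failed to load cl100k_base"));

/// Count tokens for a single string (hypothetical export name).
#[pyfunction]
fn count_tokens(text: &str) -> usize {
    BPE.encode_with_special_tokens(text).len()
}

/// Count tokens for a batch of strings in a single FFI crossing.
#[pyfunction]
fn count_tokens_batch(texts: Vec<String>) -> Vec<usize> {
    texts
        .iter()
        .map(|t| BPE.encode_with_special_tokens(t).len())
        .collect()
}

#[pymodule]
fn fast_litellm(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(count_tokens, m)?)?;
    m.add_function(wrap_pyfunction!(count_tokens_batch, m)?)?;
    Ok(())
}
```

The monkeypatch shims then point LiteLLM's own functions at these exports behind feature flags, falling back to the pure-Python paths if anything goes wrong.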
After building out all the Rust acceleration, I ran my benchmark comparing baseline LiteLLM against the shimmed version:
| Function | Baseline Time | Shimmed Time | Speedup | Improvement |
|---|---|---|---|---|
| token_counter | 0.000035s | 0.000036s | 0.99x | -0.6% |
| count_tokens_batch | 0.000001s | 0.000001s | 1.10x | +9.1% |
| router | 0.001309s | 0.001299s | 1.01x | +0.7% |
| rate_limiter | 0.000000s | 0.000000s | 1.85x | +45.9% |
| connection_pool | 0.000000s | 0.000000s | 1.63x | +38.7% |
It turns out LiteLLM is already quite well optimized. The core token counting was essentially unchanged (0.6% slower, likely within measurement noise), and the most significant gains came from the more complex operations, like rate limiting and connection pooling, where Rust's concurrent primitives made a real difference.
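To illustrate why the concurrent pieces benefited, here is a simplified fixed-window rate limiter built on DashMap and exposed through PyO3. The struct, method names, and the fixed-window design are assumptions for the sake of the example; the actual implementation in Fast LiteLLM may differ:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

use dashmap::DashMap;
use pyo3::prelude::*;

/// Fixed-window rate limiter keyed by deployment name (illustrative design).
#[pyclass]
struct RateLimiter {
    limit: u64,
    // DashMap shards its internal locks, so updates to different keys
    // proceed without contending on one global mutex.
    windows: DashMap<String, (u64, u64)>, // key -> (window_start_secs, count)
}

#[pymethods]
impl RateLimiter {
    #[new]
    fn new(limit: u64) -> Self {
        RateLimiter { limit, windows: DashMap::new() }
    }

    /// Returns true if the call is allowed within the current one-second window.
    fn try_acquire(&self, key: &str) -> bool {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before epoch")
            .as_secs();
        let mut entry = self.windows.entry(key.to_string()).or_insert((now, 0));
        let (window_start, count) = *entry;
        if window_start != now {
            *entry = (now, 1); // new window: reset the counter
            true
        } else if count < self.limit {
            *entry = (window_start, count + 1);
            true
        } else {
            false
        }
    }
}
```

Because the hot path is a small amount of integer bookkeeping, the win here comes less from raw speed and more from avoiding Python-level locking around shared state.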
Key Takeaways
1. Don't assume existing libraries are under-optimized - the maintainers likely know their domain well.
2. Focus on algorithmic improvements over reimplementation - sometimes a better approach beats a faster language.
3. Micro-benchmarks can be misleading - real-world performance impact varies significantly.
4. The biggest gains often come from the complex parts, not the simple operations.
5. Even "modest" improvements can matter at scale - 45% improvements in rate limiting are meaningful for high-throughput applications.
While the core token counting saw minimal improvement, the rate limiting and connection pooling gains still provide value for high-volume use cases. The infrastructure I built (feature flags, performance monitoring, safe fallbacks) creates a solid foundation for future optimizations.
The project continues as Fast LiteLLM on GitHub for anyone interested in the Rust-Python integration patterns, even if the performance gains were humbling.
Edit: To clarify - the negative result for token_counter is likely within measurement noise, which suggests LiteLLM's token counting is already well-optimized. The 45%+ gains in rate limiting and connection pooling still provide value for high-throughput applications.