THE PROBLEM
Profiling my LangGraph agents showed 50-100ms per invocation, most of it spent outside the LLM call. Two culprits:
1. ThreadPoolExecutor created fresh on every invoke(): ~20ms overhead
2. Checkpointing uses deepcopy(): 52ms for a 35KB state, 206ms for 250KB
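A minimal sketch to reproduce both measurements locally. The state dict is a synthetic stand-in for a checkpoint; absolute numbers vary by machine and by how deeply nested the real state is:

```python
import copy
import time
from concurrent.futures import ThreadPoolExecutor

# Synthetic stand-in for a LangGraph checkpoint state (tens of KB, nested).
state = {f"channel_{i}": {"messages": [{"role": "user", "content": "x" * 50}] * 10}
         for i in range(50)}

# Culprit 1: a fresh ThreadPoolExecutor per invocation pays thread-spawn cost
# on its first task, then throws the threads away.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.submit(lambda: None).result()
fresh_pool_ms = (time.perf_counter() - start) * 1000

# Culprit 2: deepcopy walks the entire object graph on every checkpoint.
start = time.perf_counter()
snapshot = copy.deepcopy(state)
deepcopy_ms = (time.perf_counter() - start) * 1000

print(f"fresh executor + first task: {fresh_pool_ms:.2f}ms")
print(f"deepcopy of state:           {deepcopy_ms:.2f}ms")
```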
THE FIX
Rewrote hot paths in Rust via PyO3:
Checkpoint serialization (serde vs deepcopy):
35KB state: 0.29ms vs 52ms = 178x faster
250KB state: 0.28ms vs 206ms = 737x faster
E2E with checkpointing: 2-3x faster
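The Rust path serializes with serde instead of copying Python objects. As a rough pure-Python illustration of why byte-level serialization beats deepcopy (pickle here is a stand-in, not the library's actual code path), compare the two on the same state:

```python
import copy
import pickle
import time

state = {f"channel_{i}": {"messages": ["x" * 100] * 20} for i in range(50)}

def bench(fn, n=100):
    """Average wall time per call, in ms."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1000

# deepcopy allocates and rebuilds every Python object in the graph...
deepcopy_ms = bench(lambda: copy.deepcopy(state))
# ...while a serializer emits flat bytes in a single pass (serde does the
# same in Rust, with no Python objects in the loop at all).
pickle_ms = bench(lambda: pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL))

print(f"deepcopy: {deepcopy_ms:.3f}ms  pickle: {pickle_ms:.3f}ms")
```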
Drop-in usage:
export FAST_LANGGRAPH_AUTO_PATCH=1

# or explicitly:
from fast_langgraph import RustSQLiteCheckpointer
checkpointer = RustSQLiteCheckpointer("state.db")
KEY INSIGHT
Crossing the PyO3 boundary costs ~1-2μs per call. Rust only wins when you:
- Avoid intermediate Python objects (checkpoint serialization)
- Batch operations (channel updates)
- Handle large data (state > 10KB)
For simple dict operations, Python's C-implemented dict still wins.
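To see why, time a plain dict store against the boundary cost. The 1.5μs threshold below is just the midpoint of the ~1-2μs figure cited above:

```python
import timeit

d = {"key": 0}

# A single dict store takes tens of nanoseconds on typical hardware...
n = 1_000_000
ns_per_op = timeit.timeit("d['key'] = 1", globals={"d": d}, number=n) / n * 1e9

# ...so even an infinitely fast Rust implementation loses once each call
# pays the ~1-2us PyO3 boundary cost on top.
boundary_ns = 1500  # illustrative midpoint of the cited FFI overhead
print(f"dict store: {ns_per_op:.0f}ns vs FFI boundary: {boundary_ns}ns")
```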
Architecture: Python orchestration (compatibility) + Rust hot paths (performance).
Compatibility with upstream LangGraph is checked regularly.
MIT licensed. Feedback welcome.