Hi HN! I built Shimmy, a lightweight AI inference server that can now load HuggingFace SafeTensors models directly without any Python dependencies.
The core problem: I wanted to run HuggingFace models locally but didn't want the heavyweight Python ML stack. Most solutions require Python plus PyTorch plus the transformers library, which can mean 2GB+ of dependencies alone.
What's new in v1.2.0:
• Native SafeTensors support - loads .safetensors files directly in Rust
• 2x faster model loading compared to traditional formats
• Zero Python dependencies - pure Rust implementation
• Still just a 5MB binary (vs 50MB+ alternatives like Ollama)
• Full OpenAI API compatibility for drop-in replacement
Technical details:
- Built with native SafeTensors parsing, not Python bindings (see the sketch below this list)
- Memory-efficient tensor loading with bounds checking
- Tested with 100MB+ models loading in under a second
- Cross-platform: Windows, macOS (Intel/ARM), Linux
- Supports mixed model formats (GGUF + SafeTensors)
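For anyone curious what native SafeTensors parsing looks like in Rust, here's a minimal sketch using HuggingFace's `safetensors` crate. It illustrates the format-level idea (a JSON header describing dtypes and shapes, followed by raw tensor bytes), not necessarily the exact code path Shimmy uses internally:

    use safetensors::SafeTensors;
    use std::fs;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Read the whole .safetensors file; a production loader would
        // likely memory-map large files instead.
        let buffer = fs::read("model.safetensors")?;

        // Parse the JSON header and get zero-copy views of each tensor.
        let st = SafeTensors::deserialize(&buffer)?;
        for (name, view) in st.tensors() {
            println!("{name}: dtype={:?} shape={:?}", view.dtype(), view.shape());
        }
        Ok(())
    }

Because the header is parsed up front and tensor data is referenced in place, shapes and offsets can be bounds-checked before any tensor bytes are touched.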
This bridges the gap between HuggingFace's model ecosystem and lightweight local deployment. You can now grab any SafeTensors model from HuggingFace and run it locally with just a single binary.
GitHub: https://github.com/Michael-A-Kuykendall/shimmy
Install: `cargo install shimmy`
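If you want to poke at the OpenAI-compatible endpoint from Rust, here's a hedged sketch using `reqwest` and `serde_json` (needs `tokio` and reqwest's `json` feature). The base URL and model name are placeholders and the route assumes the standard OpenAI `/v1/chat/completions` path; check the README for the actual defaults:

    use serde_json::json;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Placeholder host/port and model id: substitute whatever your
        // running shimmy instance actually uses.
        let body = json!({
            "model": "my-local-model",
            "messages": [{ "role": "user", "content": "Say hello in one sentence." }]
        });

        let resp = reqwest::Client::new()
            .post("http://localhost:11435/v1/chat/completions")
            .json(&body)
            .send()
            .await?
            .text()
            .await?;

        println!("{resp}");
        Ok(())
    }

Any existing OpenAI client library should work the same way by pointing its base URL at the local server.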
Happy to answer questions about the SafeTensors implementation or Rust AI inference in general!
somesun•4mo ago
I will try it. Does it support GPU acceleration, e.g. CUDA?
One question: can I use it as a library in my Rust project, or can I only call it by spawning a new process with the exe?
MKuykendall•4mo ago
Rust library: Absolutely! Add `shimmy = { version = "0.1.0", features = ["llama"] }` to your Cargo.toml and use the inference engine directly:

    let engine = shimmy::engine::llama::LlamaEngine::new();
    let model = engine.load(&spec).await?;
    let response = model.generate("prompt", opts, None).await?;
No need to spawn processes - just import and use the components directly in your Rust code.