Hey, I’m one of the maintainers of RamaLama[1], which is part of the containers ecosystem (podman, buildah, skopeo). It’s a runtime-agnostic tool for coordinating local AI inference with containers.
I put together a Python SDK for programmatic control over local AI, using ramalama under the hood. Because it’s runtime agnostic, you can use ramalama with llama.cpp, vLLM, mlx, etc., as long as the underlying service exposes an OpenAI-compatible endpoint. This is especially powerful for users deploying to edge or other devices with atypical hardware/software configurations that, for example, require custom runtime builds.
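To make the “OpenAI compatible” part concrete: once a model is being served, any stock OpenAI client can talk to it. A minimal sketch, assuming the server is listening on ramalama’s default serve port of 8080 (the model name is just a placeholder; some runtimes ignore it):

```
from openai import OpenAI

# Local server, so the API key is a dummy value; adjust the port if you
# changed it from the default.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder; depends on what the runtime reports
    messages=[{"role": "user", "content": "How tall is Michael Jordan?"}],
)
print(resp.choices[0].message.content)
```

The SDK takes care of getting a server like that running in the first place: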
```
from ramalama_sdk import RamalamaModel

runtime_image = "quay.io/ramalama/ramalama:latest"
model_ref = "huggingface://ggml-org/gpt-oss-20b-GGUF"

with RamalamaModel(model_ref, base_image=runtime_image) as model:
    response = model.chat("How tall is Michael Jordan?")
    print(response["content"])
```

This SDK manages:
- Pulling and verifying runtime images
- Downloading models (HuggingFace, Ollama, ModelScope, OCI registries; example references below)
- Managing the runtime process
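Each of those sources is selected by the prefix on the model reference. The huggingface:// form is the one from the example above; the other prefixes follow ramalama’s transport naming and the specific paths are only illustrative, so double-check the docs for exact spellings:

```
# huggingface:// matches the example above; the other prefixes follow
# ramalama's transport naming, and the model paths are only illustrative.
hf_model = "huggingface://ggml-org/gpt-oss-20b-GGUF"
ollama_model = "ollama://tinyllama"
modelscope_model = "modelscope://Qwen/Qwen2.5-0.5B-Instruct-GGUF"
oci_model = "oci://quay.io/your-org/your-model:latest"
```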
It works with air-gapped deployments and private registries, and it has async support as well.
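To make the air-gapped / private-registry case concrete, here’s a sketch: both the runtime image and the model reference simply point at infrastructure you control. The registry names below are placeholders, and the oci:// prefix follows ramalama’s transport naming, so check the docs for the exact form:

```
from ramalama_sdk import RamalamaModel

# Placeholders: substitute your own mirrored runtime image and internal
# OCI registry for the model.
runtime_image = "registry.internal.example/mirrors/ramalama:latest"
model_ref = "oci://registry.internal.example/models/gpt-oss-20b-gguf:latest"

with RamalamaModel(model_ref, base_image=runtime_image) as model:
    print(model.chat("Are you reachable offline?")["content"])
```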
If you want to learn more, the documentation is available here: https://docs.ramalama.com/sdk/introduction. Otherwise, I hope this is useful to people out there, and I’d appreciate feedback about where to prioritize next, whether that’s support for other languages, additional features (speech to text? RAG? MCP?), or something else.
1. github.com/containers/ramalama