The Problem
Every team rebuilds the same Kubernetes infrastructure: networking, certificates, monitoring, databases, storage. The existing solutions either lock you into a vendor ecosystem or dump you into raw Kubernetes complexity. We wanted the control of self-hosting without weeks of setup.
Architecture
Our system uses two agent types:
Server agents run on VM hosts and communicate with our backend via gRPC bidirectional streams. When users request a cluster node, the agent provisions a KVM-based VM and bootstraps it.
Node agents run on each Kubernetes node and handle cluster operations, monitoring, and service installations.
Key insight: gRPC streams initiated by agents eliminate firewall configuration and public IP requirements. Agents reach out to our backend, not vice versa.
Why KVM?
- Battle-tested, works great with Ubuntu
- Solid Go bindings via libvirt
- Excellent GPU passthrough for AI workloads like Ollama
- Good isolation/performance balance
Sometimes boring technology is the right choice.
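GPU passthrough in particular is pleasantly boring with libvirt: it's a single hostdev entry in the domain XML. A sketch (the PCI address is an example, not a real host):

```xml
<!-- Attach a host GPU to the guest via VFIO PCI passthrough.
     Replace bus/slot/function with the GPU's actual PCI address
     (see `lspci`); libvirt handles the VFIO binding when managed='yes'. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```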
Provisioning Flow
1. User clicks "Create Cluster"
2. Backend selects available server agents
3. gRPC commands sent to provision VMs
4. KVM VMs spin up (Ubuntu Cloud 24.04, 30-60 seconds)
5. Node agents install and connect
6. Kubernetes bootstrap with kubeadm + Cilium
7. WireGuard mesh established between nodes
8. Storage configured (OpenEBS + Longhorn)
9. Cluster ready (5-10 minutes total)
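Steps 4-5 boil down to cloud-init: the server agent renders a per-node payload and injects it when the VM boots. A hypothetical sketch (the hostname and download URL are placeholders, not our real endpoints):

```yaml
#cloud-config
# Illustrative only: the real payload is generated per node by the
# server agent. The install URL below is a placeholder.
hostname: node-1
package_update: true
runcmd:
  # Fetch and start the node agent, which then dials out to the backend.
  - curl -fsSL https://backend.example/install-node-agent.sh | bash
```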
The WireGuard Decision
We manage WireGuard at the OS level, not Kubernetes level. Why?
- Same VPN secures both K8s traffic and SSH access
- Nodes communicate securely even if Kubernetes fails
- Simpler troubleshooting with separated layers
- Easier multi-cluster peering (coming soon)
Our backend orchestrates WireGuard configs across nodes via the agents. Centrally coordinated, locally executed.
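Concretely, each node ends up with a rendered WireGuard config; a sketch with placeholder keys and addresses:

```ini
# /etc/wireguard/wg0.conf -- rendered by the node agent from the
# backend's desired state. Keys, IPs, and endpoints are placeholders.
[Interface]
PrivateKey = <node-private-key>
Address = 10.8.0.2/24
ListenPort = 51820

[Peer]
PublicKey = <peer-public-key>
AllowedIPs = 10.8.0.3/32
Endpoint = 203.0.113.7:51820
PersistentKeepalive = 25
```

The backend only decides the peer list and addressing; a standard `wg-quick up wg0` on the node does the rest.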
Version Management Hell
The hardest problem? Keeping 20+ services compatible across updates.
We offer one-click installation of: PostgreSQL, MySQL, ClickHouse, Kafka, RabbitMQ, MinIO, Longhorn, Harbor, Traefik, Grafana, Prometheus, Ollama, LiteLLM, Open WebUI, and more.
Each has opinions about K8s versions, storage, and networking. We use Helm charts, operators, and custom YAML as appropriate. The real work is maintaining compatibility matrices and testing every combination.
Deployment Models
RunOS Cloud: Managed dedicated servers with fixed 8 CPU/16GB instances (free trial credits available). KVM handles VM provisioning, with GPU passthrough for AI workloads. Security is locked down tightly while we're in early access.
Bring Your Own Node: Run node agents on any hardware. Complete tenant isolation since you control infrastructure.
Coming soon: Self-managed VM hosts with custom sizing.
What's Next
Agent code will be open source. One company runs three production clusters already. Common feedback: "I can't believe how fast I went from zero to a working cluster with Postgres, Kafka, and monitoring."
We're planning weekly updates here on HackerNews about new features, technical challenges, and production lessons.
Try it at runos.com - free trial credits for 8 CPU threads and 16GB memory.
Questions? Happy to discuss architecture in the comments.