I built Titan, a distributed orchestrator, to explore the core primitives of distributed systems (scheduling, concurrency, failure detection, and IPC) without the complexity of Kubernetes or the overhead of HTTP.
I originally built it for my home-lab to orchestrate a mix of long-running services and batch jobs. Static schedulers worked at first, but as workflows grew more complex, I needed the system to adapt its execution plan based on runtime data, not just follow a predefined DAG.
This led me to evolve Titan from a hybrid scheduler into a runtime where execution graphs can be constructed or modified dynamically. It still supports static YAML DAGs, but now exposes a Python SDK that allows user code to (see the sketch after this list):
Construct dynamic DAGs programmatically
Branch or loop execution based on live signals
Spawn new tasks/infrastructure in response to failures (agentic patterns)
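To make that concrete, here's a toy, in-process sketch of the pattern. To be clear, this is not Titan's actual SDK: Engine, submit, and the task names are invented for illustration. The point is that tasks receive a handle to the scheduler and can append new work while the graph is already executing:

```python
from collections import deque

class Engine:
    """Toy in-process executor: tasks may enqueue more tasks as they run."""
    def __init__(self):
        self.queue = deque()

    def submit(self, fn, *args):
        self.queue.append((fn, args))

    def run(self):
        while self.queue:
            fn, args = self.queue.popleft()
            fn(self, *args)  # each task can mutate the graph via `engine`

def extract(engine):
    rows = list(range(25))
    # Branch on live data: one transform task per 10-row chunk,
    # so the graph's fan-out is decided at runtime, not in YAML.
    for i in range(0, len(rows), 10):
        engine.submit(transform, rows[i:i + 10])

def transform(engine, chunk):
    if len(chunk) < 10:
        # A short chunk stands in for a failure signal here; instead of
        # failing the run, spawn a recovery task (the "agentic" pattern).
        engine.submit(backfill, chunk)
        return
    print("transformed", [x * 2 for x in chunk])

def backfill(engine, chunk):
    print("spawned recovery task for", chunk)

engine = Engine()
engine.submit(extract)
engine.run()
```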
The core engine is written in Java 17 using raw TCP sockets and a custom binary protocol (no HTTP, no external DB; the whole engine is a ~90 KB JAR). Workers use push-based discovery and auto-scale at the application level by spawning ephemeral JVM processes when they saturate.
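For flavor, here's a minimal sketch of what a length-prefixed registration frame for push-based discovery could look like (in Python for brevity; the real engine is Java). The opcode, field layout, and socketpair demo are all assumptions for illustration, not Titan's actual wire format:

```python
# Hypothetical frame: 4-byte BE length prefix, then opcode (1 byte),
# capacity (4 bytes BE), and the worker id as UTF-8 bytes.
import socket
import struct

OP_REGISTER = 0x01  # assumed opcode: worker pushes its presence to the scheduler

def encode_register(worker_id: str, capacity: int) -> bytes:
    payload = struct.pack(">BI", OP_REGISTER, capacity) + worker_id.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf += chunk
    return buf

def read_frame(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo over an in-process socket pair: the "worker" pushes its
# registration, the "scheduler" decodes it.
worker, scheduler = socket.socketpair()
worker.sendall(encode_register("worker-7", capacity=4))
frame = read_frame(scheduler)
op, capacity = struct.unpack(">BI", frame[:5])
print(op, capacity, frame[5:].decode())  # -> 1 4 worker-7
```

The appeal of push over poll here is that a freshly spawned ephemeral worker becomes schedulable as soon as its first frame lands, rather than waiting for the next poll interval.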
This is a research project (no Raft or mTLS yet), but building it from first principles was interesting. I later realized I had independently converged on patterns similar to Nomad (scheduling) and Temporal (workflow semantics).
I’d love feedback on the architecture, specifically:
Runtime graph mutation as a scheduling primitive
Push-based worker discovery (vs the standard pull model)
Tradeoffs of application-level scaling vs centralized autoscalers