But why do these processes need to talk to each other at all?
THE INSIGHT
What if orchestrators never run simultaneously?
Runner-0 executes at T=0s, 10s, 20s...
Runner-1 executes at T=2s, 12s, 22s...
Runner-2 executes at T=4s, 14s, 24s...
Runner-3 executes at T=6s, 16s, 26s...
Runner-4 executes at T=8s, 18s, 28s...
Time-Division Multiple Access (TDMA). Same pattern GSM uses for radio.
GO IMPLEMENTATION
type Runner struct {
    ID, TotalRunners int
    CycleTime        time.Duration
}
func (r Runner) Start() {
    slot := r.CycleTime / time.Duration(r.TotalRunners)
    offset := time.Duration(r.ID) * slot

    for {
        time.Sleep(time.Until(computeNextSlot(offset, r.CycleTime)))
        r.reconcile() // check workers, start if needed
    }
}

Each runner gets 2s in a 10s cycle. No overlap = zero coordination.
SQLITE CONFIG
PRAGMA journal_mode=WAL;

dbWrite.SetMaxOpenConns(1)  // one writer
dbRead.SetMaxOpenConns(10)  // concurrent reads
With TDMA, writers never contend across processes, so busy_timeout never triggers.
THE MATH
Capacity = SlotDuration / TimePerWorker = 2000ms / 10ms = 200 workers per runner
5 runners  = 1000 workers
25 runners = 5000 workers (25s cycle, 12.5s avg latency)
For batch jobs running hours, 10s detection latency is irrelevant.
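The capacity arithmetic above, spelled out (all figures are the writeup's: 10s cycle, 5 runners, ~10ms per worker check):

```go
package main

import "fmt"

func main() {
	const (
		cycle     = 10_000 // ms, full TDMA cycle
		runners   = 5
		perWorker = 10 // ms to check one worker's state
	)
	slot := cycle / runners       // 2000ms slot per runner
	perRunner := slot / perWorker // 200 workers per runner
	total := perRunner * runners  // 1000 workers across the deployment
	avgLatency := cycle / 2       // a state change waits half a cycle on average

	fmt.Println(slot, perRunner, total, avgLatency) // 2000 200 1000 5000
}
```

The same arithmetic gives the 25-runner figures: a 25s cycle yields 5000 workers at 12.5s average detection latency.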
BENCHMARKS (real data from docs/papers)
System    | Writes/s | Latency | Nodes | Use Case
etcd      | 10,000   | 25ms    | 3-5   | Config
ZooKeeper | 8,000    | 50ms    | 5     | Election
Temporal  | 2,000    | 100ms   | 15-20 | Workflows
Airflow   | 300      | 2s      | 2-3   | Batch
TDMA-SPI  | 40       | 5s avg  | 1-5   | Batch
WHAT YOU GAIN:
- Zero consensus protocols (no Raft/Paxos)
- Single-node deployment possible
- Deterministic behavior
- Radical simplicity
WHAT YOU SACRIFICE:
- Real-time response (<1s)
- High frequency (>1000 ops/sec)
- Arbitrary scale (limit ~5000 workers)
UNIVERSAL PATTERN
Wireless Sensor Networks: DD-TDMA (IEEE 2007) - same pattern
Kubernetes Controllers: reconcile every 5-10s (implicit TDMA)
Build Systems: time-slice job claims vs SELECT FOR UPDATE
WHY ISN'T THIS COMMON?
1. Cultural bias: the industry teaches "add a consensus layer" as the default
2. TDMA sounds old: it's from 1980s telecoms (but old ≠ bad)
3. SQLite is underestimated: it actually handles 50K-100K writes/sec on NVMe
4. Most examples optimize for microservices (1000s of ops/sec), not batch
WHEN NOT TO USE:
- Microservices (<100ms latency needed)
- Real-time systems (trading, gaming)
- >10,000 operations/sec required
GOOD FOR:
- Batch processing
- ML training orchestration
- ETL pipelines (hourly/daily)
- Video/image processing
- Anything where task duration >> detection latency
THE REAL LESSON
Modern distributed systems thinking:
1. Assume coordination is needed
2. Pick a consensus protocol
3. Deal with the complexity
Alternative:
1. Can processes avoid each other? (temporal isolation)
2. Can data be partitioned? (spatial isolation)
3. Is eventual consistency OK?
If yes to all three: you might not need coordination at all.
CONCLUSION
I built a simple orchestrator for batch workers and rediscovered a 40-year-old telecom pattern that eliminates distributed coordination entirely.
The pattern: TDMA + spatial partitioning + SQLite. The application to workflow orchestration seems novel.
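Spatial partitioning is named here but not shown above. One common realization (hash-mod ownership — my assumption, not necessarily this project's scheme) gives each runner a disjoint set of workers, so even with some clock skew two runners never write the same rows:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerOf maps a worker ID to the single runner responsible for it.
// Disjoint ownership is the spatial half of the isolation story: even
// if TDMA slots overlap slightly under clock drift, no two runners
// ever touch the same worker's state.
func ownerOf(workerID string, totalRunners int) int {
	h := fnv.New32a()
	h.Write([]byte(workerID))
	return int(h.Sum32()) % totalRunners
}

func main() {
	for _, w := range []string{"job-a", "job-b", "job-c"} {
		fmt.Printf("%s -> runner %d\n", w, ownerOf(w, 5))
	}
}
```

Any deterministic, stable assignment works; hashing just avoids keeping an ownership table.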
If Kubernetes feels like overkill, maybe time-slicing is enough.
Sometimes the best distributed system is one that doesn't need to be distributed.
---
Full writeup: [blog link]
Code: [github link]
Discussion: Anyone else using time-based scheduling for coordination-free systems? What about high clock skew networks?
Orchestrators: active 1/n of the time (~10ms to check state)
Workers: run continuously for hours once started

T=0s:  Orchestrator-0 checks → starts job (runs 2 hours)
T=2s:  Orchestrator-1 checks → job still running
T=10s: Orchestrator-0 checks again → job still running
Think: traffic lights (TDMA) vs cars (drive continuously).
Work throughput is unchanged. TDMA only coordinates who checks when.