I built Faultline, a PostgreSQL-backed distributed job execution engine using:
- Lease-based claims
- Retry scheduling
- Idempotent side effects via a ledger table
- A deterministic race reproduction harness
The interesting part wasn’t the happy path. It was the lease-expiry race.
Setup:
- Lease TTL: 1s
- Worker A sleeps 2.5s (forces expiry)
- Barrier enforces deterministic ordering
- Worker B reclaims the job
Structured trace:
    {"event": "lease_acquired", "job_id": "...", "token": 1, "forced": true}
    {"event": "execution_started", "job_id": "...", "token": 1}
    {"event": "lease_acquired", "job_id": "...", "token": 2, "forced": true}
    {"event": "execution_started", "job_id": "...", "token": 2}
    {"event": "stale_write_blocked", "job_id": "...", "stale_token": 1, "current_token": 2, "reason": "token_mismatch"}
    {"event": "worker_exit", "reason": "stale"}
    {"event": "worker_exit", "reason": "success"}
Worker A believed it still owned the lease. Worker B legitimately reclaimed it.
Without fencing, nothing stops Worker A from applying its mutation after the lease has moved on to Worker B.
UNIQUE(job_id) alone is insufficient: it prevents duplicate rows but does not encode which lease epoch owns the job.
The fix:
- Add `fencing_token BIGINT`
- Increment atomically on every lease acquisition
- Bind side effects to `(job_id, fencing_token)`
- Enforce a write gate before mutation
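One way the write gate could look, as a minimal sketch rather than Faultline's actual code. It uses Python's built-in `sqlite3` in place of PostgreSQL so it runs standalone; the schema, the `effect` column, and the `make_db` and `gated_write` helpers are all hypothetical. The idea is that the ledger insert selects from `jobs` and only produces a row when the caller's token still matches the current one, so the check and the write happen in a single statement.

```python
import sqlite3

def make_db():
    # Hypothetical minimal schema: only the columns the gate needs.
    db = sqlite3.connect(":memory:")
    db.executescript(
        "CREATE TABLE jobs (id TEXT PRIMARY KEY, fencing_token INTEGER NOT NULL);"
        "CREATE TABLE ledger ("
        "  job_id TEXT NOT NULL,"
        "  fencing_token INTEGER NOT NULL,"
        "  effect TEXT NOT NULL,"
        "  UNIQUE (job_id, fencing_token));"
    )
    return db

def gated_write(db, job_id, token, effect):
    """Record `effect` only if `token` is still the job's current fencing token.

    Returns True if a ledger row was written, False if the token was stale
    or the same (job_id, fencing_token) pair was already recorded.
    """
    before = db.total_changes
    db.execute(
        # The SELECT yields zero rows on a token mismatch, so nothing is inserted.
        # OR IGNORE turns a repeat of the same (job_id, token) into a no-op.
        "INSERT OR IGNORE INTO ledger (job_id, fencing_token, effect) "
        "SELECT id, fencing_token, ? FROM jobs "
        "WHERE id = ? AND fencing_token = ?",
        (effect, job_id, token),
    )
    db.commit()
    return db.total_changes - before == 1
```

With the job's current token at 2, a call carrying token 1 returns `False` and leaves the ledger empty, a call carrying token 2 returns `True`, and repeating that second call returns `False` because the unique pair already exists.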
Claim logic:
    UPDATE jobs
    SET state = 'running',
        lease_owner = $1,
        lease_expires_at = NOW() + make_interval(secs => $2),
        fencing_token = fencing_token + 1,
        updated_at = NOW()
    WHERE id = $3
      AND (
        state = 'queued'
        OR (state = 'running' AND lease_expires_at < NOW())
      )
    RETURNING id, fencing_token;
Lease validity depends solely on DB time (`NOW()`); workers never use local clocks for correctness.
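To make the claim predicate concrete, here is a small single-threaded model of the `WHERE` clause in Python. It is only a sketch: the `Job` dataclass and `try_claim` function are hypothetical, and `db_now` stands in for the value the database would return from `NOW()`. It does not model the atomicity that the single `UPDATE` statement provides in PostgreSQL.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    state: str = "queued"
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0
    fencing_token: int = 0

def try_claim(job: Job, worker: str, db_now: float, ttl_secs: float) -> Optional[int]:
    """Mirror the claim UPDATE: return the new fencing token, or None if not claimable."""
    claimable = job.state == "queued" or (
        job.state == "running" and job.lease_expires_at < db_now
    )
    if not claimable:
        return None  # corresponds to the UPDATE matching zero rows
    job.state = "running"
    job.lease_owner = worker
    job.lease_expires_at = db_now + ttl_secs  # expiry is derived from DB time only
    job.fencing_token += 1                    # every acquisition starts a new epoch
    return job.fencing_token
```

With a 1s TTL, a claim by worker A at `db_now=100.0` returns token 1, an attempt by worker B at `100.5` returns `None` because the lease is still live, and an attempt at `102.5` returns token 2. Because the comparison is strict, an attempt at exactly `101.0` is also rejected.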
Guarantees under forced expiry + reclaim:
- No duplicate side effects
- No stale worker mutation
- Deterministic reproduction of the race
- DB-enforced epoch ownership via `(job_id, fencing_token)`
The harness forces this race deterministically via barrier gating and forced TTL expiry.
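The barrier gating can be illustrated with a stripped-down model, using `threading.Barrier` from the Python standard library. This is a sketch under heavy assumptions, not the real harness: the `Store` class is a hypothetical in-memory stand-in for the jobs and ledger tables, the event names are simplified, and the wait on the second barrier phase stands in for Worker A sleeping past the TTL.

```python
import threading

class Store:
    """Hypothetical in-memory stand-in for the jobs row and the ledger."""
    def __init__(self):
        self._lock = threading.Lock()
        self.token = 0
        self.ledger = []
        self.trace = []

    def claim(self, worker):
        with self._lock:
            self.token += 1  # every acquisition bumps the fencing token
            self.trace.append(("lease_acquired", worker, self.token))
            return self.token

    def write(self, worker, token):
        with self._lock:
            if token != self.token:  # write gate: reject stale epochs
                self.trace.append(("stale_write_blocked", worker, token))
                return False
            self.ledger.append(token)
            self.trace.append(("write_committed", worker, token))
            return True

def run_race():
    store = Store()
    barrier = threading.Barrier(2)  # reusable; each wait() pair is one phase

    def worker_a():
        token = store.claim("A")
        barrier.wait(timeout=5)  # phase 1: A holds the lease
        barrier.wait(timeout=5)  # phase 2: B has reclaimed (A "slept" past the TTL)
        store.write("A", token)  # stale attempt
        barrier.wait(timeout=5)  # phase 3: A's attempt is finished

    def worker_b():
        barrier.wait(timeout=5)  # phase 1
        token = store.claim("B")  # lease treated as expired
        barrier.wait(timeout=5)  # phase 2
        barrier.wait(timeout=5)  # phase 3
        store.write("B", token)

    threads = [threading.Thread(target=worker_a), threading.Thread(target=worker_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.trace, store.ledger
```

Because every step sits between two barrier phases, the interleaving is the same on every run: A acquires, B reclaims, A's write is blocked, B's write commits. The ledger ends up holding only token 2.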
Curious how others handle fencing under lease-based execution, specifically fencing token overflow at scale and whether lease renewal changes the fencing guarantee.
kritibehl•1h ago