Most schedulers tell you a job retried 3 times. ReTraced tells you:
- When each retry happened (with timestamps)
- Why it retried (transient vs. permanent failure)
- Whether it was automatic or manually triggered
- The full audit trail before the job lands in the DLQ
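To make the "records, not counts" idea concrete, here is a minimal sketch of what one retry attempt could look like as structured data. The names (`RetryAttempt`, `FailureKind`, `Trigger`, `manual_retries`) are hypothetical and chosen for illustration; they are not ReTraced's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class Trigger(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"


@dataclass(frozen=True)
class RetryAttempt:
    """One retry attempt, kept as a record instead of a bare counter (hypothetical shape)."""
    job_id: str
    attempt: int               # 1-based attempt number
    at: datetime               # when the retry happened
    failure_kind: FailureKind  # why the previous run failed
    trigger: Trigger           # automatic or manually triggered
    error: str                 # failure message carried into the audit trail


def manual_retries(history: list[RetryAttempt]) -> list[RetryAttempt]:
    """Return only the attempts an operator triggered by hand."""
    return [a for a in history if a.trigger is Trigger.MANUAL]
```

With records like these, questions such as "which retries were manual?" become a filter over the history rather than something you reconstruct from logs.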
I built this because, while stress-testing, I found a backoff timing bug that only showed up once I could see retry attempts as structured data. I expected exponential delays (5s → 10s → 20s), but the actual timestamps showed plateaus of ~6s. That visibility made debugging trivial.
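The kind of check that exposes a bug like this can be sketched in a few lines: compare the gaps between recorded attempt timestamps against the expected exponential schedule. This is an illustrative sketch, not ReTraced code; the function names, the 5s base, the doubling factor, and the 1s tolerance are all assumptions.

```python
from datetime import datetime, timedelta


def expected_delays(base_s: float, factor: float, count: int) -> list[float]:
    """Expected exponential schedule, e.g. 5s, 10s, 20s for base 5 and factor 2."""
    return [base_s * factor**i for i in range(count)]


def observed_gaps(timestamps: list[datetime]) -> list[float]:
    """Seconds between consecutive attempt timestamps."""
    return [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]


def find_drift(timestamps, base_s=5.0, factor=2.0, tolerance_s=1.0):
    """Return (retry_number, expected_s, observed_s) for each gap outside the tolerance."""
    gaps = observed_gaps(timestamps)
    expected = expected_delays(base_s, factor, len(gaps))
    return [
        (i + 1, exp, got)
        for i, (exp, got) in enumerate(zip(expected, gaps))
        if abs(got - exp) > tolerance_s
    ]


# Made-up timestamps that plateau at 6s between attempts.
t0 = datetime(2024, 1, 1, 12, 0, 0)
stamps = [t0 + timedelta(seconds=s) for s in (0, 6, 12, 18)]
drift = find_drift(stamps)  # flags retries 2 and 3; retry 1 is within tolerance
```

A bare retry count of 3 would look identical for the correct and the buggy schedule; only the per-attempt timestamps make the plateau visible.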
*Core ideas:*
- Retry attempts stored as queryable records (not just counts)
- Per-job retry strategies (fixed, linear, exponential + jitter)
- DLQ is first-class with full failure context
- Redis-backed, at-least-once semantics
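As a rough illustration of the three strategy families, the sketch below computes the delay before a given retry. It is not ReTraced's implementation: the `retry_delay` function, its parameters, the doubling factor, the cap, and the choice of additive proportional jitter are all assumptions made for the example.

```python
import random


def retry_delay(strategy: str, attempt: int, base_s: float = 5.0,
                max_s: float = 300.0, jitter: float = 0.0, rng=random) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    jitter is a fraction: 0.5 adds up to 50% of the delay at random.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if strategy == "fixed":
        delay = base_s                       # 5, 5, 5, ...
    elif strategy == "linear":
        delay = base_s * attempt             # 5, 10, 15, ...
    elif strategy == "exponential":
        delay = base_s * 2 ** (attempt - 1)  # 5, 10, 20, ...
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    delay = min(delay, max_s)
    if jitter > 0:
        # Spread retries out so many failing jobs do not wake up together,
        # while still respecting the cap.
        delay = min(delay * (1 + rng.uniform(0.0, jitter)), max_s)
    return delay
```

Keeping the strategy as plain data (a name plus a few numbers) is one way to make it configurable per job and easy to store alongside the attempt records.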
*Current state:* v1.0.0. The core model is stable and usable for internal tools and experimentation. The goal is a production-ready, self-hostable release within a year.
I'm actively looking for feedback on:
- DLQ replay strategies
- Redis coordination patterns vs Postgres
- Retry strategy edge cases I'm missing
GitHub: https://github.com/Anshikakalpana/ReTraced
Docs: https://re-trace-five.vercel.app/
Would love your thoughts, especially from folks who've built or operated schedulers in production.