I kept getting annoyed by LLM inference non-reproducibility, and one thing that really surprised me is that changing batch size can change outputs even under “deterministic” settings.
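For context, by "deterministic settings" I mean the usual knobs, something like this (a PyTorch sketch; the exact flags depend on your stack). Even with all of these set, batching can change kernel reduction order, so outputs can still drift:

    import os
    import torch

    # Pin seeds and force deterministic kernel selection.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some cuBLAS ops
    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False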
So I built DetLLM: it measures and verifies repeatability using token-level traces plus a first-divergence diff, and writes a minimal repro pack for every run (env snapshot, run config, applied controls, traces, report).
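The first-divergence diff is the core of the report: compare two token traces and point at the first place they disagree. Roughly this (an illustrative sketch, not DetLLM's actual API; the names are made up):

    def first_divergence(trace_a, trace_b):
        """Index of the first differing token between two traces, or None if identical."""
        for i, (a, b) in enumerate(zip(trace_a, trace_b)):
            if a != b:
                return i
        if len(trace_a) != len(trace_b):
            # One trace is a strict prefix of the other.
            return min(len(trace_a), len(trace_b))
        return None

    # e.g. first_divergence([101, 2023, 318], [101, 2023, 319]) -> 2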
I prototyped this version today in a few hours with Codex. The hardest part was the high-level design I did a few days ago; I was honestly surprised by how well Codex handled the implementation. I didn't expect it to come together in under a day.
Would love feedback, and please share any prompts/models/setups that still make it diverge.