When swapping models or tweaking prompts, subtle regressions can slip in:
- cost spikes
- format drift
- PII leakage
Traditional CI assumes deterministic output, which LLMs don't give you.
We built a small local-first CLI that compares baseline vs candidate outputs and returns ALLOW / WARN / BLOCK based on cost, drift, and PII.
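Roughly, the gate works like this (a minimal Python sketch; the thresholds, the regex PII patterns, the length-based drift proxy, and all names are illustrative simplifications, not the actual CLI):

```python
# Illustrative sketch of an ALLOW / WARN / BLOCK gate over a
# baseline/candidate pair. Thresholds and checks are placeholders.
import re
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    cost_usd: float

# Crude PII patterns for the sketch (real detection is more involved).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US-SSN-shaped numbers
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
]

def gate(baseline: Sample, candidate: Sample,
         cost_warn: float = 1.10, cost_block: float = 1.50) -> str:
    """Return ALLOW, WARN, or BLOCK for one baseline/candidate pair."""
    # PII leakage is treated as a hard failure.
    if any(p.search(candidate.text) for p in PII_PATTERNS):
        return "BLOCK"

    # Cost regression relative to the baseline run.
    ratio = candidate.cost_usd / max(baseline.cost_usd, 1e-9)
    if ratio > cost_block:
        return "BLOCK"

    # Format drift, proxied here by relative change in output length.
    drift = abs(len(candidate.text) - len(baseline.text)) / max(len(baseline.text), 1)
    if ratio > cost_warn or drift > 0.3:
        return "WARN"

    return "ALLOW"
```

The point of the design is that the gate itself is deterministic and runs locally, even though the model outputs it compares aren't.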
Curious how others are handling this problem:
Are you snapshot testing?
Using SaaS evaluation tools?
Relying on manual review?
Not gating at all?
Would love to understand real workflows.