Small prompt changes would fix one extraction but silently break another. Nothing threw an error – downstream systems just started receiving slightly different JSON.
Traditional unit tests didn't help because the raw LLM output itself was changing.
Continuum records the full workflow run (LLM call, parsed JSON, etc.) and replays it deterministically so CI can catch drift before production.
Curious how others are handling this problem in real systems. Are people snapshot-testing LLM outputs today?
Mofa1245•1h ago
A prompt tweak or model update can shift an extraction from:
amount: 72
to:
amount: "72.00"
Nothing crashes. Downstream consumers just start receiving a string where they expected a number.
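That kind of drift is easy to catch mechanically if you compare against a known-good output with type-strict rules. This is a minimal sketch of such a check, not Continuum's actual implementation (`diff_outputs` is a hypothetical helper):

```python
import json

def diff_outputs(expected, actual, path="$"):
    """Recursively compare two parsed JSON values, flagging type drift
    (e.g. int -> str), added/removed keys, and changed values."""
    drifts = []
    if type(expected) is not type(actual):
        drifts.append(f"{path}: type {type(expected).__name__} -> {type(actual).__name__}")
    elif isinstance(expected, dict):
        for key in sorted(expected.keys() | actual.keys()):
            if key not in expected or key not in actual:
                drifts.append(f"{path}.{key}: key added or removed")
            else:
                drifts.extend(diff_outputs(expected[key], actual[key], f"{path}.{key}"))
    elif expected != actual:
        drifts.append(f"{path}: {expected!r} -> {actual!r}")
    return drifts

golden  = json.loads('{"invoice": "A-1", "amount": 72}')
drifted = json.loads('{"invoice": "A-1", "amount": "72.00"}')
print(diff_outputs(golden, drifted))
# -> ['$.amount: type int -> str']
```

A plain `==` on the two dicts would also fail here, but a recursive diff tells you *where* and *how* the output drifted, which is what you need when a pipeline has many steps.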
I built Continuum to let you test AI workflows the way you unit-test code.
It records a known-good run and replays it deterministically in CI. If any step output drifts, verification fails.
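The record-and-replay loop can be sketched roughly like this. This is an illustration of the idea only, not Continuum's API; `pipeline`, `record`, and `replay` are hypothetical names:

```python
import json
import pathlib

RECORDING = pathlib.Path("golden_run.json")  # hypothetical recording file

def pipeline(llm_call):
    """Workflow under test: one LLM call, then a parsing step."""
    raw = llm_call("Extract the invoice total as JSON.")
    return {"raw": raw, "parsed": json.loads(raw)}

def record(llm_call):
    """Capture a known-good run, including the raw model output."""
    RECORDING.write_text(json.dumps(pipeline(llm_call)))

def replay():
    """Re-run every step after the LLM with the recorded raw output
    pinned, so the run is deterministic. If any step now produces a
    different result than it did at record time, verification fails."""
    golden = json.loads(RECORDING.read_text())
    result = pipeline(lambda prompt: golden["raw"])  # deterministic LLM stub
    assert result["parsed"] == golden["parsed"], "step output drifted"

# record(real_llm_call)  # once, when the run is known-good
# replay()               # in CI, on every change
```

Because replay stubs the LLM with the recorded response, CI stays fast, free, and deterministic; drift in any downstream step shows up as a hard failure rather than quietly different JSON.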
Local-first CLI. No SaaS.
Example included: invoice extraction pipeline with deterministic replay.
Repo: https://github.com/Mofa1245/Continuum