Problem: agent “evals” are often flaky (network, time, tool nondeterminism, model drift), so it’s hard to tell if a change actually broke behavior.
What Trajectly does:
- records an agent run once (inputs, tool calls, outputs)
- replays it deterministically offline as a test fixture (so CI is stable)
- checks a TRT “contract” (allowed tools/sequence, budgets, invariants, etc.)
- when something breaks, pinpoints the earliest violating step and can shrink the run to a minimal counterexample
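To make the record/replay and contract ideas concrete, here is a rough sketch in plain Python. It is not Trajectly's API: `Step`, `ReplayTools`, `first_violation`, and the tool names are all hypothetical, and the "contract" here is reduced to an allow-list plus a call budget.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded tool call (hypothetical shape, not Trajectly's format)."""
    tool: str
    args: dict
    result: str


class ReplayTools:
    """Serves recorded results in order instead of calling live tools."""

    def __init__(self, recorded):
        self._recorded = list(recorded)
        self._cursor = 0

    def call(self, tool, args):
        if self._cursor >= len(self._recorded):
            raise AssertionError(f"unexpected extra call to {tool!r}")
        step = self._recorded[self._cursor]
        if (step.tool, step.args) != (tool, args):
            raise AssertionError(
                f"step {self._cursor}: expected {step.tool!r}, got {tool!r}"
            )
        self._cursor += 1
        return step.result


def first_violation(trace, allowed_tools, max_calls):
    """Return (index, reason) for the earliest contract violation, or None."""
    for i, step in enumerate(trace):
        if step.tool not in allowed_tools:
            return i, f"tool {step.tool!r} is not allowed"
        if i >= max_calls:
            return i, f"call budget of {max_calls} exceeded"
    return None


# A recorded run of an imaginary procurement agent.
recorded = [
    Step("lookup_vendor", {"id": 7}, "acme"),
    Step("get_quote", {"vendor": "acme"}, "1200"),
    Step("send_wire", {"amount": 1200}, "ok"),
]

# Replay: the agent gets the recorded answer, with no network involved.
tools = ReplayTools(recorded)
replayed = tools.call("lookup_vendor", {"id": 7})

# Contract check: report the earliest step that breaks the rules.
violation = first_violation(
    recorded,
    allowed_tools={"lookup_vendor", "get_quote", "approve"},
    max_calls=5,
)
print(replayed, violation)
```

Because the replay layer never touches the network or the clock, the same fixture gives the same answer on every CI run, and the check reports step 2 (`send_wire`) as the first violation.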
You can try it locally (no signup):
1. pip install trajectly
2. run one of the standalone demos:
   - procurement approval agent demo
   - support escalation agent demo

Or clone the main repo and run the GitHub Actions example.
Repo: https://github.com/trajectly/trajectly
I’m around to answer questions. I’d love feedback on:
- Which contract checks would be most useful in real agent deployments?
- Which integrations would you want first (LangGraph / LangChain / custom tool runners)?
- Is the “shrink to minimal failing trace” output understandable?
ashmawy•1h ago