Modern ML mostly runs on other people's hardware (clouds, inference marketplaces, heterogeneous GPU fleets). Today you can't verify what actually ran: which model, which inputs, or whether your result was silently downgraded (model swap, early exit, quantization, ad-embedding tweaks). The usual fix (recompute and compare) breaks because GPU execution is nondeterministic: IEEE-754 floating-point arithmetic is non-associative, kernels reorder reductions, and thread schedules and atomics vary between runs. Bitwise replay at scale is brittle and costly.
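To see the failure in miniature (plain NumPy, nothing to do with our runtime): summing the same million float32 values in two legitimate orders already changes the bits.

    # float32 addition is not associative, so two valid reduction orders
    # produce (slightly) different results -- which breaks bitwise replay.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    s1 = x.sum(dtype=np.float32)                       # one reduction order
    s2 = x.reshape(1000, 1000).sum(axis=0, dtype=np.float32).sum(dtype=np.float32)

    print(s1 == s2)                    # typically False
    print(abs(float(s1) - float(s2)))  # tiny but nonzero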
We built a system that makes ML results verifiable without requiring determinism. Instead of demanding exact equality, we verify outputs up to principled, per-operator error bounds and resolve disagreements with an optimistic, Merkle-anchored dispute game.
What's inside (precise + concise):
Semantics: "Tolerance-aware correctness" for tensor programs and operator-specific acceptance regions induced by IEEE-754 rounding.
Two error models:
Theoretical IEEE-754 bounds (sound, per-operator, element-wise; conservative but cheap to verify).
Empirical percentile thresholds calibrated across GPUs (tight, model-specific).
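To make the leaf-level check concrete, here's a rough sketch for a single matmul (illustrative only: the helper names are made up for this comment, and the per-element radius is just the textbook forward-error bound |C_hat - C| <= gamma_K * (|A| @ |B|) with gamma_K = K*u/(1 - K*u); our real per-operator bounds are more involved).

    # Sketch of tolerance-aware acceptance for one operator (matmul).
    import torch

    def theoretical_matmul_bound(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        u = torch.finfo(A.dtype).eps / 2        # unit roundoff (2^-24 for float32)
        K = A.shape[-1]                         # inner (reduction) dimension
        gamma = (K * u) / (1 - K * u)
        return gamma * (A.abs() @ B.abs())      # element-wise acceptance radius

    def accept(C_claimed, C_reference, bound, empirical_tol=None):
        err = (C_claimed - C_reference).abs()
        if bool((err <= bound).all()):          # (i) sound theoretical certificate
            return True
        if empirical_tol is not None:           # (ii) calibrated percentile fallback
            return bool((err <= empirical_tol).all())
        return False

    A, B = torch.randn(64, 128), torch.randn(128, 32)
    C_ref = (A.double() @ B.double()).float()   # higher-precision reference
    print(accept(A @ B, C_ref, theoretical_matmul_bound(A, B)))   # expected: True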
Dispute protocol: If a result is challenged, we recursively partition the traced graph (with Merkle proofs anchoring each subgraph's claimed I/O) until the dispute narrows to a single operator. At that leaf, we either (i) certify the claim against the theoretical bound or (ii) run a small committee vote against the empirical threshold. Proven violations are slashed.
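In miniature, the narrowing step behaves like a binary search over the traced operator sequence (heavily simplified sketch; challenger_disputes stands in for the real exchange of Merkle-anchored claims and bonds):

    # Sketch: narrow a disputed trace to one operator. `challenger_disputes(seg)`
    # is a stand-in for the challenger asserting that the prover's committed
    # output for `seg` is not within tolerance of its committed input.
    def find_disputed_operator(ops, challenger_disputes):
        segment = list(ops)
        while len(segment) > 1:
            mid = len(segment) // 2
            left, right = segment[:mid], segment[mid:]
            # the challenger must dispute at least one half, else the challenge fails
            segment = left if challenger_disputes(left) else right
        return segment[0]   # leaf: certify via theoretical bound or committee vote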
Runtime + coordinator: a PyTorch-compatible runtime traces graphs, computes bounds on the fly, records subgraph I/O, and emits/validates commitments; a lightweight coordination layer handles authenticated logs and bonds (we prototyped it on Ethereum, but the protocol doesn't depend on a blockchain). Overhead: ~0.3% added latency on Qwen3-8B, with no extra memory beyond native subgraph execution.
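For flavor, here is a toy version of the commitment side (not our actual trace format, chunking, or hash scheme): trace with torch.fx, hash recorded tensors, and fold the hashes into a Merkle root that the prover posts and bonds.

    # Toy commitment over a traced graph's recorded I/O.
    import hashlib
    import torch
    from torch import fx

    def tensor_leaf(t: torch.Tensor) -> bytes:
        return hashlib.sha256(t.detach().cpu().contiguous().numpy().tobytes()).digest()

    def merkle_root(leaves: list[bytes]) -> bytes:
        nodes = list(leaves) or [hashlib.sha256(b"").digest()]
        while len(nodes) > 1:
            if len(nodes) % 2:                  # duplicate last node on odd levels
                nodes.append(nodes[-1])
            nodes = [hashlib.sha256(a + b).digest()
                     for a, b in zip(nodes[::2], nodes[1::2])]
        return nodes[0]

    class Net(torch.nn.Module):
        def forward(self, x):
            return torch.relu(x @ x.t()).sum(dim=-1)

    traced = fx.symbolic_trace(Net())           # operator-level trace
    x = torch.randn(8, 8)
    leaves = [tensor_leaf(x), tensor_leaf(traced(x))]   # in practice: every subgraph's I/O
    print(merkle_root(leaves).hex())            # the commitment a prover would post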
Scope: Designed for the open-model setting (public weights/graphs enable subgraph extraction and thresholding). Closed-model APIs can still expose commitments to permissioned verifiers. No TEEs required; no deterministic kernels needed.
We’d love feedback on blind spots in the threat model, better partition policies for the dispute game, and where empirical thresholds might be gamed or leak information.