Reproducibility and artifact parity

We publish reproducible artifact hashes and full environment manifests for NVIDIA, AMD, Intel CPU, AMD CPU, Apple M4, and Google TPU. We do not distribute proprietary binaries or IP; instead, the PDF lists the ROLV artifact hash (identical across platforms), the container manifests, and the exact command lines and verification tests you can run to confirm matching outputs, checksums, Nsight/perf traces, and power logs.
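Checking artifact-hash parity takes only a few lines. A minimal sketch — the file name and expected hash below are placeholders, not the published ROLV values, which you would copy from the PDF:

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare a local artifact against the hash published in the manifest."""
    return sha256_of(path) == expected_hex

# Demo on a throwaway file; in practice `path` is your downloaded artifact
# and `expected` is the hash printed in the PDF (both are placeholders here).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"demo artifact bytes")
    demo_path = f.name

expected = hashlib.sha256(b"demo artifact bytes").hexdigest()
print(verify(demo_path, expected))
os.unlink(demo_path)
```

Because the published hash is stated to be identical across platforms, the same one-liner confirms parity on every vendor's hardware.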
What we validated and why it matters

- Cross-platform parity — identical outputs and checksums across vendor GPUs, CPUs, and TPUs, eliminating measurement drift from build differences.
- Vendor comparisons — benchmarks against vendor dense kernels and vendor sparse libraries (cuBLAS/cuSPARSE, ROCm sparse, vendor BLAS on CPUs, TPU sparse primitives where available), with per-kernel wall time, memory-transfer time, and conversion overheads.
- Energy and throughput — kernel energy where measurable, end-to-end token throughput for LLM slices, and iteration times for non-LLM workloads; Nsight traces and power logs are referenced.

Standout, independently validated numbers (March 2026)

- Kimi K2.5 expert FFN (7168×2048, batch=512, ~87% sparsity) on a commodity Intel Xeon (13 GB usable RAM): dense baseline 228.38 ms → ROLV 6.36 ms per iteration (35.9×); token throughput 2,240 → 80,500 t/s; kernel energy 16,283.97 J → 350.74 J (97.8% saved).
- Finite element solver (mobile phone chassis drop test): 193.16× speedup; 99.5% energy saved (multi-CPU).
- LLM proxy matrix (4096×5120, 50% sparsity) on NVIDIA B200: 158.72× speedup; 99.37% energy saved; 40.5M t/s with an Nsight-validated tolerance harness.
- Large recommendation GEMM (Meta-style ranking): 98.76× speedup; 99.0% energy saved.
- Additional production and research workloads (GNNs, ViT attention, MusicGen, Llama shapes) are listed in the PDF with per-run sparsities and exact matrix shapes.

Methodology highlights (what to inspect in the PDF)

- Exact shapes and sparsities — matrix dimensions, sparsity pattern (random/pruned/structured), and batch sizes.
- Baseline definitions — vendor dense-kernel and vendor sparse-library baselines include conversion costs; we report both raw kernel times and end-to-end times.
- Measurement rig — wall-clock timing, Nsight kernel timelines, and device power-sampling points; CPU runs include perf counters and the exact kernel-invocation sequence.
- Tolerance and correctness — numerical tolerance checks, output checksums, and the unit tests used to validate functional equivalence.
- Repro scripts — container manifests and run_benchmark verification commands are referenced so reviewers can rerun the verification tests and compare hashes and checksums.
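To make the methodology concrete, here is a rough sketch of what such a harness measures — not the ROLV code; the shape, sparsity, and libraries (NumPy/SciPy as stand-ins for vendor kernels) are assumptions for illustration. Note how the sparse path is charged for its format-conversion cost and how correctness is checked with a tolerance test plus a checksum:

```python
import hashlib
import time

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
m, k, n, density = 1024, 1024, 256, 0.13   # stand-in shape, ~87% sparsity
A = (rng.random((m, k)) * (rng.random((m, k)) < density)).astype(np.float32)
B = rng.random((k, n)).astype(np.float32)

# Dense baseline: wall-clock over the kernel alone.
t0 = time.perf_counter()
dense_out = A @ B
t_dense = time.perf_counter() - t0

# Sparse path: conversion overhead is included in the sparse timing,
# as the baseline definitions above require.
t0 = time.perf_counter()
A_csr = sparse.csr_matrix(A)   # format conversion (counted)
sparse_out = A_csr @ B         # sparse kernel
t_sparse = time.perf_counter() - t0

# Tolerance and correctness: numerical closeness plus a reproducible
# fingerprint of the output bytes (checksums can differ across BLAS builds).
assert np.allclose(dense_out, sparse_out, rtol=1e-4, atol=1e-5)
checksum = hashlib.sha256(np.ascontiguousarray(dense_out).tobytes()).hexdigest()[:16]

print(f"dense {t_dense * 1e3:.2f} ms, sparse(+convert) {t_sparse * 1e3:.2f} ms, "
      f"speedup {t_dense / t_sparse:.2f}x, checksum {checksum}")
```

At this toy scale a sparse path may not beat a tuned dense GEMM at all; the point is the accounting — conversion cost inside the timed region, tolerance check, and a checksum that reviewers on other machines can compare.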