Reproducibility and artifact parity

We publish reproducible artifact hashes and full environment manifests for NVIDIA, AMD, Intel CPU, AMD CPU, Apple M4, and Google TPU. We do not distribute proprietary binaries or IP; instead, the PDF lists the ROLV artifact hash (identical across platforms), the container manifests, and the exact command lines and verification tests you can run to confirm matching outputs, checksums, Nsight/perf traces, and power logs.
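Checking artifact-hash parity takes only a few lines. A minimal sketch — the file name and expected hash below are placeholders, not the published ROLV values, which you would copy from the PDF:

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare a local artifact against the hash published in the manifest."""
    return sha256_of(path) == expected_hex

# Demo on a throwaway file; in practice `path` is your downloaded artifact
# and `expected` is the hash printed in the PDF (both are placeholders here).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"demo artifact bytes")
    demo_path = f.name

expected = hashlib.sha256(b"demo artifact bytes").hexdigest()
print(verify(demo_path, expected))
os.unlink(demo_path)
```

Because the published hash is stated to be identical across platforms, the same one-liner confirms parity on every vendor's hardware.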
What we validated and why it matters

- Cross-platform parity — identical outputs and checksums across vendor GPUs, CPUs, and TPUs, eliminating measurement drift from build differences.
- Vendor comparisons — benchmarks against vendor dense kernels and vendor sparse libraries (cuBLAS/cuSPARSE, ROCm sparse, vendor BLAS on CPUs, TPU sparse primitives where available), with per-kernel wall time, memory-transfer time, and conversion overheads.
- Energy and throughput — kernel energy where measurable, end-to-end token throughput for LLM slices, and iteration times for non-LLM workloads; Nsight traces and power logs are referenced.

Standout, independently validated numbers (March 2026)

- Kimi K2.5 expert FFN (7168×2048, batch=512, ~87% sparsity) on a commodity Intel Xeon (13 GB usable RAM): dense baseline 228.38 ms → ROLV 6.36 ms per iteration (35.9×); token throughput 2,240 → 80,500 t/s; kernel energy 16,283.97 J → 350.74 J (97.8% saved).
- Finite element solver (mobile phone chassis drop test): 193.16× speedup; 99.5% energy saved (multi-CPU).
- LLM proxy matrix (4096×5120, 50% sparsity) on NVIDIA B200: 158.72× speedup; 99.37% energy saved; 40.5M t/s with an Nsight-validated tolerance harness.
- Large recommendation GEMM (Meta-style ranking): 98.76× speedup; 99.0% energy saved.
- Additional production and research workloads (GNNs, ViT attention, MusicGen, Llama shapes) are listed in the PDF with per-run sparsities and exact matrix shapes.

Methodology highlights (what to inspect in the PDF)

- Exact shapes and sparsities — matrix dimensions, sparsity pattern (random/pruned/structured), and batch sizes.
- Baseline definitions — vendor dense-kernel and vendor sparse-library baselines include conversion costs; we report both raw kernel times and end-to-end times.
- Measurement rig — wall-clock timing, Nsight kernel timelines, and device power-sampling points; CPU runs include perf counters and the exact kernel-invocation sequence.
- Tolerance and correctness — numerical tolerance checks, output checksums, and the unit tests used to validate functional equivalence.
- Repro scripts — container manifests and run_benchmark verification commands are referenced so reviewers can rerun the verification tests and compare hashes and checksums.
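To make the methodology concrete, here is a rough sketch of what such a harness measures — not the ROLV code; the shape, sparsity, and libraries (NumPy/SciPy as stand-ins for vendor kernels) are assumptions for illustration. Note how the sparse path is charged for its format-conversion cost and how correctness is checked with a tolerance test plus a checksum:

```python
import hashlib
import time

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
m, k, n, density = 1024, 1024, 256, 0.13   # stand-in shape, ~87% sparsity
A = (rng.random((m, k)) * (rng.random((m, k)) < density)).astype(np.float32)
B = rng.random((k, n)).astype(np.float32)

# Dense baseline: wall-clock over the kernel alone.
t0 = time.perf_counter()
dense_out = A @ B
t_dense = time.perf_counter() - t0

# Sparse path: conversion overhead is included in the sparse timing,
# as the baseline definitions above require.
t0 = time.perf_counter()
A_csr = sparse.csr_matrix(A)   # format conversion (counted)
sparse_out = A_csr @ B         # sparse kernel
t_sparse = time.perf_counter() - t0

# Tolerance and correctness: numerical closeness plus a reproducible
# fingerprint of the output bytes (checksums can differ across BLAS builds).
assert np.allclose(dense_out, sparse_out, rtol=1e-4, atol=1e-5)
checksum = hashlib.sha256(np.ascontiguousarray(dense_out).tobytes()).hexdigest()[:16]

print(f"dense {t_dense * 1e3:.2f} ms, sparse(+convert) {t_sparse * 1e3:.2f} ms, "
      f"speedup {t_dense / t_sparse:.2f}x, checksum {checksum}")
```

At this toy scale a sparse path may not beat a tuned dense GEMM at all; the point is the accounting — conversion cost inside the timed region, tolerance check, and a checksum that reviewers on other machines can compare.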