Wafer exists to make performance engineers more efficient. Most of the work perf engineers do is extracting signal and turning it into the next experiment: you spend hours per kernel on interpretation and bookkeeping, figuring out which counters matter, what changed, what hypothesis you're testing, and what to try next.
Wafer is building an environment where profiling, compiler analysis, and docs are first-class context in your workflow, so iteration is cheap. Long-term, that same structured context becomes the interface for an automation layer that can read the evidence, propose a change, and rerun the loop.
NVIDIA has poured an insane amount of truth into their tooling. NCU, compiler output, SASS, the counters, the sections, the warnings, the “this is why you’re slow” breadcrumbs. Serious perf engineers already live in this stuff. The real problem is that it’s still not packaged as a tight loop. You run a profile, you get a giant report, then you spend a bunch of time translating it into a plan, mapping it back to the right lines of code, deciding what to ignore, deciding what to try next, and keeping track of what you’ve already tested. That translation step is where a ton of time goes, and it’s also the part that doesn’t scale.
We're just starting out. Today, Wafer makes that translation step cheaper by keeping the evidence and the code in one place. You can run Nsight Compute profiling from your editor and view results where you're editing, so you're not flipping between terminals, report viewers, and screenshots. You can compile CUDA and inspect PTX and SASS mapped back to your source, so "what did the compiler actually do" is something you can answer in seconds and iterate on quickly. And you can query GPU documentation from inside the editor with the exact context you're working in.
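For reference, here's roughly the manual loop those features wrap, as a Python sketch over stock CUDA toolkit commands (kernel.cu and ./my_app are placeholder names; the nvcc, cuobjdump, and ncu flags are standard toolkit options):

    import subprocess

    def run(cmd):
        print("$", " ".join(cmd))
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # What did the compiler actually do? Emit PTX, then SASS mapped to source.
    run(["nvcc", "-ptx", "kernel.cu", "-o", "kernel.ptx"])
    run(["nvcc", "-lineinfo", "-cubin", "kernel.cu", "-o", "kernel.cubin"])
    sass = run(["cuobjdump", "-sass", "kernel.cubin"])

    # Profile with Nsight Compute, then pull the report back as text.
    run(["ncu", "--set", "full", "-o", "profile", "./my_app"])
    report = run(["ncu", "--import", "profile.ncu-rep", "--page", "details"])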
What we're adding next makes that loop not just faster, but more automatic and more reproducible. We're rolling out GPU Workspaces: you keep a persistent CPU environment for your repo and dependencies, and only spin up GPU execution when you actually run something. A lot of GPU dev time is editing, debugging, and iterating on hypotheses, not burning GPU cycles, but today the workflow forces you to keep a GPU box alive just to preserve state. We want the "run the experiment" part to be on-demand and reliable, without killing your environment.
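To make the split concrete, here's a conceptual sketch in Python. Every name in it (Workspace, GpuLease, acquire_gpu) is hypothetical, invented for illustration, not Wafer's actual API; the point is only where state lives and when a GPU exists.

    class GpuLease:
        """Stand-in for an on-demand GPU instance (hypothetical)."""
        def execute(self, repo, command):
            return f"ran {command!r} in {repo} on a freshly attached GPU"
        def release(self):
            pass  # GPU returns to the pool; nothing persists here

    def acquire_gpu():
        return GpuLease()  # provisioned on demand, billed per run

    class Workspace:
        """Persistent CPU-side state: repo, deps, build cache, run history."""
        def __init__(self, repo):
            self.repo = repo
            self.history = []  # survives across GPU runs

        def run_on_gpu(self, command):
            gpu = acquire_gpu()  # GPU exists only for the duration of the run
            try:
                result = gpu.execute(self.repo, command)
            finally:
                gpu.release()    # editing and debugging never hold a GPU
            self.history.append((command, result))
            return result

    ws = Workspace("~/my-kernels")
    ws.run_on_gpu("ncu --set full ./bench")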
The bigger direction is the same theme: take the evidence perf engineers already use and make it machine-legible, so an automation layer can actually act on it. We're working on tool-driven loops: read the profile, identify the highest-leverage bottleneck, propose a concrete code change, run the diff, re-profile, and keep a history of what worked and what didn't.
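Concretely, that loop has roughly this shape. This is an illustrative sketch, not Wafer's API: the helpers are hypothetical stand-ins (a real profile() would shell out to ncu and parse the report; a real propose_patch() would be the agent), and the timings are faked so the sketch runs end to end.

    from dataclasses import dataclass
    import random

    @dataclass
    class Result:
        time_ms: float
        counters: dict

    # Hypothetical stand-ins so the loop is runnable.
    def profile(kernel):
        return Result(time_ms=random.uniform(1.0, 2.0), counters={})

    def top_bottleneck(result):
        return "uncoalesced global loads"   # would come from the ncu sections

    def propose_patch(kernel, bottleneck, history):
        return f"fix: {bottleneck}"

    def apply_patch(kernel, patch):
        return kernel + [patch]

    def optimize(kernel, budget=5):
        history, best = [], profile(kernel)
        for _ in range(budget):
            bottleneck = top_bottleneck(best)           # highest-leverage issue
            patch = propose_patch(kernel, bottleneck, history)
            candidate = apply_patch(kernel, patch)      # run the diff
            result = profile(candidate)                 # re-profile
            history.append((patch, result.time_ms, best.time_ms))
            if result.time_ms < best.time_ms:           # keep only measured wins
                kernel, best = candidate, result
        return kernel, history                          # what worked, what didn't

    kernel, log = optimize(kernel=["baseline.cu"])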
If you’ve ever wished you could hand an agent your kernel plus the profiler and compiler evidence and have it do real work instead of vibes, that’s what we’re building towards.
You can see more about us here: https://wafer.ai
Or download directly from here: VS Code: https://marketplace.visualstudio.com/items?itemName=Wafer.wa... Cursor: https://open-vsx.org/extension/wafer/wafer
Would love feedback from anyone doing CUDA, CUTLASS/CuTe, Triton, training or inference perf. If you try it and something feels slow, confusing, or missing, email me at emilio@wafer.ai