It started as a personal frustration: I’d write something that ran great on CUDA, then have to rewrite or retune for ROCm or OpenCL. UHOP tries to make that portable — it detects your hardware, generates or benchmarks candidate kernels, and caches the best performer. It also supports AI-assisted kernel generation using OpenAI APIs and comes with a simple CLI for demos and benchmarking.
Right now, UHOP can:
Auto-detect available hardware backends and select the best-performing kernel for each op
Run and benchmark fused ops like conv+ReLU
Cache and reuse tuned kernels
Generate kernels dynamically via codegen (CUDA/OpenCL/Python/Triton)
There’s still a lot in progress — better backend integration, distributed optimization, and a web dashboard for visualizing results. I’m sharing it early to get feedback from folks who’ve worked on compilers, GPU runtimes, and ML infra.
Repo: github.com/sevenloops/uhop
Demo: uhop.dev
Would love thoughts on the architecture and testing approaches — and contributions are welcome, especially from anyone with NVIDIA or ROCm runtime experience.