I've recently been experimenting with GPU kernel development in Triton. I couldn't find a graphics sample that combines an update loop with windowing and mouse/keyboard interactivity, so I decided to give it a shot with single-sphere ray tracing and ImGui. Unfortunately, I couldn't get a zero-copy path working with CUDA and WSLg (I am running on WSL2), so the framebuffer defaults to a host copy unless you enable the WIP flag. Any feedback or contributions toward a zero-copy pipeline (whether on a Linux box with OpenGL and CUDA on the same device, or via some shortcut for CUDA and WSLg) are always welcome! Triton optimization tips and tricks would also be appreciated. I found utilities such as kernel autotuning quite interesting, and a computer graphics application could make their benefits more visible.
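
For anyone unfamiliar with the per-pixel math behind single-sphere ray tracing, here is a minimal plain-Python sketch of the ray-sphere intersection test. It is not the Triton kernel from the project, just the scalar version of the idea; the function name `hit_sphere` and its parameters are illustrative assumptions.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Return the nearest positive ray parameter t, or None on a miss.

    Hypothetical helper for illustration: points on the ray are
    origin + t * direction, and all vectors are 3-element sequences.
    """
    # Vector from the sphere center to the ray origin.
    oc = [o - c for o, c in zip(origin, center)]

    # Coefficients of the quadratic a*t^2 + 2*half_b*t + c = 0.
    a = sum(d * d for d in direction)
    half_b = sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius

    disc = half_b * half_b - a * c
    if disc < 0.0:
        return None  # The ray's line never touches the sphere.

    sqrt_disc = math.sqrt(disc)
    # Try the nearer root first, then the farther one (origin inside sphere).
    for t in ((-half_b - sqrt_disc) / a, (-half_b + sqrt_disc) / a):
        if t > 0.0:
            return t
    return None  # Both intersections are behind the ray origin.
```

In a GPU kernel the same test would typically be evaluated for a whole block of pixels at once, with the early returns replaced by masked selects.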