Whatever this is doing could be wrapped up in another language.
Either way, it's arguable whether that's even a good idea, since coordinating a regular thread in the same memory space, moving data to and from the GPU, and running computations on the GPU are all completely separate concerns with different latency characteristics.
For more generic GPU targets there's Triton [5], [6].
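To make the "tile-level" idea behind Triton, CUDA Tile, and tilelang concrete, here is a minimal pure-NumPy sketch of a tiled matrix multiply. This is only an illustration of the tile decomposition these languages expose, not actual Triton or CUDA Tile code: on a GPU, each (i, j) tile would map to one program instance or thread block, and the compiler would handle the per-tile memory movement. The function name and tile size are my own choices for the example.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    # Sketch of the tile decomposition that tile-level GPU languages
    # let you express: compute C one (tile x tile) block at a time,
    # accumulating partial products over K in tile-sized chunks.
    # Assumes dimensions divide evenly by `tile` for simplicity.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # each (i, j) pair is one "program"
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, K, tile):  # loop over K in tile-sized steps
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C
```

In a real tile-level kernel, only the body of the innermost accumulation is written by the programmer; the outer tiling, scheduling, and data movement are what the language and compiler take off your hands.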
[1] NVIDIA CUDA 13.1 Powers Next-Gen GPU Programming with NVIDIA CUDA Tile and Performance Gains:
https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-ne...
[2] NVIDIA Tilus: A Tile-Level GPU Kernel Programming Language:
https://github.com/NVIDIA/tilus
[3] Simplify GPU Programming with NVIDIA CUDA Tile in Python:
https://developer.nvidia.com/blog/simplify-gpu-programming-w...
[4] Tile Language:
https://github.com/tile-ai/tilelang
[5] Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations:
https://dl.acm.org/doi/10.1145/3315508.3329973
[6] Triton: