The core issue is the "Global Memory Tax": sequential transforms (Crop, Jitter, Normalize) force the GPU to write each intermediate tensor to VRAM and read it back for the next step. These ops are cheap arithmetic, so that memory traffic, not compute, dominates the runtime.
The Solution: I use Triton to fuse the entire augmentation pipeline into a single, highly optimized GPU kernel. This eliminates all intermediate memory I/O.
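To make the fusion idea concrete, here is a minimal CPU-only sketch in plain Python. It is not the library's kernel and not Triton code: nested lists stand in for tensors in VRAM, and the function names (`crop`, `brightness`, `normalize`, `unfused`, `fused`) and the choice of brightness scaling as the jitter are assumptions made purely for illustration. The unfused version materializes two intermediate images; the fused version reads each input pixel once and writes each output pixel once.

```python
# Illustrative only: plain Python lists stand in for GPU tensors in VRAM.
# All names here are hypothetical and not part of any real library API.

def crop(img, top, left, h, w):
    # Pass 1: writes a full intermediate image.
    return [row[left:left + w] for row in img[top:top + h]]

def brightness(img, factor):
    # Pass 2: reads the intermediate, writes another one (clamped to [0, 1]).
    return [[min(max(p * factor, 0.0), 1.0) for p in row] for row in img]

def normalize(img, mean, std):
    # Pass 3: reads the second intermediate, writes the final output.
    return [[(p - mean) / std for p in row] for row in img]

def unfused(img, top, left, h, w, factor, mean, std):
    # Three separate passes, two intermediate images kept in memory.
    cropped = crop(img, top, left, h, w)
    jittered = brightness(cropped, factor)
    return normalize(jittered, mean, std)

def fused(img, top, left, h, w, factor, mean, std):
    # One pass: each output pixel is computed from a single input read,
    # and the intermediate values only ever live in local variables
    # (the analogue of registers in a fused GPU kernel).
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            p = img[top + y][left + x]           # single read (crop = index offset)
            p = min(max(p * factor, 0.0), 1.0)   # brightness jitter, clamped
            row.append((p - mean) / std)         # normalize, single write
        out.append(row)
    return out
```

Both functions apply the same arithmetic in the same order, so they return identical results; the only difference is how many times the data round-trips through memory.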
The Results:
Video: Up to 73.7x faster than Kornia on 5D video tensors.
Image: 8.1x average speedup (up to 12x) over Torchvision v2.
It's designed as a drop-in replacement for your existing Compose pipeline. Check out the GitHub repository for the full API and detailed benchmarks.
I'm focused on developing the next phase (Resize, Rotation, etc.) and welcome any feedback on the kernels or usage patterns!
seedlingfl•1h ago