[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
[2] https://developer.arm.com/architectures/instruction-sets/int...
Nvidia's unified memory architecture may make it better, as the same memory can be shared between the CPU and GPU.
An absolute value every 16 deltas would undermine compression: a larger interval would lose even the modest performance gain, while a smaller one would lose the compressibility benefit of delta coding entirely.
That's a different matter, though there is definitely plausible motivation for an absolute value every X deltas: query/update locality at the mini-partition level. You wouldn't want to transcode a huge number of values just to access or modify a small subset.
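For concreteness, here's a minimal sketch of that mini-partition idea (my own naming and layout, with K=16 taken from the discussion above; not the OP's actual code):

    #include <stdint.h>
    #include <stddef.h>

    #define K 16  /* anchor interval; the "every 16 deltas" from above */

    typedef struct {
        int64_t *anchors;  /* anchors[b] == value[b*K], one per mini-partition  */
        int8_t  *deltas;   /* deltas[j]  == value[j+1] - value[j], narrow width */
    } delta_col;

    /* Point read: start at the block's anchor and add at most K-1 deltas. */
    static int64_t col_get(const delta_col *c, size_t i) {
        size_t  b = i / K;
        int64_t v = c->anchors[b];
        for (size_t j = b * K; j < i; j++)
            v += c->deltas[j];
        return v;
    }

An in-place update of value[i] is similarly local: it touches deltas[i-1], deltas[i], and the block's anchor when i is a block start, rather than retranscoding the whole column.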
tonetegeatinst•3h ago
almostgotcaught•3h ago
https://github.com/b0nes164/GPUPrefixSums
ashtonsix•2h ago
When we consider that delta coding (and its family) is typically applied as one step in a series of CPU-first transforms, and benefits from L1-L3 caching, we find that CPU throughput pulls far ahead of GPU-based approaches for typical workloads.
This note holds for all GPU-based approaches, not just PTX.
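To make the "series of CPU-first transforms" point concrete, the shape of such a pipeline is roughly this (a scalar stand-in with invented names and block size, not the actual kernels):

    #include <stdint.h>
    #include <stddef.h>

    /* Toy pipeline: delta decode (prefix sum) then a follow-up transform,
       processed in blocks small enough to stay in L1 between the steps. */
    #define BLOCK 4096

    static void delta_decode(int32_t *v, size_t n, int32_t *carry) {
        int32_t acc = *carry;
        for (size_t i = 0; i < n; i++) {
            acc += v[i];   /* the serial dependency GPU scans have to break */
            v[i] = acc;
        }
        *carry = acc;
    }

    static void next_transform(int32_t *v, size_t n) {
        for (size_t i = 0; i < n; i++)
            v[i] ^= v[i] >> 16;    /* placeholder for the next codec step */
    }

    void pipeline(int32_t *data, size_t n) {
        int32_t carry = 0;
        for (size_t off = 0; off < n; off += BLOCK) {
            size_t len = (n - off < BLOCK) ? n - off : BLOCK;
            delta_decode(data + off, len, &carry);  /* block now hot in L1 */
            next_transform(data + off, len);        /* ...reused immediately */
        }
    }

The second pass reads data the first pass just wrote while it is still in L1/L2, so the whole chain costs roughly one pass over DRAM.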
_zoltan_•2h ago
We've been implementing GPU support in Presto/Velox for analytical workloads, and I have yet to see a use case where we wouldn't pull ahead.
The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+ GB/s across the C2C NVLink), and on NVL8 systems like the typical A100/H100 deployments out there running real workloads, where the data comes in over the network links, you're toast without GPUDirect RDMA.
ashtonsix•2h ago
GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.
The perf numbers I shared are for one core, so multiply them by 144/72 to get expected throughput on GB/GH. Since you (apparently?) have access to this hardware, I'd sincerely appreciate it if you could benchmark my code there and share the results.
_zoltan_•1h ago
GH is readily available to anybody at $1.50 per hour on Lambda; GB is harder to get, and we're only just beginning to experiment with it.
ashtonsix•1h ago
This superchip (might be different to whichever you're referring to) has 2 CPUs (144 cores): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip...
tmostak•57m ago
Of course, prefix sums are often used within a series of other operators, so if those are already computed on the GPU, you come out further ahead still.
ashtonsix•28m ago
Let's consider this in terms of throughput per dollar so we have a fungible unit of measurement. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus, so the question is: for $1, which server architecture lets you pump more data from memory to a compute core?
It looks like you can get an H100 GPU with 16x PCIe 5.0 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.
With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.
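Back-of-the-envelope with those numbers:

    H100, 16x PCIe 5.0:    ~100 GB/s realistic / $1.99 per hour ≈  50 GB/s per ($/hr)
    m8g.8xlarge, 32 cores: ~175 GB/s realistic / $1.44 per hour ≈ 122 GB/s per ($/hr)

That's roughly 2.4x more bytes moved per dollar on the CPU instance, using the realistic figures above.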
bassp•1h ago
Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
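For anyone who just wants the shape of the algorithm from that writeup: it's the work-efficient up-sweep/down-sweep (Blelloch) scan. A serial C sketch of the two phases (on the GPU, each inner loop becomes one parallel step per tree level):

    #include <stdint.h>
    #include <stddef.h>

    /* Serial sketch of the work-efficient (Blelloch) exclusive scan.
       n must be a power of two here; the GPU version runs each inner
       loop across the threads of a block. */
    void blelloch_scan(int32_t *a, size_t n) {
        /* up-sweep (reduce): build partial sums in a balanced tree */
        for (size_t d = 1; d < n; d <<= 1)
            for (size_t i = 2 * d - 1; i < n; i += 2 * d)
                a[i] += a[i - d];

        /* down-sweep: clear the root, then push sums back down the tree */
        a[n - 1] = 0;
        for (size_t d = n >> 1; d >= 1; d >>= 1)
            for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
                int32_t t = a[i - d];
                a[i - d]  = a[i];
                a[i]     += t;
            }
    }

The writeup then covers mapping those levels onto thread blocks, avoiding shared-memory bank conflicts, and scanning arrays larger than a single block.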
ashtonsix•1h ago
It even inspired the alternative "transpose" method I describe in the OP README.