[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
[2] https://developer.arm.com/architectures/instruction-sets/int...
    s1 = vaddq_u32(s1, vextq_u32(z, s1, 2));
    s1 = vaddq_u32(s1, vdupq_laneq_u32(s0, 3));

would be more like this:

    s1.xy += s1.zw;
    s1 += s0.w;

And getting anything like an overloaded vaddv in standard C takes something like:

    #define vaddv(A) _Generic((A), \
        int8x8_t:    vaddv_s8,     \
        uint8x8_t:   vaddv_u8,     \
        int8x16_t:   vaddvq_s8,    \
        uint8x16_t:  vaddvq_u8,    \
        int16x4_t:   vaddv_s16,    \
        uint16x4_t:  vaddv_u16,    \
        int16x8_t:   vaddvq_s16,   \
        uint16x8_t:  vaddvq_u16,   \
        int32x2_t:   vaddv_s32,    \
        uint32x2_t:  vaddv_u32,    \
        float32x2_t: vaddv_f32,    \
        int32x4_t:   vaddvq_s32,   \
        uint32x4_t:  vaddvq_u32,   \
        float32x4_t: vaddvq_f32,   \
        int64x2_t:   vaddvq_s64,   \
        uint64x2_t:  vaddvq_u64,   \
        float64x2_t: vaddvq_f64)(A)
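A quick usage sketch (illustrative, assuming <arm_neon.h>):

    #include <arm_neon.h>

    uint32_t demo(void) {
        uint32x4_t v = vdupq_n_u32(3);
        return vaddv(v);  // _Generic picks vaddvq_u32, which sums the four lanes: 12
    }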
while in GNU C you can in fact use normal arithmetic and indexing (but not swizzles) on vector types.
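For example (GCC/Clang vector extensions; the type and helper names here are just illustrative):

    #include <stdint.h>

    typedef uint32_t u32x4 __attribute__((vector_size(16)));

    u32x4 step(u32x4 s0, u32x4 s1) {
        u32x4 shifted = { 0, 0, s1[0], s1[1] };  // no swizzle syntax, so lanes are built by index
        s1 += shifted;                           // ordinary operators work on vector types
        s1 += s0[3];                             // a scalar operand is broadcast to every lane
        return s1;
    }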
Nvidia's unified memory architecture may make it better, as the same memory can be shared between the CPU and GPU.

An absolute value every 16 deltas would undermine compression: a greater interval would lose even the modest performance gain, while a smaller interval would completely lose the compressibility benefits of delta coding.
It's a different matter, although there is definitely plausible motivation for absolute values every X deltas: query/update locality (mini-partition-level). You wouldn't want to transcode a huge number of values to access/modify a small subset.
Even if one is willing to adopt such an encoding scheme, you'd still want to optimize what you have here anyway. It also doesn't help, as mentioned, if the goal is actually latency of small streams rather than throughput of large ones.
I wouldn't expect a strictly linear speed-up due to contention on the memory bus, but it's not as bad as flat-lining after engaging 2-3 cores. On most AWS Graviton instances you should be able to pull ~5.5 GB/s per-core even with all cores active, and that becomes less of a bottleneck when considering you'll typically run a sequence of compute operations between memory round-trips (not just delta).
Part 3 explores a straightforward idea that can be used to improve performance of bitstream decoding - decode multiple streams at once.
> if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once?
> So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine.
> If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams
For delta coding, this suggests that instead of resetting the stream with an absolute value every X deltas for some small X, we could use a much larger X: partition the input into coarse chunks, delta-encode each chunk, and then rewrite the decoder to decode a pair (or more) of the resulting streams at once.
Unclear if that would help at all here, or if there's already enough independent work for the CPU to keep busy with.
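Roughly the shape I'm picturing (a made-up sketch, with one absolute base per coarse chunk):

    #include <stddef.h>
    #include <stdint.h>

    // Decode two delta-encoded chunks in one loop: the two accumulator chains are
    // independent, so the CPU can overlap them instead of stalling on one serial chain.
    void decode_two_chunks(const int32_t *deltas, size_t chunk_len,
                           int32_t base0, int32_t base1, int32_t *out)
    {
        const int32_t *d0 = deltas;              // chunk 0's deltas
        const int32_t *d1 = deltas + chunk_len;  // chunk 1's deltas
        int32_t acc0 = base0, acc1 = base1;
        for (size_t i = 0; i < chunk_len; i++) {
            acc0 += d0[i];
            acc1 += d1[i];
            out[i] = acc0;
            out[chunk_len + i] = acc1;
        }
    }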
Part 1 https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...
Part 2 https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far...
Part 3 https://fgiesen.wordpress.com/2018/09/27/reading-bits-in-far...
"FastPFoR is well-established in both industry and academia. However, on our target platform (Graviton4, SIMDe-compiled) it benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at ~10.8 GB/s."
I thought the first bit was a typo, but it was correct: the naive approach was faster than the "better" method. Another demonstration of how important it is to actually benchmark on the target platform!
Fun project btw, HPC is always interesting.
tonetegeatinst•3mo ago
almostgotcaught•3mo ago
https://github.com/b0nes164/GPUPrefixSums
ashtonsix•3mo ago
When we consider that delta coding (and family) is typically applied as one step in a series of CPU-first transforms and benefits from L1-L3 caching, we find CPU throughput pulls far ahead of GPU-based approaches for typical workloads.
This note holds for all GPU-based approaches, not just PTX.
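To illustrate the caching point (a hypothetical pipeline, not my actual code): run every stage on a cache-sized block while it's still hot, instead of streaming the whole array through DRAM once per transform.

    #include <stddef.h>
    #include <stdint.h>

    enum { BLOCK = 4096 };  // 16 KiB of int32_t, comfortably inside L1/L2

    static void delta_decode(int32_t *v, size_t n, int32_t *carry) {
        for (size_t i = 0; i < n; i++) { *carry += v[i]; v[i] = *carry; }
    }

    static void next_transform(int32_t *v, size_t n) {   // stand-in for a later stage
        for (size_t i = 0; i < n; i++) v[i] ^= v[i] >> 1;
    }

    void pipeline(int32_t *data, size_t n) {
        int32_t carry = 0;
        for (size_t off = 0; off < n; off += BLOCK) {
            size_t len = (n - off < BLOCK) ? (n - off) : BLOCK;
            delta_decode(data + off, len, &carry);  // block stays cache-resident...
            next_transform(data + off, len);        // ...for the next stage
        }
    }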
_zoltan_•3mo ago
We've been implementing GPU support in Presto/Velox for analytical workloads and I'm yet to see a use case where we wouldn't pull ahead.
The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+ GB/s across the C2C NVLink), and on NVL8 systems like the typical A100/H100 deployments out there, doing real workloads where the data comes in over the network links, you're toast without GPUDirect RDMA.
ashtonsix•3mo ago
GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.
The perf numbers I shared are for one core, so multiply them by 144 or 72 to get expected throughput on GB or GH respectively. As you (apparently?) have access to this hardware, I'd sincerely appreciate it if you could benchmark my code there and share the results.
_zoltan_•3mo ago
GH is readily available to anybody at $1.50 per hour on Lambda; GB is harder to get and we're only just beginning to experiment with it.
ashtonsix•3mo ago
This superchip (might be different to whichever you're referring to) has 2 CPUs (144 cores): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip...
tmostak•3mo ago
Of course prefix sums are often used within a series of other operators, so if these are already computed on GPU, you come out further ahead still.
ashtonsix•3mo ago
Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus so the question is: for $1 which server architecture lets you pump more data from memory to a compute core?
It looks like you can get an H100 GPU with PCIe 5.0 x16 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.
With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.
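Back-of-envelope with just those numbers:

    H100 over PCIe 5.0 x16:  ~100 GB/s / $1.99/hr ≈ 50 GB/s per $/hr
    m8g.8xlarge:             ~175 GB/s / $1.44/hr ≈ 122 GB/s per $/hr

i.e. roughly 2.4x the host-memory throughput per dollar on the CPU instance, ignoring everything other than the memory link.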
_zoltan_•3mo ago
bassp•3mo ago
Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
ashtonsix•3mo ago
It even inspired the alternative "transpose" method I describe in the OP README.
TinkersW•3mo ago
I can upload an average of about 3.7 MB per millisecond to my GPU (PCIe gen 3, x8), but it can be spiky and sometimes takes longer than you might expect.
By comparison, a byte-based AVX2 prefix scan can pretty much run at the speed of DRAM, so there is never any reason to transfer to the GPU.
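The kind of scan I mean, roughly (a sketch assuming 8-bit wrapping sums and a length that's a multiple of 32):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    void prefix_sum_u8_avx2(uint8_t *v, size_t n) {
        __m256i carry = _mm256_setzero_si256();  // running total, broadcast to every byte
        for (size_t i = 0; i < n; i += 32) {
            __m256i x = _mm256_loadu_si256((const __m256i *)(v + i));
            // log-step inclusive scan within each 128-bit lane
            x = _mm256_add_epi8(x, _mm256_slli_si256(x, 1));
            x = _mm256_add_epi8(x, _mm256_slli_si256(x, 2));
            x = _mm256_add_epi8(x, _mm256_slli_si256(x, 4));
            x = _mm256_add_epi8(x, _mm256_slli_si256(x, 8));
            // propagate the low lane's total into the high lane
            __m128i lo_total = _mm_shuffle_epi8(_mm256_castsi256_si128(x), _mm_set1_epi8(15));
            x = _mm256_add_epi8(x, _mm256_inserti128_si256(_mm256_setzero_si256(), lo_total, 1));
            // fold in the carry from previous 32-byte blocks, then update it
            x = _mm256_add_epi8(x, carry);
            carry = _mm256_set1_epi8((char)_mm_extract_epi8(_mm256_extracti128_si256(x, 1), 15));
            _mm256_storeu_si256((__m256i *)(v + i), x);
        }
    }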