- Comprehensive tests across read, write, copy, and latency (pointer chasing) workloads (see the pointer-chasing sketch after this list)
- Multi-platform (e.g. evaluated on multiple x86 and arm64 machines), portable (e.g. tested on BSD), and easy to run via Docker
- Efficient multi-threaded measurements via OpenMP (a rough bandwidth sketch follows the list)
- NUMA-aware thread placement and memory allocation (also illustrated in the bandwidth sketch below)
- Adaptive test sizes based on detected CPU cache sizes
- Automatic use of Transparent Huge Pages (see the THP sketch below)
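To make the latency methodology concrete, here's a minimal pointer-chasing sketch in C. It's not the tool's actual implementation, just the general idea: link the buffer into one random cycle and time a chain of dependent loads so prefetchers can't hide the latency. The buffer size, step count, and RNG are arbitrary illustration choices.

```c
/* Minimal pointer-chasing latency sketch (not the tool's actual code).
 * Compile with e.g.: cc -O2 latency_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS (1 << 24)   /* 16M pointer-sized slots, ~128 MiB on 64-bit */
#define STEPS (1 << 25)   /* number of dependent loads to time */

int main(void) {
    void **buf = malloc(SLOTS * sizeof(void *));
    size_t *idx = malloc(SLOTS * sizeof(size_t));
    if (!buf || !idx) return 1;

    /* Fisher-Yates shuffle of the visit order. */
    for (size_t i = 0; i < SLOTS; i++) idx[i] = i;
    srand(42);
    for (size_t i = SLOTS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    /* Link the slots in shuffled order: one random cycle over the whole buffer. */
    for (size_t i = 0; i < SLOTS; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % SLOTS]];

    struct timespec t0, t1;
    void **p = &buf[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < STEPS; i++)
        p = (void **)*p;               /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.2f ns (sink=%p)\n", ns / STEPS, (void *)p);
    free(idx);
    free(buf);
    return 0;
}
```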
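And a rough sketch of the OpenMP bandwidth side, covering the NUMA angle via first-touch placement. This is an illustration under assumptions (first-touch plus runtime-controlled pinning such as `OMP_PROC_BIND=true OMP_PLACES=cores`), not the benchmark's actual placement logic, which may bind threads and memory explicitly.

```c
/* Rough OpenMP read-bandwidth sketch relying on first-touch NUMA placement
 * (not the tool's actual code). Compile with e.g.: cc -O2 -fopenmp bw_sketch.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BYTES (1ULL << 30)               /* 1 GiB working set */
#define WORDS (BYTES / sizeof(uint64_t))

int main(void) {
    uint64_t *buf = malloc(BYTES);
    if (!buf) return 1;

    /* First-touch initialization with the same static schedule as the
     * measurement loop, so each thread later reads node-local memory. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < WORDS; i++)
        buf[i] = i;

    uint64_t sink = 0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sink)
    for (size_t i = 0; i < WORDS; i++)
        sink += buf[i];                  /* streaming reads */
    double t1 = omp_get_wtime();

    printf("read bandwidth: %.2f GB/s (sink=%llu)\n",
           (double)BYTES / (t1 - t0) / 1e9, (unsigned long long)sink);
    free(buf);
    return 0;
}
```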
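Finally, a Linux-only sketch of the Transparent Huge Pages item: request THP on the benchmark buffer with `madvise(MADV_HUGEPAGE)` and let the kernel decide. The tool's real allocation path may differ; treat this purely as an illustration.

```c
/* Linux-only sketch of hinting Transparent Huge Pages for a buffer
 * (not the tool's actual allocation path). The kernel may ignore the hint
 * unless THP is set to "madvise" or "always". Compile: cc -O2 thp_sketch.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;                       /* 1 GiB buffer */
    /* Anonymous mmap gives a page-aligned region, which THP requires. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that this range should be backed by huge pages. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");         /* non-fatal: 4 KiB pages */

    memset(buf, 1, len);                          /* fault the pages in */
    printf("buffer mapped at %p\n", buf);
    munmap(buf, len);
    return 0;
}
```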
Disclaimer: yes, we used LLMs for both coding and documentation updates, but we validated the results carefully against existing tools and their known shortcomings. The results look very promising so far, and we've already received encouraging early feedback from our direct network, so it's time to ask for scrutiny from the wider community /o\
Motivation: We previously benchmarked 3,000+ cloud server types using `bw_mem` from LMbench, but the results were not always consistent with the detected L1/L2/L3 cache sizes. Debugging revealed both cache-detection issues (we mostly relied on `lscpu` and are now investigating `lstopo`) and limitations of `bw_mem` itself, e.g. unexpected slowdowns on servers with 100+ vCPUs. See the "Comparison with lmbench" section of the README for more details.
Why does your feedback matter? We plan to run this across ~5,000 cloud server types, so I'd greatly appreciate your feedback on methodology, implementation correctness, example results, and any cases we might be missing before burning through a lot of precious cloud credits :)
Unfortunately, we don't have the resources to implement further metrics (e.g. prefetch-to-load distance, load buffer slots, atomic memory operation latency) in this batch, but I'd be happy to add related ideas to the roadmap.