$ rebar cmp results.csv --intersection -f huge
benchmark rust/memchr/memmem/prebuilt stringzilla/memmem/oneshot
--------- --------------------------- --------------------------
memmem/pathological/md5-huge-no-hash 47.4 GB/s (1.00x) 38.1 GB/s (1.25x)
memmem/pathological/md5-huge-last-hash 40.3 GB/s (1.00x) 23.4 GB/s (1.72x)
memmem/pathological/rare-repeated-huge-tricky 40.4 GB/s (1.04x) 42.0 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match 1977.7 MB/s (1.00x) 563.3 MB/s (3.51x)
memmem/subtitles/common/huge-en-that 35.9 GB/s (1.00x) 25.3 GB/s (1.42x)
memmem/subtitles/common/huge-en-you 15.9 GB/s (1.00x) 9.5 GB/s (1.67x)
memmem/subtitles/common/huge-en-one-space 1376.4 MB/s (1.00x) 1364.0 MB/s (1.01x)
memmem/subtitles/common/huge-ru-that 29.0 GB/s (1.00x) 15.5 GB/s (1.87x)
memmem/subtitles/common/huge-ru-not 16.0 GB/s (1.00x) 3.5 GB/s (4.53x)
memmem/subtitles/common/huge-ru-one-space 2.6 GB/s (1.00x) 2.4 GB/s (1.08x)
memmem/subtitles/common/huge-zh-that 31.2 GB/s (1.00x) 23.8 GB/s (1.31x)
memmem/subtitles/common/huge-zh-do-not 19.4 GB/s (1.00x) 12.1 GB/s (1.59x)
memmem/subtitles/common/huge-zh-one-space 5.3 GB/s (1.05x) 5.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson 41.2 GB/s (1.00x) 31.2 GB/s (1.32x)
memmem/subtitles/never/huge-en-all-common-bytes 47.9 GB/s (1.00x) 37.5 GB/s (1.28x)
memmem/subtitles/never/huge-en-some-rare-bytes 43.4 GB/s (1.00x) 42.7 GB/s (1.02x)
memmem/subtitles/never/huge-en-two-space 42.2 GB/s (1.00x) 30.7 GB/s (1.37x)
memmem/subtitles/never/huge-ru-john-watson 42.2 GB/s (1.00x) 42.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson 47.6 GB/s (1.00x) 34.0 GB/s (1.40x)
memmem/subtitles/rare/huge-en-sherlock-holmes 40.8 GB/s (1.05x) 42.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock 36.7 GB/s (1.16x) 42.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle 47.7 GB/s (1.00x) 31.3 GB/s (1.52x)
memmem/subtitles/rare/huge-en-long-needle 44.5 GB/s (1.00x) 32.0 GB/s (1.39x)
memmem/subtitles/rare/huge-en-huge-needle 45.7 GB/s (1.00x) 33.4 GB/s (1.37x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 42.1 GB/s (1.00x) 42.2 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock 42.3 GB/s (1.01x) 42.9 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 46.7 GB/s (1.00x) 33.1 GB/s (1.41x)
memmem/subtitles/rare/huge-zh-sherlock 47.4 GB/s (1.00x) 42.8 GB/s (1.11x)
But I would say they are overall pretty competitive. If you want to run the benchmarks yourself, you can. First, get rebar[1]. Then, from the root of the `memchr` repository[2]:
$ rebar build -e 'rust/memchr/memmem/prebuilt' -e 'stringzilla/memmem/oneshot'
stringzilla/memmem/oneshot: running: cd "benchmarks/./engines/stringzilla" && "cargo" "build" "--release"
stringzilla/memmem/oneshot: build complete for version 3.12.3
rust/memchr/memmem/prebuilt: running: cd "benchmarks/./engines/rust-memchr" && "cargo" "build" "--release"
rust/memchr/memmem/prebuilt: build complete for version 2.7.4
$ rebar measure -e 'rust/memchr/memmem/prebuilt' -e 'stringzilla/memmem/oneshot' | tee results.csv
$ rebar rank results.csv
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memmem/prebuilt 2.7.4 1.14 57
stringzilla/memmem/oneshot 3.12.3 1.43 54
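For readers unfamiliar with rebar's rank summary: as in the cmp tables above, each benchmark assigns an engine the ratio of the fastest engine's speed to its own, so 1.00x means it was the fastest there, and the rank column is the geometric mean of those ratios across benchmarks. A minimal sketch of that aggregation step (my own illustration, not rebar's code):

    fn geometric_mean(speed_ratios: &[f64]) -> f64 {
        // Sum the logs and exponentiate: numerically safer than
        // multiplying dozens of ratios before taking the n-th root.
        let log_sum: f64 = speed_ratios.iter().map(|r| r.ln()).sum();
        (log_sum / speed_ratios.len() as f64).exp()
    }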
$ rebar cmp results.csv --intersection -f never/huge
benchmark rust/memchr/memmem/prebuilt stringzilla/memmem/oneshot
--------- --------------------------- --------------------------
memmem/subtitles/never/huge-en-john-watson 41.2 GB/s (1.00x) 31.2 GB/s (1.32x)
memmem/subtitles/never/huge-en-all-common-bytes 47.9 GB/s (1.00x) 37.5 GB/s (1.28x)
memmem/subtitles/never/huge-en-some-rare-bytes 43.4 GB/s (1.00x) 42.7 GB/s (1.02x)
memmem/subtitles/never/huge-en-two-space 42.2 GB/s (1.00x) 30.7 GB/s (1.37x)
memmem/subtitles/never/huge-ru-john-watson 42.2 GB/s (1.00x) 42.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson 47.6 GB/s (1.00x) 34.0 GB/s (1.40x)
See also: https://github.com/BurntSushi/memchr/discussions/159

There is definitely no AVX-512 support on my CPU. Which is also true for most of my users. I don't bother with AVX-512 for that reason.
Another substantial population of my users is on aarch64, which memchr has optimizations for. I don't think StringZilla does.
This is exactly my issue with targeting AVX-512. It isn't just absent on "older AVX2-only CPUs." It's also absent on many "newer AVX2-only CPUs." For example, the i9-14900K. I don't think any of the other newer Intel CPUs have AVX-512 either. And historically, whether an x86-64 CPU supported AVX-512 at all was hit or miss.
AVX-512 has been around for a very long time now, and it has just never been consistently available.
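For what it's worth, the standard way to cope with that inconsistency is runtime dispatch: compile both code paths and choose at startup. A minimal Rust sketch using std's detection macro (an illustration of the pattern, not memchr's actual dispatch code):

    #[cfg(target_arch = "x86_64")]
    fn pick_kernel() -> &'static str {
        // Checked at runtime, so a single binary behaves correctly on
        // CPUs with or without AVX-512.
        if is_x86_feature_detected!("avx512bw") {
            "avx512"
        } else if is_x86_feature_detected!("avx2") {
            "avx2"
        } else {
            "scalar fallback"
        }
    }

Of course, dispatch only pays off if the AVX-512 path is worth maintaining for the fraction of CPUs that actually have it.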
But it's fair to say that I'm mostly focusing on datacenter/supercomputing hardware, on both the x86 and Arm sides.
But realistically, is there any real-world situation where one would use this? What niche, industry, or need would benefit from this, where the dependency + setup costs are worth it? Strings just seem to be a long-solved non-issue.
Namely, if you look at DeepMind's AlphaFold 1 and 2, the bulk of the compute time is spent outside of PyTorch, running sequence alignment. Historically, that meant BLAST. More recently, in other labs, it means some of my code :)
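For readers outside bioinformatics: "sequence alignment" here means dynamic-programming kernels in the Needleman-Wunsch/Smith-Waterman family, where the unit of work is one cell update in an m-by-n table; that is what the CUPS (cell updates per second) figures later in this thread count. A scalar Levenshtein sketch with the same DP shape, just to show where the time goes:

    fn levenshtein(a: &[u8], b: &[u8]) -> usize {
        // Two rolling rows of the (m+1) x (n+1) DP table; every cell
        // update in the inner loop is one "CUP" of work.
        let mut prev: Vec<usize> = (0..=b.len()).collect();
        let mut curr = vec![0usize; b.len() + 1];
        for (i, &ca) in a.iter().enumerate() {
            curr[0] = i + 1;
            for (j, &cb) in b.iter().enumerate() {
                let substitute = prev[j] + (ca != cb) as usize;
                curr[j + 1] = substitute
                    .min(prev[j + 1] + 1) // deletion
                    .min(curr[j] + 1);    // insertion
            }
            std::mem::swap(&mut prev, &mut curr);
        }
        prev[b.len()]
    }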
What excites me in this release is the quality of the new hash functions. I’ve built many over the years but never felt they were worth sharing until now. Having two included here was a personal milestone for me, since I’ve always admired how good xxHash and aHash are and wanted to build something of similar caliber.
The new hashes should be directly useful in databases, for example improving JOIN performance. And the fingerprinting interfaces based on 52-bit modulo math with double-precision FMA units open up another path. They aren’t easy to use and won’t apply everywhere, but on petabyte-scale retrieval tasks they can make a real impact.
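To unpack the 52-bit remark: an f64 carries a 52-bit mantissa, so for moduli a few bits below 2^52 you can do exact modular multiplication on the FMA units instead of the integer pipes. A hedged sketch of that primitive (my illustration of the well-known floating-point modmul trick, not StringZilla's actual fingerprinting code):

    /// Computes a*b mod p for a, b in [0, p), with inv_p = 1.0/p and
    /// p a few bits below 2^52 so every intermediate stays exact.
    fn modmul(a: f64, b: f64, p: f64, inv_p: f64) -> f64 {
        let h = a * b;                      // rounded high half of a*b
        let l = a.mul_add(b, -h);           // exact low half, via FMA
        let q = (h * inv_p).floor();        // approximate quotient
        let mut r = (-q).mul_add(p, h) + l; // a*b - q*p, the remainder
        if r < 0.0 {
            r += p; // fold back into [0, p)
        }
        if r >= p {
            r -= p;
        }
        r
    }

The appeal is presumably port parallelism: modular fingerprints can run on the floating-point FMA ports while the integer units do other work.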
A suggestion: in the comparison table under the "AES and Port-Parallelism Recipe", it would be great to include "streaming support" and "stable output" (across OS/arch) as columns.
Also, something to beware of: some hash libraries claim to support streaming via the Hasher interface but actually return different results in streaming and one-shot modes (and have different performance profiles). I'm on mobile so I can't check at the moment, but I'm about 80% sure gxhash has at least one of these problems; it's what prevented me from using it before.
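That failure mode is easy to test for, by the way: hash the same bytes one-shot and in odd-sized chunks and compare digests. A sketch against std's Hasher trait (DefaultHasher just keeps it runnable; substitute whatever library you are vetting. Note the Hasher contract itself does not promise this property):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::Hasher;

    fn streaming_matches_oneshot<H: Hasher + Default>(data: &[u8]) -> bool {
        let mut oneshot = H::default();
        oneshot.write(data);

        let mut streaming = H::default();
        for chunk in data.chunks(3) { // odd chunk size stresses internal buffering
            streaming.write(chunk);
        }
        oneshot.finish() == streaming.finish()
    }

    fn main() {
        let data = b"check me in both modes";
        assert!(streaming_matches_oneshot::<DefaultHasher>(data));
    }

A single mismatch across chunk sizes disqualifies a library for any use where the two modes must agree.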
One micro-question on the editing: why are numbers written with an apostrophe (') as the thousands-separator [1]? I know that is used for this purpose in Switzerland and that many programming languages support it. It just seemed very strange for English text, where typically comma (,) would be used, of course.
[1]: https://en.wikipedia.org/wiki/Decimal_separator#Digit_groupi...
[2]: https://en.wikipedia.org/wiki/Apostrophe#Miscellaneous_uses_...
A digit separator for increasing the readability of long numbers was first introduced by Ada (1979-06), which used the underscore. This matched the original reason the underscore had been added to the character set by PL/I (1964-12): improving the readability of long identifiers while avoiding the ambiguity of using the hyphen for that purpose, as COBOL had done earlier (many Lisps have retained the COBOL use of the hyphen because, like COBOL, they do not normally write arithmetic expressions with infix operators).
Most programming languages that have since added a digit separator have followed Ada in using the underscore.
Thirty-five years later, C++ should have done the same, and I hate that those who updated the standard thought otherwise, causing completely unnecessary compatibility problems, e.g. when copying a big initialized array between source files written in different languages.
There was a flawed argument that the underscore could have caused parsing problems in some weird legacy programs, but those would have been no harder to solve than the parsing errors caused by the legacy use of the apostrophe in character constants (i.e. forbidding the digit separator as the first character of a number is enough to ensure unambiguous parsing).
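Concretely, the two conventions next to each other (Rust shown, since it follows Ada's underscore; the C++14 spelling is in the comment):

    fn main() {
        // Ada-style underscore separator, adopted by Rust and most
        // newer languages; C++14 instead accepts 1'000'000.
        let million = 1_000_000;
        assert_eq!(million, 1000000);
    }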
ashvardanian•4mo ago
First, it turned out that StringZilla scales further, to over 900 GigaCUPS on roughly 1000-byte inputs on an Nvidia H100. Moreover, the same performance is accessible on lower-end hardware, as the algorithm is not memory-bound; no HBM is needed.
Second, I’ve finally transitioned to Xeon 6 Granite Rapids nodes with 192 physical cores and 384 threads. On those, the Ice Lake+ kernels currently yield over 3 TeraCUPS, 3x the current Hopper kernels.
The most recent numbers are already in the repo: https://github.com/ashvardanian/StringWa.rs