Like algorithm 1 (from the article) checks N possible positions at a time by matching the first and last character; algorithm 2 instead tests the 4 leading characters.
Now we just need a name, and we can add it to the smart shootout. Wonder how it compares to the best SIMD algos
The benchmark involves finding this line in that file: https://github.com/ncruces/go-sqlite3/blob/v0.26.1/sqlite3/l...
The improvements over musl are in this table: https://github.com/ncruces/go-sqlite3/tree/v0.26.1/sqlite3/l...
I'd say it's a decent improvement, if not spectacular. musl does particularly well with known length strings (memmem), whereas I'm allowed to cheat for unknown length strings (strstr) because the Wasm environment allows me to skirt some UB.
The NUL terminated string curve ball wrecks many good algorithms.
Because the SIMD (aligned, size computations, etc) setup code is usually more expensive that byte-oriented basic search for "small" strings.
Yes, it depends heavily on the hardware architecture.
These algorithms are basically brute force with very little setup, and can become quadratic for large, periodic needles.
String search algorithms that avoid quadratic behaviour (and may even be sublinear) have an higher cost setup that explores needle structure
Often the setup code is not worth it, even the SIMD complexity is not bringing enough performance in many use cases (basically, kept for niche applications, little lib statically linked for those).
That said, when you work on such code, use assembly, C code is for reference (have a look at dav1d av1 decoder).
Having implemented this, I'll claim that this is already competitive for needles as small as 2 bytes, and if the needle doesn't show up in the first couple “vector sizes” of the haystack.
Also, look at burntsushi's comment. They use this algorithm for similarly small sizes. They do use Rabin-Karp for “supremely short haystacks” (just because that lib is awesome and uses every trick in the book), but that wording should give you an idea of how small the threshold will be.
And really, how common are “supremely small” haystacks?
This is pretty common in my experience when demonstrating SIMD code. Often only the relevant bits are shown, and not a lot of effort is given to getting the boundary cases correct. It's rarely acknowledged explicitly, but I would assume it's because the boundary conditions are somewhat less interesting than the "meat" of how to use vectors in an interesting way.
- I had a very specific use-case that I really needed SIMD for and was willing to really invest a lot in it; or
- The work had already been done for me in a generic sense and I could reap the fruits of that labour (ripgrep / Rust Regex, thanks for that by the way)
The actual implementation is here: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...
The main difference with the algorithm presented here is that instead of always using the first and last bytes of the needle, a heuristic is used to try to pick two bytes that occur less frequently according to a background frequency distribution.
It ends up being quite a bit faster than just plain Two-Way or even GNU libc's implementation of `memmem`. From the root of the `memchr` repository:
$ rebar rank benchmarks/record/x86_64/2023-12-29.csv -e '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e '^libc/memmem/oneshot'
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memmem/prebuilt 2.7.1 1.07 57
rust/memchr/memmem/oneshot 2.7.1 1.39 54
libc/memmem/oneshot unknown 3.15 54
rust/memchr/memmem/twoway 2.7.1 5.26 54
The "prebuilt" benchmarks also demonstrate the API deficiencies of routines like `memmem`. Because of their API, they need to rebuild the state necessary to execute a search for the needle given. But it is often the case that the needle is invariant across repeated calls with different haystacks. So you end up doing that repeated work for no gain.What exactly do you mean by “use SIMD directly without calling out to another language”?
In some way Assembly will probably anyways be another language… but that’s a technicality.
I guess the spectrum of SIMD-related work in relation to Python is quite broad. There are projects like PeachPy, to help one write x86 Asm in Python, new Python’esque languages like Mojo, or SIMD libraries with thin CPython bindings. Do you mean one of those?
PS: I’m in the last camp. And in case you are focused on ASCII, for vowel search, have you tried StringZilla’s `find_first_of` (< https://github.com/ashvardanian/StringZilla/blob/960967d78c7...>)? I think it should perform well ;)
Which uses SIMD algos for fast HTTP parsing.
ncruces•11h ago
https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc
For known haystack lengths, and large needles, I found it useful to augment the approach with Quick Search.
https://igm.univ-mlv.fr/~lecroq/string/node19.html