There Ain’t No Such Thing as the Fastest Code: Lessons Learned in the Pursuit of the Ultimate Word Counter
Article: https://www.phatcode.net/res/224/files/html/ch16/16-01.html
Code: https://www.phatcode.net/res/224/files/html/ch16/16-05.html
This was a nice opportunity to learn about Grand Central Dispatch through AI for me as well. I knew about the page alignment and async read techniques but not on OSX.
Edit: actually you can get even faster with mmap https://github.com/healeycodes/counting-words-at-simd-speed/...
Also, when counting 0xFF bytes from a boolean etc., sub the mask; 0xFF == -1.
Benchmark 1: python3 0_mvp.py bench.txt
Time (mean ± σ): 104.739 s ± 3.982 s [User: 104.213 s, System: 0.158 s]
Range (min … max): 100.303 s … 108.005 s 3 runs
Benchmark 2: python3 1_c_regex.py bench.txt
Time (mean ± σ): 14.777 s ± 0.017 s [User: 14.563 s, System: 0.158 s]
Range (min … max): 14.759 s … 14.791 s 3 runs
Benchmark 1: pypy3 0_mvp.py bench.txt
Time (mean ± σ): 9.381 s ± 0.204 s [User: 9.110 s, System: 0.234 s]
Range (min … max): 9.245 s … 9.616 s 3 runs
Benchmark 2: pypy3 1_c_regex.py bench.txt
Time (mean ± σ): 4.296 s ± 0.031 s [User: 4.038 s, System: 0.236 s]
Range (min … max): 4.262 s … 4.324 s 3 runsWith that being said, when it works, it works great but you have to evaluate whether it's suitable on a per-project/script basis.
Idea being, this is constant in the size of the query set.
This describes a few algorithms: http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html
Both the alternative version by Geoff Langdale, and the special case for small sets, are substantially similar to the algorithms used in Hyperscan (truffle and shufti). https://github.com/intel/hyperscan
Having something hard coded for spaces can be much faster, especially since 5 of the 6 characters are a range: a wrap-around subtraction and an unsigned less-than does the first 5; an equality compare does the other.
[1]: https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...
[2]: https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...
This explanation was a bit unsatisfying. This works because 0x09 and 0x0D differ by a single bit, and 0xFB masks that bit (and only that bit) out.
If they differed by more than one bit, the fact that they & the same would be necessary but not sufficient.
kragen•5mo ago