julia> using Random, BenchmarkTools
julia> function isvowel(c)
           idx = (c | 0x20) - Int('a')
           return (0x00104111 & (1 << idx)) != 0
       end
julia> hasvowel(str) = any(c -> isvowel(Int(c)), str)
julia> @btime hasvowel(s) setup=(s=randstring("bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ0123456789", 10))
15.739 ns (0 allocations: 0 bytes)
false
some 77 thousand times faster
- Setting bit 5 high forces lowercase for the input letter
- Masking with `& 31` gives an index from 1-26 for the input letter (a = 1, z = 26)
Then you can use the `bt` instruction (in x86_64) to bit-test against the set of bits for a,e,i,o,u (after lowercasing) and return whether it matches, in a single instruction.
It came up with this, which I thought was pretty nice: https://godbolt.org/z/KjMdz99be
I'm sure there's other cool ways to test multiple vowels at once using AVX2 or AVX-512, I didn't really get that far. I just thought the bit-test trick was pretty sweet.
Chat transcript is here (it failed pretty spectacularly the first couple times, tripping over AT&T syntax and getting an off-by-one error, but still pretty good) https://chatgpt.com/share/684c8b39-a9c4-8012-8bb6-74e1f8b6d0...
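For anyone who wants to poke at the bit-test idea without reading assembly, here is a rough Python sketch of the same logic. The names `VOWEL_MASK`, `is_vowel` and `has_vowel` are mine, not taken from the godbolt link, and I added an explicit range check (an assumption on my part) so that digits and punctuation don't alias onto vowel bits.

```python
# One bit per vowel, positioned by the letter's low five bits (a=1, e=5, i=9, o=15, u=21).
VOWEL_MASK = sum(1 << (ord(v) & 31) for v in "aeiou")


def is_vowel(ch: str) -> bool:
    # Setting bit 5 (0x20) folds ASCII uppercase onto lowercase.
    lowered = ord(ch) | 0x20
    # Reject anything that is not an ASCII letter before testing the mask.
    if not (ord("a") <= lowered <= ord("z")):
        return False
    # Software equivalent of a bit-test: shift the mask and look at bit 0.
    return ((VOWEL_MASK >> (lowered & 31)) & 1) == 1


def has_vowel(s: str) -> bool:
    return any(is_vowel(ch) for ch in s)
```

In CPython this will not be fast, since every character still goes through the interpreter; it is only meant to show what the mask and the shift are doing.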
You can do extremely trivial binary checks on ASCII, UTF-8 (and most? other) encoding schemes. All vowels including W and Y contain a 1 in the lowest bit. Then by comparing bits 1-4, you can do a trivial case-insensitive comparison. You detect case by checking bit 5. 0 is upper, 1 is lower.
Use str|0x20 (decimal 32) if you want the comparison to be case-insensitive.
This works because the lowest five bits, more specifically bits two to five, of the vowels are distinct.
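Those bit claims are easy to check by hand. A small sketch in Python (the helper names are made up for illustration, and the case tricks assume the inputs are ASCII letters):

```python
def low_bit_set(ch: str) -> bool:
    # True when the character code is odd.
    return (ord(ch) & 1) == 1


def is_lower(ch: str) -> bool:
    # Bit 5 (value 0x20) is 1 for lowercase ASCII letters, 0 for uppercase.
    return (ord(ch) & 0x20) != 0


def same_letter(a: str, b: str) -> bool:
    # OR-ing in bit 5 folds both letters to lowercase before comparing.
    return (ord(a) | 0x20) == (ord(b) | 0x20)


def middle_bits(ch: str) -> int:
    # Bits 1-4 of the character code (bit 0 dropped, bit 5 and above masked off).
    return (ord(ch) >> 1) & 0xF
```

For the five vowels, `middle_bits` gives 0, 2, 4, 7 and 10, so those four bits alone are enough to tell them apart.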
The LUT is zeroed except for the indices corresponding to the vowels, each of which contains the vowel itself: lut[vowel&31]=vowel or lut[(vowel&31)>>1]=vowel
So in the comparison vs. the original str, vowels get a true and consonants get a false. Doesn't the "==" then return true if all the chars are vowels?
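On the "==" question: as far as I can tell the comparison is meant to be element-wise, so you get one true/false per character and then reduce with "any", not "all". A scalar Python sketch of the 32-entry variant, assuming byte input (`LUT`, `vowel_flags` and `has_vowel_lut` are illustrative names, and a real SIMD version would do the lookup and compare on a whole vector at once):

```python
# 32-entry table: zero everywhere except at the slots the vowels map to,
# where it stores the lowercase vowel itself.
LUT = [0] * 32
for v in b"aeiou":
    LUT[v & 31] = v


def vowel_flags(data: bytes) -> list:
    # Element-wise compare: look up by the low five bits, then check that the
    # stored byte equals the case-folded input byte.
    return [LUT[b & 31] == (b | 0x20) for b in data]


def has_vowel_lut(data: bytes) -> bool:
    # Reduce with "any": one matching position is enough.
    return any(vowel_flags(data))
```

Because `b | 0x20` is never zero, the zeroed slots can never compare equal, which is what makes consonants come out false.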
Now for each character c in the input string, simply do an array index and see if it is true (a vowel) or not. This avoids either five conditionals, or a loop over the string 'aeiou'. The vowel test is constant time regardless of the character value.
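A minimal sketch of that array-index approach in Python, assuming the string has been encoded to bytes first (`IS_VOWEL` and `has_vowel_table` are names I picked for the example):

```python
# 256-entry table indexed directly by byte value.
IS_VOWEL = [False] * 256
for v in b"aeiouAEIOU":
    IS_VOWEL[v] = True


def has_vowel_table(data: bytes) -> bool:
    # One array index per byte; no per-vowel comparisons.
    for b in data:
        if IS_VOWEL[b]:
            return True
    return False
```

With UTF-8 input, every byte of a multi-byte character is 0x80 or above, so those bytes simply index false entries; accented vowels are not detected, but they do not produce false positives either.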
It’s also going to be even more broken than TFA if any non-ascii character is present in the string.
I immediately pinned regex as the winner, and here is why:
In Python, you can almost always count on specialized functionality in the stdlib to be faster than any Python code you write, because most of it has probably been optimized in CPython by now.
Second, I have to ask, why would someone think that regular expressions, a tool specifically designed to search strings, would be slower than any other tool? Of course it's going to be the fastest! (At least in a language like Python.)
With that insight it should follow that using another implementation in C should outperform even the regex, and indeed the following simple Python method that this article for whatever reason ignored vastly outperforms everything:
def contains_vowel_find(s):
    for c in "aeiouAEIOU":
        if s.find(c) != -1:
            return True
    return False
That's because s.find(c) is implemented in C.
In my benchmark this approach is 10 times faster than using a regex:
https://gist.github.com/kranar/24323e81ea1c34fb56aff621f6c09...
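If you want to try this on your own machine, a comparison along these lines can be put together with `timeit`. This is not the code from the gist, just a minimal sketch; `contains_vowel_regex` and `compare` are names I made up, and the numbers will depend heavily on string length and on where (or whether) a vowel appears.

```python
import re
import timeit

VOWEL_RE = re.compile(r"[aeiouAEIOU]")


def contains_vowel_regex(s):
    return VOWEL_RE.search(s) is not None


def contains_vowel_find(s):
    for c in "aeiouAEIOU":
        if s.find(c) != -1:
            return True
    return False


def compare(s, number=10_000):
    """Return (regex_seconds, find_seconds) for `number` calls of each function."""
    t_regex = timeit.timeit(lambda: contains_vowel_regex(s), number=number)
    t_find = timeit.timeit(lambda: contains_vowel_find(s), number=number)
    return t_regex, t_find
```

For example, `compare("bcdfg" * 200)` times the no-vowel case on a 1000-character string.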
def any_gen_perm(s):
    return any(c in s for c in "aeiouAEIOU")
I think the crux is that you want the inner loop inside the fast C-implemented primitive to be the one iterating over the longer string, and to leave the outer loop in Python to iterate over the shorter string. With both my version and yours, the Python loop only iterates and calls into the C search loop 10 times, so there's less interpreter overhead.
I suspect that permuting the loop nest in similar variations will also see a good speed up, and indeed trying just now:
def loop_in_perm(s):
    for c in "aeiouAEIOU":
        if c in s:
            return True
    return False
seems to give the fastest result yet. (Around twice as fast as the permuted generator expression and your find implementation on my machine, with 100 and 1000 character strings.)
I really don't think this is true. If you assume that the string is ASCII, a well-optimized regex for a pattern of this type should be a tight loop over a few instructions that loads the next state from a small array. The small array should fit completely in cache as well. This is basically branchless. I expect that you could process 1 character per cycle on modern CPUs.
If the string is short (~100 characters or less, guessing), I expect this implementation to outperform the find() implementation by far as find() almost certainly will incur at least one more branch mispredict than the regex. For longer strings, it depends on the data, as the branchless regex implementation will scan through the whole string, so find() will be faster if there is a vowel early on in the string. Find still might be faster even if there are no vowels; the exact time in that case depends on microarchitecture.
For non-ASCII, things are a bit trickier, but one can construct a state machine that is not too much larger.
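To make the "loads the next state from a small array" part concrete, here is roughly what such a table-driven matcher looks like for the ASCII case, written in Python purely for illustration (a real regex engine would run this loop in native code; `TRANSITIONS` and `dfa_has_vowel` are my own names):

```python
# Two states: 0 = no vowel seen yet, 1 = vowel seen (absorbing).
# TRANSITIONS[state][byte] gives the next state.
_FROM_START = bytes(1 if chr(b) in "aeiouAEIOU" else 0 for b in range(256))
_FROM_FOUND = bytes([1]) * 256
TRANSITIONS = [_FROM_START, _FROM_FOUND]


def dfa_has_vowel(data: bytes) -> bool:
    state = 0
    for b in data:
        # No branch on the character itself: just a table load per byte.
        state = TRANSITIONS[state][b]
    return state == 1
```

Note that this version always scans the whole input, which is the trade-off against find() mentioned above.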
Flexibly. It is designed to search strings flexibly.
Why do you think we cannot achieve a faster search in a non-flexible fashion?
I know that in Go doing string operations "manually" is almost always faster than regexps. In a quick check, about 10 times faster for this case (~120ns vs. ~1150ns for 100 chars where the last is a vowel).
Of course Python is not Go, but I wouldn't actually expect simple loops in Python to be that much slower – going from "10x faster" to "2x slower" is quite a jump.
Perhaps "yeah duh obvious" if you're familiar with Python and its performance characteristics, but many people aren't. Or at least I'm not. Based on my background, I wouldn't automatically expect it.
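#include <stdbool.h>

/* Assumes the input holds only ASCII letters; after masking, digits and
   punctuation such as '!' or '5' would alias onto vowels. */
bool has_vowel(const char *s) {
    for (; *s; ++s) {
        char c = *s;
        if ((c & 1) == 0)   /* parentheses needed: == binds tighter than & */
            continue;       /* even character code: never a, e, i, o or u */
        switch (c & 0x1F) {
        case 'a' & 0x1F: return true;  /* a or A */
        case 'e' & 0x1F: return true;  /* e or E */
        case 'i' & 0x1F: return true;  /* i or I */
        case 'o' & 0x1F: return true;  /* o or O */
        case 'u' & 0x1F: return true;  /* u or U */
        default: continue;
        }
    }
    return false;
}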
Checking for consonants is about as free as it gets. 50-70% of characters in English text are consonants. By checking one bit, you can eliminate that many checks across the whole string. This should also more or less apply to any text encoding; this technique comes from an artifact of the alphabet itself. It just so happens that all English vowels (including Y and W) fall on odd indices within the alphabet.
Characters aren't magic! They're just numbers. You can do math and binary tricks on them just like you would with any other primitive type. Instead of thinking about finding letters within a string, sometimes you get better answers by asking "how do I find one of a set of numbers within this array of numbers?". It seems to me that a lot of programmers consider these to be entirely disjoint problems. But then again, I'm an embedded programmer and as far as I'm concerned characters are only ever 8 bits wide. String problems are numeric problems for me.
While I don't want to discourage people from exploring problem spaces, do understand that the problem space of ASCII has been trodden to the bedrock. Many problems like "does this string contain a vowel" have been optimally solved for decades. Your explorations should include looking at how we solved these problems in the 20th century, because those solutions are likely still extremely relevant.
That said, since they're numbers, we should use the most efficient checks for them... which are likely vectorized SIMD assembly instructions particular to your hardware. And which I've seen no one mention.
The fastest way to detect a vowel in a string on any reasonable architecture (Intel or AMD equipped with SIMD of some kind) is using 3-4 instructions which will process 16/32/64 (depends on SIMD length) bytes at once. Obviously access to these will require using a Python library that exposes SIMD.
Leaving SIMD aside, a flat byte array of size 256 will outperform a bitmap since it's always faster to look up bytes in an array than bits in a bitmap, and the size is trivial.
http://0x80.pl/notesen/2016-11-28-simd-strfind.html
* Or rather, I tried my best. I burnt out on that project because I kept jumping back and forth between making a proper DEFLATE implementation or something bespoke. The SIMD stuff was really tough and once I got it "working", I figured I got all I needed from the project and let it go.
https://github.com/rendello/compressor/blob/dev/src/str_matc...
tough•15h ago
TIL: When y forms a diphthong—two vowel sounds joined in one syllable to form one speech sound, such as the "oy" in toy, "ay" in day, and "ey" in monkey—it is also regarded as a vowel. Typically, y represents a consonant when it starts off a word or syllable, as in yard, lawyer, or beyond.
adrian_b•3h ago
While for older English words there is a complex set of rules mentioned by another poster for determining whether Y is a vowel, as mentioned by yet another poster, English also includes more recent borrowings from languages with other spelling rules for Y.
At its origin, Y was a vowel, not a consonant. It was added to the Latin alphabet for writing the front rounded vowel that is written "ü" in German, "u" in French or "y" in Scandinavian languages.
It is very unfortunate that in English, and in some other languages that have followed English, Y has been reassigned to write consonant "i". This has created a lot of problems due to the mismatches between the spelling rules of different languages. The rule that is most consistent with the older usage would have been to use J for consonant "i", like in German and other languages inspired by it. However, in many Romance languages the pronunciation of consonant "i" has changed over time, leading to three other phonetic values for the letter J: like in English (i.e. Old French), like in French/Portuguese and like in Spanish.
So the result is that both for Y and for J there are great differences in pronunciation between the European languages, and the many words using such letters that have been borrowed between languages create a lot of complexity in spelling rules.
horsawlarway•15h ago
Which is actually part of why I clicked into the article - I expected it to get into the complexity of trying to detect if 'y' was a vowel as part of the search, and instead got a mostly banal python text search article.
You can see the technical rules for when 'Y' is a vowel in English here:
https://www.merriam-webster.com/grammar/why-y-is-sometimes-a...
Y is considered to be a vowel if…
The word has no other vowel: gym, my.
The letter is at the end of a word or syllable: candy, deny, bicycle, acrylic.
The letter is in the middle of a syllable: system, borborygmus.
lcnPylGDnU4H9OF•14h ago
I think the best way to define when Y is a vowel is when it's not a consonant. Basically, if you make the sound that Y represents in the word "yes", it's a consonant. Otherwise, it's a vowel. (At least, no exceptions come to mind.)
layer8•15h ago
This is also why Unicode doesn’t have a “vowel” character property. Otherwise you could use a regex like `\p{Vowel}`.
csb6•10h ago
I had a linguistics professor say something like “Writing is parasitic on speech”
mystified5016•14h ago
The rule in English is not "all true words contain a vowel"; it's that all words contain a vowel sound.
Except the ones that don't, because English is a very messy language.