Non-commutative A|B in regex is broken garbage. Bravo for calling it out!
The issue is that backtracking "greedy match" regex engines, when they hit a disjunction, simply try the cases left to right and stop on the first match: A|B|C|D is interpreted as "try regex A; if that matches, stop; otherwise try B, ...". So if A matches, it's as if B, C and D don't exist.
Say we have the regex "c.r|carp.t", the input "carpet-odor", and are doing a prefix match. Greedy semantics will try "c.r", which matches "car", and stop there, declaring a three-character match. Longest match semantics matches all branches simultaneously and picks the longest match. (This is closely related to the "maximal munch" principle in tokenizing.) That semantics will see that the "carp.t" branch can keep matching characters after the "c.r" branch no longer matches, and report the six-character match "carpet".
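A quick Python sketch of the difference (Python's re module is a backtracking engine, so it shows the greedy behavior natively; the longest-match result is simulated here by running each branch separately and keeping the longest):

```python
import re

text = "carpet-odor"

# Backtracking / greedy: branches are tried left to right, and the engine
# stops at the first branch that matches at all.
greedy = re.match(r"c.r|carp.t", text)
print(greedy.group())  # "car" -- the three-character match

# Longest-match semantics, simulated: match every branch independently
# and keep the longest result.
branches = [r"c.r", r"carp.t"]
matches = [m.group() for b in branches if (m := re.match(b, text))]
print(max(matches, key=len))  # "carpet" -- the six-character match
```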
Longest match semantics jibes with a set-theoretical interpretation of regex, and that's why the | operator commutes. R1|R2 means the union of the strings matched by R1 and R2, and so R1|R2 is the same as R2|R1.
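You can see the non-commutativity directly in a backtracking engine: swapping the branches changes the answer, which under a set-union interpretation it never could.

```python
import re

s = "carpet-odor"

# Same two branches, two orders, two different answers:
first = re.match(r"c.r|carp.t", s).group()
second = re.match(r"carp.t|c.r", s).group()
print(first)   # "car"
print(second)  # "carpet"
# Under longest-match (set-union) semantics both orders would report "carpet".
```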
Assuming input is "ab",
/(a)b|a(b)/ produces \1=a, \2=<missing>
/a(b)|(a)b/ produces \2=<missing>, \1=b
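Reproducing this in Python, where a group whose branch didn't participate comes back as None:

```python
import re

# Input "ab": which group captures depends entirely on branch order.
m1 = re.match(r"(a)b|a(b)", "ab")
print(m1.groups())  # ('a', None) -- first branch wins; \2 never participates

m2 = re.match(r"a(b)|(a)b", "ab")
print(m2.groups())  # ('b', None) -- now \1 holds "b" and \2 is missing
```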
Probably the easiest way to test this yourself is with GNU sed. Sed makes one choice. I'd guess that GP would call this broken garbage too, and I'd agree. Regular expressions have all these nice theoretical properties, like closure under all the boolean operations and linear-time matching, but these nice properties get trashed by features that don't mesh or aren't fully thought through.
In this case (thinking about capturing groups and commutativity), one property of regular expressions is that for each one there is a machine that can do the linear-time matching -- a DFA. Even if the regular expression contains not-mutually-exclusive alternations, when it gets compiled to a DFA, the matching procedure is deterministic by construction. I can imagine a way to integrate capturing start and end actions into the transition edges of the DFA. The right thing to do is to perform capturing on all the matching alternands, not just the first. You lose the ability to number the capturing groups left to right, but instead you should lay them out in a tree that follows the concatenation/alternation structure of the expression.
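A toy illustration of that idea (this is not a tagged DFA, just a Python simulation: run every alternand, keep the captures of all branches that match, and key them by the alternation structure; the branch names are made up for the example):

```python
import re

def match_all_branches(branches, text):
    """Collect captures from ALL alternands that match, not just the first."""
    results = {}
    for name, pattern in branches.items():
        m = re.fullmatch(pattern, text)
        if m:
            results[name] = m.groups()
    return results

# Both alternands of /(a)b|a(b)/ match "ab", so both report their captures:
captures = match_all_branches({"left": r"(a)b", "right": r"a(b)"}, "ab")
print(captures)  # {'left': ('a',), 'right': ('b',)}
```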
Just write /(a)b/ or /a(b)/ or /ab/ or /(ab)/ or /(a)(b)/ which mean five slightly different things.
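The five accept exactly the same strings and differ only in what they capture; in Python:

```python
import re

# All five patterns match "ab"; only the capture structure differs.
results = {pat: re.match(pat, "ab").groups()
           for pat in [r"(a)b", r"a(b)", r"ab", r"(ab)", r"(a)(b)"]}
for pat, groups in results.items():
    print(pat, groups)
# (a)b ('a',)
# a(b) ('b',)
# ab ()
# (ab) ('ab',)
# (a)(b) ('a', 'b')
```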
It is not reasonable to expect the user to manually disambiguate every regex.
It is possible to support these semantics with an automata-based engine (see RE2; and pity junyer isn’t here to read this article, he loved derivatives), but I can’t say I recommend it. The benefit, of course, is then you can peg your test suite to PCRE.
Edit: answering myself, this seems to be (at least partially) merged into the .NET runtime itself: https://github.com/dotnet/runtime/pull/102655
kazinator•23h ago
Nope; I did it in TXR in early 2010:
kazinator•5h ago
No different from what was done in C#.
burntsushi•10h ago
"industrial" in this context to me means something like, "suitable for production usage in a broad number of scenarios."
IDK if RE# lives up to that, but their benchmark results are impressive. This paper is a year old. In which production systems is RE# used?