Adding lookbehinds to rust-lang/regex

https://systemf.epfl.ch/blog/rust-regex-lookbehinds/

80•emschwartz•6mo ago

Comments

CJefferson•6mo ago

Great! I enjoyed reading through, and I'm going to come back later and read a little more carefully.

If anyone knows (to let me be lazy), is this the same regex engine used by ripgrep? Or is that an independent implementation?

cbarrick•6mo ago

Same engine as ripgrep

flaghacker•6mo ago

Yes, the `regex` crate is the regex engine used by ripgrep; both were developed by https://github.com/burntsushi.

shilangyu•6mo ago

As others have pointed out, the regex engine is the same so the benefits would trickle downstream. For example, VSCode also uses ripgrep and therefore the rust-lang/regex engine.

burntsushi•6mo ago

ripgrep plugged this gap a long time ago by providing PCRE2 support.

shilangyu•6mo ago

PCRE2 supports only bounded length lookbehinds. It is true, it is not a big improvement to have unbounded ones in rust-lang/regex, but it still feels like something.
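To make the bounded versus unbounded distinction concrete, here is a minimal sketch using Python's standard `re` module. This is only an analogy: Python's `re` is a different engine from both PCRE2 and rust-lang/regex, and it is stricter than PCRE2 in that it requires lookbehinds to have a fixed width. The helper name `compiles` is made up for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's `re` accepts the pattern, False on a compile error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind: accepted by Python's `re`.
fixed = compiles(r"(?<=abc)d")

# Unbounded lookbehind (`a+` can match any length): rejected by Python's `re`.
unbounded = compiles(r"(?<=a+)d")

print(fixed, unbounded)  # True False
```

An engine with unbounded lookbehinds, as described in the article, would accept the second pattern as well.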

burntsushi•6mo ago

Pretty minor IMO. And PCRE2 supports lots of other stuff besides look-behinds.

singron•6mo ago

I don't think there is any discussion of the snort-2 and snort-3 benchmarks, where the linear engine handily beats Python's re for once (70-80x faster). I'm guessing they are cases where backtracking is painfully quadratic in re, but it would have been nice to hear about those successes. [In the rest of the benchmarks, Python's re is 2-5x faster.]

d3m0t3p•6mo ago

Nice to see a master's thesis highlighted on the research group's page.

RadiozRadioz•6mo ago

From a user perspective, this is extremely valuable. What an amazing improvement; unbounded especially. I do hope this would make it into actual RE2 & go.

When I use regex, I expect to be able to lookbehind, so I am routinely hit by RE2's limitations in places where it's used. Sometimes the software uses the entire matched string and you can't use non-capturing groups to work around it.
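A small sketch of the workaround problem described here, written with Python's `re` purely for illustration (the sample text and patterns are made up). With a lookbehind the whole match is exactly the wanted substring; without one, the context ends up inside the whole match and the wanted part is only reachable through a group, which does not help when the calling software only looks at the whole match.

```python
import re

text = "total: $42"

# With a lookbehind, the whole match is already just the digits.
with_lookbehind = re.search(r"(?<=\$)\d+", text)

# Without one, the "$" is part of the whole match and the digits sit in group 1.
with_group = re.search(r"\$(\d+)", text)

print(with_lookbehind.group(0))  # 42
print(with_group.group(0))       # $42
print(with_group.group(1))       # 42
```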

I understand go's reasons, ReDoS etc, but the "purism" of RE2 does fly in the face of practicality to an irksome degree. This is not uncommon for go.

hnlmorg•6mo ago

The point of standard libraries is to provide sane default behaviours. Go’s regexp package is a sensible default.

For instances where you need something more sophisticated than what’s in the standard library, you reach for 3rd party modules. And there are regex libraries for Go which support backtracking et al.

There are definitely some irksome defaults in Go, but the choice of regex engine in the regexp library isn’t one of them.

masklinn•6mo ago

The authors’ previous article (linked in this one) was about doing this in re2 (https://systemf.epfl.ch/blog/re2-lookbehinds/), and they have a fork with those changes though I don’t know that they have a PR.

> the "purism" of RE2 does fly in the face of practicality to an irksome degree

It’s not purism tho. There are very practical reasons to want an FA-based engine, and if you compromise that to get additional features then the engine is pointless, you could have just used a backtracking engine in the first place.

ncruces•6mo ago

I couldn't find the link in that page, but the fork is here, and seems to be up-to-date: https://github.com/GerHobbelt/re2

If you need that from Go, you can probably use that to create a fork of this: https://github.com/wasilibs/go-re2

aurele_•6mo ago

The RE2 fork from the blog post above is this one: https://github.com/epfl-systemf/re2-lookbehinds

ncruces•6mo ago

Thank you! I had searched GitHub by a relevant snippet of code described in the blog and found that one. I guess they merged those changes?

chubot•6mo ago

What are some examples of problems where you’ve used lookbehinds?

progbits•6mo ago

While I agree this is a common golang theme, in this case I believe this decision predates the golang implementation and comes from the C++ RE2 days, no?

arp242•6mo ago

> I understand go's reasons, ReDoS etc, but the "purism" of RE2 does fly in the face of practicality to an irksome degree.

Preventing ReDoS is literally the reason RE2 exists though, so I don't think it's "purism" to not implement these things. What you want is not unreasonable, but fundamentally incompatible with the goals of RE2.

Ways to do look-behinds in linear time, as detailed in this article, are a relatively new development AFAIK(?) I don't think the RE2 people are principally opposed to integrating that if it can be done well. I suspect someone will have to write a patch though, since the main RE2 maintainer died last year.

LegionMammal978•6mo ago

> However, as a downside our lookbehinds do not support containing capture groups which are a feature allowing to extract a substring that matched a part of the regex pattern.

I wonder in what situation someone would even be tempted to put a capture group into a lookbehind expression, except unintentionally by using () instead of (?:) for grouping. Maybe in an attempt to obtain capture groups from overlapping matches? But even in that case, lookaheads would be clearer, when available.
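For reference, this is what the `()` versus `(?:)` difference inside a lookbehind looks like in a backtracking engine that does allow it. The sketch uses Python's `re` as a stand-in, with made-up sample text; it says nothing about how rust-lang/regex or the article's implementation behaves.

```python
import re

text = "xb yb"

# Capturing group inside the lookbehind: group 1 records which letter preceded "b".
capturing = re.search(r"(?<=(x|y))b", text)

# Non-capturing group: same whole match, but nothing is captured.
non_capturing = re.search(r"(?<=(?:x|y))b", text)

print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```

Both searches find the same `b`; only the first one exposes the preceding letter as a capture.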

hu3•6mo ago

Interesting. I have used look behind before without knowing their specifics. AI generated a regex and unit tests passed so I carried on with life.

Searching for a simple explanation of how it works, I found this which also explains negative look behind and look ahead. TIL:

https://www.phptutorial.net/php-tutorial/regex-lookbehind/
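As a quick reference alongside that link, here is a sketch of the four lookaround forms using Python's `re` (sample strings and patterns are made up for illustration). Note the extra `\d` in the negative forms: without it the engine can satisfy the assertion by matching only part of a number.

```python
import re

prices = "USD100 EUR200 300"
shares = "50% of 80"

# Positive lookbehind: digits directly preceded by "USD".
positive_behind = re.findall(r"(?<=USD)\d+", prices)

# Negative lookbehind: digit runs NOT preceded by an uppercase letter or a digit.
negative_behind = re.findall(r"(?<![A-Z\d])\d+", prices)

# Positive lookahead: digits directly followed by "%".
positive_ahead = re.findall(r"\d+(?=%)", shares)

# Negative lookahead: digit runs NOT followed by "%" or another digit.
negative_ahead = re.findall(r"\d+(?![\d%])", shares)

print(positive_behind)  # ['100']
print(negative_behind)  # ['300']
print(positive_ahead)   # ['50']
print(negative_ahead)   # ['80']
```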

librasteve•6mo ago

It’s odd to see such a widely adopted language as Rust only just getting some regex basics. Whereas Raku (https://raku.org) has made a strong forward step in regex syntax over PCRE, made by the same language designer with implementation of modern unicode savvy features like Grapheme and Diacritic handling that are essential to building consistent code to handle multilingual needs.

  say "Cool" ~~ /<:Letter>* <:Block("Emoticons")>/; # ｢Cool｣
  say "Cześć" ~~ m:ignoremark/ Czesc /;               # ｢Cześć｣
  say "WEIẞE" ~~ m:ignorecase/ weisse /;              # ｢WEIẞE｣
  say "หนูแฮมสเตอร์" ~~ /<:Letter>+/;                    # ｢หนูแฮมสเตอร์｣

librasteve•6mo ago

huh … guess HN blocks emojis

burntsushi•6mo ago

It's not only just getting some "regex basics." The `fancy-regex` crate has provided look-behind for years. The OP is about adapting look-behind to the linear time guarantee required by the `regex` crate.

My main focus for the `regex` crate has been on performance: https://github.com/BurntSushi/rebar

How does Raku's regex performance compare to Perl?

kibwen•6mo ago

> the linear time guarantee required by the `regex` crate

Making sure this line isn't glossed over: the point of the regex crate is that it provides linear-time guarantees for arbitrary regexes, making it safe (within reason) to expose the regex engine to untrusted input without running the risk of trivial DoS. From what I can tell, supporting lookbehinds in such a context is something that researchers have only recently described.
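To illustrate what the guarantee protects against, here is a minimal sketch of the classic pathological case in a backtracking engine, using Python's `re` as the example. The pattern, input sizes, and helper name `time_failed_match` are chosen for illustration only, and exact timings depend on the machine and Python version, so none are shown.

```python
import re
import time

# Classic pathological pattern: nested quantifiers followed by a forced failure.
PATTERN = re.compile(r"(a+)+$")

def time_failed_match(n: int) -> float:
    """Time a match attempt on n 'a's followed by 'b', which can never match."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "b" prevents any match
    return elapsed

# In a backtracking engine the time typically grows very quickly with n,
# because the engine retries many ways of splitting the run of 'a's.
for n in (10, 14, 18):
    print(n, f"{time_failed_match(n):.4f}s")
```

An engine with a linear-time guarantee is designed so that the same input sizes cannot trigger this kind of blow-up, which is why adding lookbehinds without breaking that bound is the interesting part.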

dmit•6mo ago

> making it safe (within reason) to expose the regex engine to untrusted input

Or even trusted input! https://blog.cloudflare.com/details-of-the-cloudflare-outage...

librasteve•6mo ago

I stand corrected on that - I was responding to the headline and did not appreciate that Rust has had library support beforehand. (That said, having regex around in different standard vs. crate options is not necessarily the ideal).

It's good to have a focus and I agree that Rust is all about performance and stability for a system language.

I haven't seen Raku regex performance benchmarked, but I would be surprised if it beats perl or Rust.

I wouldn't say that Raku is a good choice where speed is the most important consideration since it is a scripting language that runs on a VM with GC. Nevertheless the language syntax includes many features (hyper operators, lazy evaluation to name two) that make it amenable to performance optimisation.

masklinn•6mo ago

> That said, having regex around in different standard vs. crate options is not necessarily the ideal

What 1: both regex and fancy-regex are crates. Regex is under the rust-lang umbrella but it’s not part of the stdlib.

What 2: having different options is the point of third-party libraries. Why would you have a third-party library which is the exact same thing as the standard library?

librasteve•6mo ago

so Rust has no regex in the standard library, basic/fast regex under the rust-lang umbrella in a crate and fancy-regex is a 3rd party crate

not having different options is the point of (batteries included) standard libraries ;-)

burntsushi•6mo ago

We (I am on libs-api in addition to authoring the regex crate) specifically eschewed a batteries included standard library. The fact that `regex` was its own thing was the best thing that ever happened to it. It let me iterate on its API independent of the standard library.

librasteve•6mo ago

fair enough - there are pros and cons, but in many situations that _can_ lead to balkanisation of the language

Raku has specifically chosen the "kitchen sink" option with a massive amount of cool stuff included ... I would argue that having both regex and Grammars tightly in the core language syntax is a big win in that case (and the default choice of Str as graphemes)

with Rust and Raku that's mitigated by crate and zef respectively - both reliable, unified package manager ecosystems

SteveJS•6mo ago

I loved discovering that rust has O(n) guardrails on regex! The so-called features that break that constraint are anti-features.

Over the last two weeks I wrote a dialog-aware English sentence splitter using Claude Code to write Rust. The compile error when it stuck lookarounds in one of the regexes was super useful to me.

shawn_w•6mo ago

I don't think Philip Hazel, who wrote PCRE, has anything to do with perl or raku development.

librasteve•6mo ago

sorry I didn't know that Philip Hazel wrote PCRE ... and I certainly credit the initiative to release Perl Compatible Regular Expressions from the grip of perl

my main point is that PCRE was based on perl regexes, which were designed by Larry Wall, so he had some experience with the strengths and weaknesses of perl RE when it came to designing the RE syntax of Raku (i.e. the language formerly known as Perl 6)

quotemstr•6mo ago

This right here is one of the foundational splits in the programming community. This article is all about how cool an _implementation_ is. This comment is about some other engine's cool _syntax_. Deep versus superficial. The two camps can't stand each other.

librasteve•6mo ago

Speaking on behalf of the superficial camp, I admire the Rust core regex focus on linear performance and I can well believe that it is based on recent theoretical work.

Splitting the regex features between some core ones that meet a DoS standard and some non-core modules that do other "convenience" features makes sense as a trade off for Rust. It would not make sense in a scripting language like Raku where the weight is on coder expressiveness and making it easier / faster to write working code.

I seem to have hit a seam of intense implementation guys - and they are holding their own since they know their stuff.

I think there is room for improvement BOTH with new system language / core performance innovation AND with advancing the PCRE regex syntax (largely unchanged since the 1990s) and merging it seamlessly with standard language support for Grammars.
