I am interested in that area, and have been reading up on it and learning about it.
In practice most recursive descent parsers use if-else liberally. Thus, they effectively work like PEGs where the first match wins (but without the limited backtracking of PEGs). They are deterministic in the sense that the implementation always returns a predictable result. But they are still ambiguous in the sense that this behavior might not have been planned by the language designer, and the ambiguity may not have been resolved the way the programmer expected.
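To make that concrete, here's a toy Python sketch (purely illustrative, not taken from any real parser) of how the order of an if/elif chain silently decides the parse, exactly like PEG ordered choice:

    # Toy sketch: an if/elif chain over overlapping alternatives behaves like
    # PEG ordered choice -- whichever branch is tried first wins, whether or
    # not that was what the language designer intended.
    def parse_statement(tokens, pos):
        tok = tokens[pos]
        if tok == "print":                 # tried first, so it always wins...
            return ("print_stmt", pos + 1)
        elif tok.isidentifier():           # ...even though "print" would also
            return ("expr_stmt", pos + 1)  # match this branch
        else:
            raise SyntaxError("unexpected token %r" % tok)

    print(parse_statement(["print"], 0))   # ('print_stmt', 1) -- never an expr_stmt
    print(parse_statement(["foo"], 0))     # ('expr_stmt', 1)

Swapping the two branches changes which tree you get, and nothing warns you.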
It has been my experience that if you have an LALR parser that reports no errors at generation time, and you add something such that there are still no errors, you've not ruined any existing syntax. That could be a theorem.
If the language doesn't fit this LL(1) + operator precedence mold then I would not use a recursive descent parser.
You need to have a recursive call to the same rule or a "parent" rule, followed by an optional token match (note that this is already more than a simple branch on the current token since it is effectively a 1 token backtrack).
If there's any extra token match in between (like mandatory {}) you've already dodged the problem.
So I agree a mistake is possible, but "easy" I would not call it. It only appears in very specific conditions. The issue is way more prevalent in general PEGs.
In between "easiest to get started with" and "what production-grade systems use", there is "easy to actually finish a medium-sized project with." I think LR parsers still defend that middle ground pretty well.
That was part of my question I think. I wouldn't have been able to tell you that the dominant paradigm being argued against was LR parsers, because I've never come across even one that I'm aware of (I've heard of them, but that's about it). Perhaps it's academia where they're popular?
LR however is more powerful, though this mostly matters if you don't have access to automatic grammar rewriting for your LL. More significantly, however, there's probably more good tooling for LR (or perhaps: you can assume that if tooling exists, it is good at what it is designed for); one problem with LL being so "simple" is that there's a lot of bad tooling out there.
The important things are 1. that you meaningfully eliminate ambiguities (which is easy to enforce for LR and doable for LL if your tooling is good), and 2. that you keep linear time complexity. Any parser other than LL/LR should be rejected because it fails at least one of these, and often both.
Within the LL and LR families there are actually quite a few members. SLR(1) is strong enough to be interesting but too weak for anything I would call a "language". LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1) (though note that modern tooling can do an optimistic fallback, avoiding the massive blowups of ancient LR tools). SLL(1) I'm not personally familiar with. X(k), where X is one of {SLL, LL, SLR, LALR, LR} and where k > 1, are not very useful; k=1 suffices. LL(*) however should be avoided due to backtracking, but in some cases consider whether you can parse token trees first (this is currently poorly represented in the literature; you want to be doing some form of this for error recovery anyway - automated error recovery is a useless lie) and/or defer the partial ambiguity until the AST is built (often better for error messages anyway, independent of using token trees).
The idea that you're going to hand-roll a parser generator and then use that to generate a parser and the result is going to be less buggy than just hand-rolling a recursive descent parser, screams "I've never written code outside of an academic context".
SQLite, perhaps the most widely deployed software system, takes this approach.
> The Lemon LALR(1) Parser Generator
> The SQL language parser for SQLite is generated using a code-generator program called "Lemon".
> ...
> Lemon was originally written by D. Richard Hipp (also the creator of SQLite) while he was in graduate school at Duke University between 1987 and 1992.
Here are the grammars, if you're curious.
But I do think the wider point is still true, that there can be real benefit to implementing 2 proper layered abstractions rather than implementing 1 broader abstraction where the complexity can span across more of the problem domain.
SQLite isn't some kind of universal template, and I'm not saying people should copy it or that recursive descent is a bad choice. But empirically, parser generators are used in real production systems. SQLite is unusual in that they also wrote the parser generator, but otherwise it is in good company. Postgres uses Bison, for example.
Additionally, I think the fact that Lemon started as a personal learning project in grad school (as academic a project as it gets) and evolved into a component of what is probably the most widely deployed software system of all time shows that this distinction between what is academic and what is practical isn't all that meaningful to begin with. What's academic becomes practical when the circumstances are right. Better to evaluate a technique in the context of your problem than to prematurely bin things into artificial categories.
Sure, but adding the complexity of a parser generator doesn't help with that complexity in most cases.
[General purpose] programming languages are a quintessential example. Yes, a compiler or an interpreter is a very complex program. But unless your programming language needs to be parsed in multiple languages, you definitely do not need to generate the parser in many languages like SQLite does. That just adds complexity for no reason.
You can't just say "it's complex, therefore it needs a parser generator" if adding the parser generator doesn't address the complexity in any way.
Creating abstractions does decrease complexity if one (or more) of the following is true:
- The abstraction generates savings in excess of its own complexity
- The abstraction is shared by enough projects to amortize the cost of writing/maintaining it to a tolerable level
- There are additional benefits like validating your grammars are unambiguous or generating flow charts of your syntax in your documentation, amortizing the cost across different features of the same project
It's up to you as the implementer to weigh the benefits and costs. If you choose to use recursive descent, more power to you. (For what it's worth, I personally use parser combinators to split the difference between writing grammars and hand-rolling parsers. But I've used parser generators before and found them helpful.)
If your goal is simply to reduce bugs--not something more complex like generating parsers in a bunch of languages--then hand rolling a parser generator and then using it to generate your parser [singular] is not a path to achieving your goals. That's what I said, and that's actually just true, which you probably know.
This is not an invitation to bring up irrelevant, exceptional cases, it's the rule of thumb you should operate on. Put another way, don't add layers when there isn't a reason to do so. If there is a reason to do so, have at it. Obviously.
In a meta sense, it's pretty socially inept to jump in with corrections like this. In a complex field like programming, of course there are exceptions, and it's disrespectful to the group of professionals in the room to assume that they don't know about the exceptions. I'm guilty of this myself: it's because I was brought up being praised for knowing things, so I want to demonstrate that I know things. But as an adult, I had to learn that I'm not the only knowledgeable person in the room, and it's rude to assume that I am.
The only time I have used this myself was an expat-style transformer for terraform (HCL). We had a lot of terraform and they kept changing the language, so I would build a fixer to make code written for, say, 0.10 work with 0.12, and then again for 0.14. It was very fun and let us keep updating to newer terraform versions. Pretty simple language except for distinguishing quoted blocks from non-quoted.
I hear stories like this and I just wonder how we got here. Like, did this work provide any monetary value to anyone? It sounds like your team just got way too lost in the abstractions and forgot that they were supposed to make a product that did something, ostensibly something that makes money.
I mean, I guess if you can persuade people to give you money to do something, it's profitable. :shrug:
This work made it easier to keep terraform in sync with AWS, and it made maintenance easier too (e.g. adding a new policy to existing S3 buckets was just editing the S3 bucket module and re-applying to every account). The parser was way easier than editing the files by hand (especially since I had so much terraform as a test bed of data).
What isn't right is changing the language and policies constantly, and if editing your configs is so difficult that writing a parser to do it was easier, one begins to think that Terraform wasn't the right automation tool.
Your comment is quite funny as hand-rolling a recursive descent parser is the kind of thing that is often accused of being a) bug-prone, b) only done in academic environments.
I'm just happy when parsing isn't being done with some absurdly long regex with no documentation.
The reason there are so many parser generators is largely that we keep desperately looking for a way of writing one that isn't sheer pain in production use.
Personally I've also written a parser-generator for XML in C# to overcome some of the odd limitations of Microsoft's one when used in AOT contexts.
Hand-rolling is easy if the grammar is small. The larger it gets (and video codecs are huge!) the more you want something with automatic consistency.
But, the vast majority of parsers I've written didn't have this requirement. I needed to write one parser in one language.
It's important to note that ambiguities are something which exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed (unless that language is not well-specified, but then all bets are off and it is meaningless to speak of parsing), because they can be eliminated by side-conditions.
For example, one famous ambiguity is the dangling 'else' problem in C. But this isn't an actual ambiguity in the C language: the language has a side-condition which says that 'else' matches to the closest unmatched 'if'. This is completely unambiguous and so a recursive descent parser for C simply doesn't encounter this problem. It is only because parser generators, at least in their most academic form, lack a way to specify this side-condition that their proponents have to come up with a whole theory of "ambiguities". (Shockingly, Wikipedia gets this exactly right in the article on dangling else which I just thought to look up: "The dangling else is a problem in programming of parser generators".)
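To illustrate (a toy Python sketch, not the actual C grammar), in recursive descent the side-condition is just the natural shape of the code:

    # Toy sketch: a recursive descent statement parser where the dangling-else
    # "ambiguity" never arises. The innermost pending call to parse_stmt is the
    # one that sees the upcoming 'else' and consumes it, so 'else' binds to the
    # nearest unmatched 'if' with no special handling at all.
    def parse_stmt(toks, i):
        if toks[i] == "if":
            i += 1                          # skip 'if' (condition omitted for brevity)
            then_branch, i = parse_stmt(toks, i)
            if i < len(toks) and toks[i] == "else":
                else_branch, i = parse_stmt(toks, i + 1)
                return ("if", then_branch, else_branch), i
            return ("if", then_branch, None), i
        return ("stmt", toks[i]), i + 1     # any other token is a simple statement

    tree, _ = parse_stmt(["if", "if", "a", "else", "b"], 0)
    print(tree)   # ('if', ('if', ('stmt', 'a'), ('stmt', 'b')), None)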
Likewise goes the problem of left-recursion. Opponents of recursive descent always present left-recursion as a gotcha which requires some special handling. Meanwhile actual programmers writing actual recursive descent parsers don't have any idea what these academics are talking about because the language that they're parsing (as it exists in their mind) doesn't feature left-recursion, but instead iteration. Left-recursion is only introduced in service of restricted formal grammars in which recursion is the only available primitive and iteration either doesn't exist or is syntactic sugar for recursion. For the recursive descent user, iteration is a perfectly acceptable primitive. The reason for the discrepancy goes back to side-conditions: iteration requires a side-condition stating how to build the parse tree; parser generators call this "resolving the ambiguity" because they can't express this in their restricted grammar, not because the language was ambiguous.
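For instance, left-associative subtraction, which a formal grammar would write with left recursion (expr := expr '-' num), is just a loop in a hand-written parser (toy Python sketch, illustrative names):

    # Toy sketch: left-associative subtraction parsed with iteration.
    # The loop body is the "side-condition": fold the tree to the left.
    def parse_expr(toks, i):
        left, i = parse_num(toks, i)
        while i < len(toks) and toks[i] == "-":
            right, i = parse_num(toks, i + 1)
            left = ("-", left, right)       # build left-to-right
        return left, i

    def parse_num(toks, i):
        return ("num", int(toks[i])), i + 1

    tree, _ = parse_expr(["7", "-", "3", "-", "1"], 0)
    print(tree)   # ('-', ('-', ('num', 7), ('num', 3)), ('num', 1))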
How do you specify your language “well” when you don’t know if your grammar is unambiguous? Determining whether a grammar is ambiguous is famously undecidable in the general case. So how do you decide, if you don’t restrict your grammar to one of the decidable forms checkable by parser generators? You can add some disambiguation rules, but how do you know they cover all ambiguities?
We use formal systems exactly to make sure that the language is well-defined.
* use proper rules rather than cramming everything into "statement", or
* specify explicit precedence rules, which is just a shortcut for the above (also skipping useless reductions)
Doing this is ubiquitous with parser generators when dealing with vaguely Algol-like languages, and is no different than the fact that you have to do the same thing for expressions.
Only partially true. How do you define the language to be parsed? It's with a grammar. If the grammar can yield two different parse trees for the same input, it's ambiguous. In LR parlance, if your grammar is ambiguous because of a shift-reduce conflict, it's because you stuffed up your grammar.
That's a real problem. It's the difference between parsing "1 + 2 / 3" as "(1 + 2) / 3" and "1 + (2 / 3)". The two interpretations yield very different outcomes. The reason you see so many people here say "use a generated LL or LR parser" is that the generator will find and report that mistake. It's a very easy mistake to make, and you won't realise you've made it.
Then there are what LR calls reduce-reduce conflicts. Yes, that may happen because the LR parser can't look far enough ahead. Or, it may again be because you've stuffed up your grammar. Or it may be because the language you have in your head really isn't context free. Perl is in the last category. They claim to have got around it by saying it's a "do what I mean" language. Fine, but it turns out in some cases what they think a string obviously means doesn't agree with what I thought it obviously meant.
False. This is how you define a language _to a parser generator_, but it is not how humans (and/or developers) define languages to each other.
> you won't realise you've made it
This is literally impossible in a recursive descent parser. I'm not saying getting it wrong is impossible, of course not. But what you literally cannot do (without concerted intentional effort) is make it ambiguous. Your parser will parse one first, or the other first, or either one left-to-right; and you will know which of these it does by reading the code.
An LR(1) parser can have many more states in its DFA than LALR(1). That was important back in the 1970s when I was fighting for every byte of RAM, but now it's a total non-issue. I don't know why you would bother with LALR(1) now if you had an LR(1) parser generator.
It seems to be mainly academics and others interested in parsing theory, and those who like complexity for the sake of complexity.
In OCaml, a language highly suited for developing languages in, that de facto standard is the Menhir LR parser generator. It's a modern Yacc with many convenient features, including combinator-like library functions. I honestly enjoy the work of mastering Menhir, poring over the manual, which is all one page: https://gallium.inria.fr/~fpottier/menhir/manual.html
What makes OCaml suited for that?
These days I just handroll recursive descent parsers with a mutable stream record, `raise_notrace` and maybe some combinators inspired by FParsec for choices, repetition and error messages. I know it's not as rigorous, but at least it's regular code without unexpected limitations.
I get really annoyed when people still complain about YACC while ignoring the four decades of practical improvement that Bison has given us if you bother to configure it.
https://pypi.org/project/pybison/ , or its predecessors such as https://pypi.org/project/ply/ ?
But yes, the decidedly non-traditional https://github.com/pyparsing/pyparsing/ is certainly more popular.
By that, do you mean parser combinators?
Actually I wish this generalization of list comprehensions had been taken up by Haskell or other languages. Haskell decided on the do notation while Python users these days seem to shun the feature.
On a side note, I do use Python list comprehensions, and like them.
https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/mona...
1. Replace any expression that's within parentheses by its parse tree by using recursion
2. Find the lowest precedence operator, breaking ties however you'd like. Call this lowest precedence operator OP.
3. View the whole unparsed expression as `x OP y`
4. Generate a parse tree for x and for y. Call them P(x) and P(y).
5. Return ["OP", P(x), P(y)].
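A rough Python sketch of steps 2-5 (step 1, recursing into parenthesized sub-expressions first, is left out, and the operator set and precedence values are just examples):

    # Rough sketch of steps 2-5. On ties we take the rightmost operator, which
    # gives left associativity for operators of the same precedence.
    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse(tokens):
        ops = [i for i, t in enumerate(tokens) if t in PREC]
        if not ops:                       # a lone operand, nothing left to split
            return tokens[0]
        # Step 2: find the lowest-precedence operator, OP.
        lowest = min(PREC[tokens[i]] for i in ops)
        split = max(i for i in ops if PREC[tokens[i]] == lowest)
        # Steps 3-5: view the expression as `x OP y` and recurse on both halves.
        return [tokens[split], parse(tokens[:split]), parse(tokens[split + 1:])]

    print(parse("1 + 2 / 3".split()))   # ['+', '1', ['/', '2', '3']]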
It's easy to speed up step 2 by keeping a table of all the operators in an expression, sorted by their precedence levels. For this table to work properly, the positions of all the tokens must never change.

Big choices are hand-rolled recursive descent vs LALR, probably backed by the Bison or Lemon generator and re2c for a lexer.
Passing the LALR(1) check, i.e. having bison actually accept the grammar without complaining about ambiguities, is either very annoying or requires thinking clearly about your language, depending on your perspective.
I claim that a lot of the misfires in language implementations are from not doing that work, and using a hand rolled approximation to the parser you had in mind instead, because that's nicer/easier than the formal grammar.
The parser generators emit useless error messages, yes. So if you want nice user feedback, that'll be handrolled in some fashion. Sure.
Sometimes people write a grammar and use a hand rolled parser, hoping they match. Maybe with tests.
The right answer, used by no one as far as I can tell, is to parse with the LALR-generated parser, then if that rejects your string because the program was ill formed, call the hand-rolled one for guesswork/diagnostics. Never feed the parse tree from the hand-rolled parser into the rest of the compiler; that way lies all the bugs.
As alternative phrasing, your linter and your parser don't need to be the same tool, even if it's convenient in some senses to mash them together.
This feels like a recipe for disaster. If the hand-rolled parser won't match a formal grammar, why would it match the generated parser?
The poor programmer will be debugging the wrong thing.
It reminds me of my short stint writing C++ where I'd read undefined memory in release mode, but when I ran it under debug mode it just worked.
The hand rolled parser might do, but also might not, what with software being difficult and testing being boring and so forth.
I assume it’s far too late at this point, but that almost always means that you’re invoking UB. Your next step should be enabling UBSan.
For new languages this should be avoided - just design a sane grammar in the first place.
I wish I could have saved the source. It would be fun to see it.
It is also written in a badass style and argues that this is superior to parser generators.
For those to whom they are new: I found them a little tricky to implement directly from Pratt's paper or even Crockford's javascript that popularized them.
So, through trial and error I figured out how to actually implement them in regular languages (i.e. not in Lisp).
If it helps, examples in C and Go are here:
https://github.com/glycerine/PrattParserInC
https://github.com/glycerine/zygomys/blob/master/zygo/pratt....
I find them easier to work with than the cryptic LALR(1) bison/yacc tools, but then I never really felt like I mastered yacc to begin with.
You then construct the parser by combining unambiguous parsers from the bottom up. The result ends up unambiguous by construction.
This high level algorithm is much easier to implement without a global lexer. Global lexing can be a source of inadvertent ambiguity. Strings make this obvious. If instead, you lex in a context specific way, it is usually easy to efficiently eliminate ambiguities.
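A tiny, simplified Python illustration (hypothetical names) of what context-specific lexing means for strings:

    # Simplified sketch: instead of a global tokenizer that must already know
    # how to tokenize string contents, the string parser lexes its own context
    # and takes everything up to the closing quote verbatim.
    def parse_value(src, i):
        if src[i] == '"':
            j = src.index('"', i + 1)         # lex inside the string context
            return ("str", src[i + 1:j]), j + 1
        j = i
        while j < len(src) and not src[j].isspace():
            j += 1                            # outside strings, lex "code" tokens
        return ("word", src[i:j]), j

    print(parse_value('"if x then y"', 0))    # (('str', 'if x then y'), 13)
    print(parse_value('if', 0))               # (('word', 'if'), 2)

The string contents never pass through the "code" tokenizer, so keywords and operators inside strings can't collide with the surrounding grammar.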
It seems that, from the outside looking in, ~all significant PL projects end up using a hand-written recursive descent parser, eventually.
A PEG is always unambiguous because it picks the first option - but whether that was the intended parse is not necessarily straightforward. In practice these problems don't usually show up, so they're fine to work with.
The advantage LR gives you is that it produces a parser where there are no ambiguities and every successful parse is the one intended. An LR grammar is a proof, as well as a means of producing a parser. A decent LR parser generator is like a simple proof assistant - it will find problems with your language before you do, so you can fix your syntax before putting it into production.
In "real-world" parsing tasks as you put it, the problems of LR parser generators is that they're not the best suited to parsing languages that have ambiguities, like C, C++ and many others. Some of the complaints about LR are about the workarounds that need to be done to parse these languages, where it's obviously the wrong tool for the job because those languages aren't described by proper LR grammars.
But if you're designing a new language from scratch, surely it's better to not repeat those mistakes? If you carefully design your language to be parsed by an LR grammar then other developers who come to parse your language won't encounter those issues. They won't need lexical tie-ins and other nonsense that complicates the process.
fjfaase•6mo ago
Recursive descent parsers can simply be implemented with recursive functions. Implementing semantic checks becomes easy with additional parameters.
WalterBright•6mo ago
What a waste of time. I failed miserably.
However, I also realized that the only semantic information needed was to keep track of typedefs. That made recursive descent practical and effective.
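A minimal sketch of that bookkeeping (Python, hypothetical names; a real C parser obviously does much more):

    # Minimal sketch: the one piece of semantic feedback the parser needs is
    # the set of names declared by typedef, so that `(T) * x` can be
    # classified as a cast rather than a multiplication.
    typedef_names = {"size_t"}          # grows as typedef declarations are parsed

    def classify_paren_expr(name):
        if name in typedef_names:
            return "cast"               # (size_t) * p  -> cast of a dereference
        return "multiplication"         # (count) * p   -> parenthesized multiply

    print(classify_paren_expr("size_t"))   # cast
    print(classify_paren_expr("count"))    # multiplication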
ufo•6mo ago
[1] https://en.wikipedia.org/wiki/Lexer_hack
fjfaase•6mo ago