The majority of production compilers use hand-rolled parsers, ostensibly for better error reporting and panic-mode synchronization.
Admittedly, INI is quite a simple format, which is why I mention this only as an anecdote.
I've got 10 full-time senior engineers on a project heading into its 15th year. We rewrite even extremely low-level code like std::vector or malloc to make sure it matches our requirements.
UNIX was written by a couple of dudes.
In this case, this was a project at $EMPLOYER in an existing codebase with colleagues who had never seen Haskell code; using Haskell would've been a major error in judgement.
Haskell is a great language. It can even be a great language for beginners, especially if there's some senior help on hand.
But it's a terrible language to foist upon an unsuspecting and even unwilling victim.
It does lex/yacc-style lexer and parser generation, producing an LR(1) parser, but uses the CPCT+ algorithm for error recovery. IIRC, the way it works is that when an error occurs, the nearest likely valid token is inserted, the error is recorded, and parsing continues.
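Here is a rough TypeScript sketch of that insert-a-token-and-continue idea in a hand-written parser. To be clear, this is not how CPCT+ itself is implemented; the `Parser` class and its methods are invented purely for illustration.

```
// Sketch only: an `expect` that recovers by pretending the missing token
// was there, recording the error, and carrying on.
type Token = { kind: string; text: string };
type ParseError = { at: number; message: string };

class Parser {
  errors: ParseError[] = [];
  constructor(private tokens: Token[], private pos = 0) {}

  at(kind: string): boolean {
    return this.tokens[this.pos]?.kind === kind;
  }

  expect(kind: string): Token {
    if (this.at(kind)) return this.tokens[this.pos++];
    // Token is missing: record the error and synthesize ("insert") the
    // expected token so parsing can continue past this point.
    this.errors.push({ at: this.pos, message: `expected '${kind}'` });
    return { kind, text: "" };
  }
}
```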
I would use this for anything that is simple enough, and recursive descent for anything more complicated or where even more context is needed for good errors.
What drew me to grmtools (I eventually contributed to it) was that you can evaluate grammars basically like an interpreter, without going through that compilation process, which leads to fairly quick turnaround times during language development.
I hope this year I can work on porting my grmtools based LSP to browser/wasm.
```
const result: ast.Expression[] = [];
p.expect("(");
while (!p.eof() && !p.at(")")) {
  const subexpr = expression(p);
  assert(subexpr !== undefined); // << here
  result.push(subexpr);
  if (!p.at(")")) p.expect(",");
}
p.expect(")");
return result;
```

This is resilient parsing: we are parsing source code with syntax errors, but we still want to produce a best-effort syntax tree. Although an expression is required by the grammar at that position, the `expression` function might still return nothing if the user typed some garbage there instead of a valid expression.
However, even if we return nothing due to garbage, there are two possible behaviors:
* We can consume no tokens, guessing that what looks like "garbage" from the perspective of the expression parser is actually the start of the next, larger syntax construct:
```
function f() { let x = foo(1, let not_garbage = 92; }
```
In this example, it would be smart to _not_ consume `let` when parsing `foo(`'s arglist.
* Alternatively, we can consume some tokens, guessing that the user _meant_ to write an expression there:
```
function f() { let x = foo(1, /); }
```
In the above example, it would be smart to skip over `/`.
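A minimal sketch, in the same TypeScript-ish style as the snippet above, of how an `expression` function might pick between those two behaviors. The `peek`/`bump`/`error` methods and the `RECOVERY_SET` are assumptions for illustration, not part of the original code:

```
type Expression = { kind: string; text: string }; // stand-in for ast.Expression

// Hypothetical parser surface, mirroring the `p` used above.
interface Parser {
  eof(): boolean;
  at(kind: string): boolean;    // is the current token of this kind?
  peek(): string;               // kind of the current token
  bump(): void;                 // consume the current token
  error(message: string): void; // record an error without aborting
}

// Token kinds that typically start the next larger construct. Seeing one
// of these where an expression is required suggests the expression is
// simply missing.
const RECOVERY_SET = new Set(["let", "fn", "return", ")", "}", ";"]);

function expression(p: Parser): Expression | undefined {
  if (p.at("number") || p.at("identifier")) {
    // Happy path; a single atom is enough for this sketch.
    const expr = { kind: p.peek(), text: "" };
    p.bump();
    return expr;
  }
  p.error("expected expression");
  if (p.eof() || RECOVERY_SET.has(p.peek())) {
    // Behavior 1: consume nothing. The `let` in `foo(1, let not_garbage = 92;`
    // is better treated as the start of the next statement.
    return undefined;
  }
  // Behavior 2: the user meant an expression here. Skip the garbage token
  // (the `/` in `foo(1, /)`) so parsing can make progress.
  p.bump();
  return undefined;
}
```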
kccqzy•1mo ago
This also has the additional benefit of work sharing. A production like `A B | C B` will ensure that if parsing A or C consumes the same number of characters, the work to parse B is shared, even though the production was not literally factored into `(A | C) B`.
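A toy sketch of the state-set idea behind that sharing: each live parse head is represented by the terminals it still has to match, heads advance over the input in lockstep, and identical heads are merged, so `B` is matched once even though it appears in both alternatives. The grammar and its flattening to plain strings are assumptions made purely to keep the sketch short.

```
// Toy grammar, flattened to terminal strings:
//   S -> A B | C B,   A -> "a" | "ab",   C -> "a" | "ac",   B -> "xyz"
// Each head is the string of terminals it still has to match.
const START_HEADS = ["a" + "xyz", "ab" + "xyz", "a" + "xyz", "ac" + "xyz"];

function recognize(input: string): boolean {
  // The Set dedupes immediately: "axyz" coming from `A B` and from `C B`
  // collapses into a single head, so `B` will only be matched once.
  let heads = new Set(START_HEADS);
  for (const ch of input) {
    const next = new Set<string>();
    for (const rest of heads) {
      if (rest[0] === ch) next.add(rest.slice(1)); // advance this head
    }
    heads = next; // every surviving head has consumed the same prefix
    if (heads.size === 0) return false;
  }
  return heads.has(""); // some head matched the whole input
}

console.log(recognize("axyz"));  // true
console.log(recognize("abxyz")); // true
console.log(recognize("aq"));    // false
```

Because every head has consumed exactly the same number of characters at each step, merging is a plain set union; recursive, variable-length productions break that invariant, which is the difficulty the next comment points out.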
luizfelberti•1mo ago
It's a bit harder to adapt the technique to parsers because the Thompson NFA always increments the sequence pointer by the same amount, while a parser's production usually has a variable size, making it harder to run several parsing heads in lockstep.
[0] https://swtch.com/~rsc/regexp/regexp2.html
Porygon•1mo ago
I recently tried that approach while simultaneously building an abstract syntax tree, but I dropped it in favor of a right-recursive grammar for now, since restoring the AST when backtracking got a bit complex.