I think hacking GCC/LLVM can be pretty challenging, but hey, they are real, production-grade compilers, not just typical academic projects.
Try Andrew Appel's "Modern Compiler Implementation in Java/C/ML" or Writing a C Compiler (https://norasandler.com/book), which is much more recent.
Eventually, you'd want to hack GCC/LLVM because they are production-grade compilers.
To dig deeper into other advanced topics, there are:
https://www.cs.cornell.edu/courses/cs6120/2025fa/
and
https://mcyoung.xyz/2025/10/21/ssa-1/
So maybe writing a compiler with exactly one FE (for a simple language) and one BE (for a simple architecture), with, say, 80% of the optimizations, could be a doable project.
(1) We should define what we mean by that, because there are thousands of front-ends and back-ends.
> Modern Compiler Implementation by Andrew W. Appel
It comes in three flavors: C, ML (Meta Language), and Java.
https://www.cs.princeton.edu/~appel/modern/
Writing a compiler in Standard ML is as natural as writing a grammar and denotational semantics.
Compiler writing is becoming an extinct art.
Are you sure it’s an extinct art though? LLVM is flourishing, interesting IRs like MLIR keep coming to life, many ML-adjacent projects build their own compilers (PyTorch, Mojo, tinygrad), big tech companies like Intel, AMD, Nvidia, and Apple contribute to multiple different compilers, and projects integrate with one another at different levels of abstraction (PyTorch -> Triton -> CUDA). There is a lot of compilation going on from one language to another.
Not to mention the many languages in the mainstream that weren’t that popular 10 years ago: think Rust, Zig, Go.
cooking is an art. software is engineering. no one would say "building skyscrapers as an art is dying".
Ironically, with modern Java you can now do that much more easily than with the approach taken in the Java variant back in 1997.
For more modern folks, one can use the ML version as inspiration for doing the book exercises in OCaml, Haskell, F# or Rust.
Writing compilers for a living, like CS research, is a niche domain, but not something I would consider an extinct art.
Do you distinguish between writing a compiler and writing an optimizing compiler, and if so, how is writing an optimizing compiler an extinct art?
Equality saturation, dominator graphs, chordal register allocation, hardware-software codesign, etc.: there are many new avenues of research for compilers, and these are just the ones off the top of my head that are relevant to my work. Most optimization work is R&D, much of it is left unimplemented at scale, and things like the phase-ordering problem and IR validation are hard to do in practice, even given ample resources and time.
No, not at all; the teachings and techniques have been surpassed for four decades or so.
The LALR algorithm is flawed: it works only for a subset of CFGs instead of all of them. That alone is already a death blow. If you try out BNF grammars in the wild, it is nearly guaranteed that they are complex enough for LALR to shit itself with shift-reduce (S-R) conflicts.
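The canonical small example is the dangling else. Feed bison/yacc a grammar like this minimal sketch (mine, not taken from any real grammar; EXPR is simplified to a single token) and it reports a shift-reduce conflict:

    %token IF ELSE EXPR OTHER
    %%
    stmt
        : IF '(' EXPR ')' stmt              /* conflict: on ELSE, shift or reduce? */
        | IF '(' EXPR ')' stmt ELSE stmt
        | OTHER
        ;

After seeing IF '(' EXPR ')' stmt with ELSE as the lookahead, the parser cannot tell whether to reduce the if-without-else rule or shift the ELSE, so bison warns about one S-R conflict (and silently resolves it by shifting). Real-world grammars pile up many such cases.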
The technique of generating and dumping source code is awkward and the reasons that made that a necessity back then are no longer relevant. A good parser is simply a function call from a code library.
The technique of tokenising, then parsing in a second pass is awkward, introduces errors, and again the reasons that made it a necessity back then are no longer relevant. A good parser works "on-line" (a term of art, not meaning "over a computer network" here), tokenising and parsing at the same time, in a single pass.
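For what the on-line style looks like in practice, here is a minimal single-pass sketch in C for a hypothetical toy grammar of numbers joined by '+': the parser pulls each token on demand instead of lexing the whole input up front.

    #include <ctype.h>

    /* A toy "on-line" parser: the lexer is just a function the
       parser calls whenever it needs the next token; there is no
       token buffer and no second pass over the input. */
    typedef enum { TOK_NUM, TOK_PLUS, TOK_EOF, TOK_ERR } Tok;

    static const char *src;
    static long val;

    static Tok next_token(void) {
        while (isspace((unsigned char)*src)) src++;
        if (*src == '\0') return TOK_EOF;
        if (*src == '+') { src++; return TOK_PLUS; }
        if (!isdigit((unsigned char)*src)) return TOK_ERR;
        val = 0;
        while (isdigit((unsigned char)*src))
            val = val * 10 + (*src++ - '0');
        return TOK_NUM;
    }

    /* expr := NUM ('+' NUM)* ; parsed and evaluated in one pass. */
    int parse_sum(const char *input, long *total) {
        src = input;
        if (next_token() != TOK_NUM) return 0;
        *total = val;
        for (;;) {
            Tok t = next_token();
            if (t == TOK_EOF) return 1;
            if (t != TOK_PLUS || next_token() != TOK_NUM) return 0;
            *total += val;
        }
    }

E.g. parse_sum("1 + 2 + 40", &n) sets n to 43 without ever materialising a token list.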
The book precedes Unicode by a long time and you will not learn how to properly deal with text according to the rules laid out in its various relevant reports and annexes.
The book also does not take into consideration the syntactic and semantic niceties and features that regexes have gained since, which should definitely also be part of a grammar parser.
> recommend any other learning resources
Depends on what your goals are. For a broad and shallow theoretical introduction and to see what's out there, browse the slide decks of university lectures for this topic on the Web.
I think you conflate “learning to build a compiler for a toy language” with “being effective at working on a modern optimizing compiler suite like GCC/LLVM”
The book is perfectly fine for the first use case, and never claims to touch upon the latter.
The major problem is not finding the sophisticated things, but understanding how to do them in simple-ish ways.
Doing otherwise is a major waste of time!
P.S.: And yes, even once you get the basics and learn the jargon, it is still a problem to find the neat tricks, but you have probably already figured out that there is nothing like reading the source... (sadly that source is in C or, worse, C++, but lately, with Rust gaining traction, at least it makes more sense!)
What I wanted to say is that you do not need complex algorithms to implement a parser if you have a grammar that can be parsed with a single look-ahead lexical element.
The grammar of C is ambiguous. The statement "a * b;" can be parsed both as a declaration of the variable 'b' with type pointer-to-'a' and as an expression statement multiplying 'a' and 'b'. It all depends on whether 'a' is a type or not. In most cases it will be a declaration, because why multiply two variables and ignore the result? One trick for dealing with this is to give the declaration grammar rule precedence over the expression grammar rule. However, this is not something that can be done with many parser generating tools.
Yet the first C compilers were single-pass compilers with a single look-ahead lexical token, probably implemented as recursive descent parsers. That worked because the compiler kept a (reverse) list of all declarations, which let it determine, when 'a' was parsed, whether it started a declaration or a statement, based on whether 'a' had previously been defined as a type.
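This trick is nowadays usually called the "lexer hack": the symbol table feeds back into tokenisation, so 'a' already arrives at the parser as either a type name or a plain identifier. A rough sketch in C (is_typedef_name is a hypothetical symbol-table lookup):

    /* The "lexer hack" for C's "a * b;" ambiguity: the lexer asks
       the symbol table whether an identifier names a typedef and
       hands the parser two different token kinds accordingly. */
    typedef enum { TYPE_NAME, IDENTIFIER } TokKind;

    extern int is_typedef_name(const char *name);  /* declarations seen so far */

    TokKind classify_identifier(const char *name) {
        return is_typedef_name(name) ? TYPE_NAME : IDENTIFIER;
    }

With the tokens split this way, the grammar itself stays unambiguous: TYPE_NAME '*' IDENTIFIER ';' is a declaration and IDENTIFIER '*' IDENTIFIER ';' is an expression statement.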
[1] https://stackoverflow.com/questions/12971191/grammars-that-r...
I have been studying compiler design for several years and I have found that writing a simple parser by hand is the best way to go most of the time. There is a process to it: You start with a "Hello, world!" program and you parse it character by character with no separate lexer. You ensure that at each step in your parser, you make an unambiguous decision at each character and never backtrack. The decision may be that you need to enter a disambiguation function that also only moves forward. If the grammar gets in the way of conserving this property, change the grammar not the parser design.
If you follow that high-level algorithm, you will end up with a parser whose performance is linear in the number of characters, which is asymptotically as good as you can hope to do. It is both easy and simple to implement (provided you have solid programming fundamentals) and no caching is needed for efficiency.
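To make that concrete, here is a minimal sketch in C of a parser in that style, for a hypothetical tiny grammar of nested parentheses around a leaf: no separate lexer, every decision made on the current character, and the cursor never moves backward.

    /* expr := '(' expr ')' | 'x'                                    */
    /* At every character there is exactly one valid move, so the    */
    /* parser only ever goes forward: linear time, no backtracking.  */
    static const char *p;

    static int parse_expr(void) {
        if (*p == 'x') { p++; return 1; }  /* unambiguous: leaf */
        if (*p == '(') {                   /* unambiguous: nested expr */
            p++;
            if (!parse_expr()) return 0;
            if (*p != ')') return 0;
            p++;
            return 1;
        }
        return 0;                          /* anything else: syntax error */
    }

    int parse(const char *input) {
        p = input;
        return parse_expr() && *p == '\0'; /* succeed only on full input */
    }

So parse("((x))") succeeds and parse("(x") fails, each after a single forward scan. If a grammar ever forced a choice this scheme cannot make on the current character, the advice above applies: change the grammar, not the parser design.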
Deliberate backtracking in a compiler is an evil that should be avoided at all costs. It can inject exponentially poor performance into a developer's primary feedback loop, which is a theft of their time for which they have little to no recourse.
If you teach a course about compiler construction, I think it might be better to teach your students how to write a grammar for some language and to use an interactive parser that parses input according to the grammar (and visualizes the AST). See for example [1] and [2]. (Even if you feed it the C grammar, it succeeds in parsing thousands of lines of (preprocessed) C code at every keystroke. This interpreting parser is written in JavaScript and uses a simple caching strategy for performance.)
For the scripting language [3] in some of the Bizzdesigns modeling tools, a similar interactive parser was used (implemented in C++). This scripting language is also internally used for implementing the various meta-models. These scripts are parsed once, cached, and interpreted often.
I think this is also true for many domain-specific languages (DSLs).
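The "simple caching strategy" such interactive parsers rely on is typically packrat-style memoization: the result of trying each rule at each input position is cached, so re-parsing at every keystroke only redoes work near the edit. A rough sketch in C of just the memo table (the names and fixed sizes are mine, not how the tools above actually implement it):

    /* Memoize (rule, position) -> (success, end position) so that   */
    /* every rule is tried at most once at each input offset.        */
    typedef struct { int known; int ok; long end; } Memo;

    #define N_RULES   64
    #define MAX_INPUT 4096
    static Memo memo[N_RULES][MAX_INPUT];

    int apply_rule(int rule, long pos, long *end,
                   int (*parse)(long pos, long *end)) {
        Memo *m = &memo[rule][pos];
        if (!m->known) {              /* first visit: actually parse */
            m->ok = parse(pos, &m->end);
            m->known = 1;
        }
        *end = m->end;                /* later visits are O(1) */
        return m->ok;
    }

(An editor would additionally invalidate the entries at or after the edited offset.)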
[1] https://info.itemis.com/demo/agl/editor
[2] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu...
[3] https://help.bizzdesign.com/articles/#!horizzon-help/the-scr...
Backtracking parsers lead people into creating bad grammars. In principle people are perfectly capable of creating simple context-free grammars and writing any parser they want to read them. But in practice your tools guide your decisions to a huge extent, and the less experience people have, the more true that becomes; so it's a really dangerous tool, in particular for students.
Also, fully backtracking parsers have the most unpredictable and hardest-to-fix error conditions of all the possibilities. There exist middle grounds where parser execution is still predictable yet you get most of the benefit of backtracking, but it takes a lot of complex engineering decisions to reach that optimum and keep your project close to it.
Immediate edit: in a CS context there is one reason that is probably more important than any other. People use parsers as an application of the theory of automata and regular languages, and those two concepts are far more important than the practical implications of parsing.
My experience is that if a backtracking parser lists all the possible terminals it was expecting at the first location it failed to get past (with some additional information about the rules they occurred in), that usually gives enough information to understand what is wrong with the input or the grammar.
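One common way to implement exactly that is "farthest failure" tracking: every failed terminal match records the position and the expected terminal, and only the set collected at the farthest position reached is reported. A small sketch in C (names are mine; a real implementation would also deduplicate and record the active rules):

    /* Record which terminals were expected at the farthest input    */
    /* position any parse attempt reached; report them on failure.   */
    #define MAX_EXPECTED 32

    static long        farthest = -1;
    static const char *expected[MAX_EXPECTED];
    static int         n_expected;

    void expect_failed(long pos, const char *terminal) {
        if (pos > farthest) {         /* got farther: start a fresh set */
            farthest = pos;
            n_expected = 0;
        }
        if (pos == farthest && n_expected < MAX_EXPECTED)
            expected[n_expected++] = terminal;
    }

Every place the parser fails to match a terminal calls expect_failed(pos, "name-of-terminal"); once the overall parse fails, `farthest` and `expected` describe the most useful error location.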
I'll do you one better. The main compiler in my course uses only six characters to parse every single project: `(read)`.
A lesson I have to relearn every time: while you can always skip lexing (which is really just another parser), it almost always screws you over to do so.