I think hacking GCC/LLVM can be pretty challenging, but hey, they are real, production-grade compilers, not just typical academic projects.
Try Andrew Appel's "Modern Compiler Implementation in Java/C/ML" or Writing a C Compiler (https://norasandler.com/book), which is much more recent.
Eventually, you'd want to hack GCC/LLVM because they are production-grade compilers.
To dig deeper into other advanced topics, there are:
https://www.cs.cornell.edu/courses/cs6120/2025fa/
and
https://mcyoung.xyz/2025/10/21/ssa-1/
So maybe writing a compiler with exactly one FE (for a simple language) and one BE (for a simple architecture), with, say, 80% of the optimizations, could be a doable project.
(1) We should define what we mean by that, because there are thousands of front-ends and back-ends.
> Modern Compiler Implementation by Andrew W. Appel
It comes in three flavors: C, ML (Meta Language), and Java.
https://www.cs.princeton.edu/~appel/modern/
Writing a compiler in Standard ML is as natural as writing a grammar and denotational semantics.
Compiler writing is becoming an extinct art.
Are you sure it’s an extinct art though? LLVM is flourishing, interesting IRs like MLIR keep coming to life, many ML-adjacent projects build their own compilers (PyTorch, Mojo, tinygrad), big tech companies like Intel, AMD, Nvidia, and Apple contribute to multiple different compilers, and projects integrate with one another at different levels of abstraction (PyTorch -> Triton -> CUDA). There is a lot of compilation going on from one language to another.
Not to mention the many languages in the mainstream that weren’t that popular 10 years ago: think Rust, Zig, Go.
cooking is an art. software is engineering. no one would say "building skyscrapers as an art is dying".
Ironically, with modern Java you can now do that much more easily than with the approach taken in the Java variant back in 1997.
For more modern folks, one can use the ML version as inspiration for doing the book exercises in OCaml, Haskell, F# or Rust.
Writing compilers for a living, like CS research, is a niche domain, but not something I would consider an extinct art.
Do you distinguish between writing a compiler and writing an optimizing compiler, and if so, how is writing an optimizing compiler an extinct art?
Equality saturation, dominator graphs, chordal register allocation, hardware-software codesign, etc.: there are many new avenues of research for compilers, and these are just the ones off the top of my head that are relevant to my work. Most optimization work is R&D, much of it is left unimplemented at scale, and things like the phase-ordering problem and IR validation are hard to do in practice, even given ample resources and time.
No, not at all; the teachings and techniques have been surpassed for four decades or so.
The LALR algorithm is flawed: it works only for a subset of CFGs instead of all of them. That alone is already a death blow. If you try out BNF grammars in the wild, it is nearly guaranteed that they are complex enough for LALR to shit itself with shift-reduce (S-R) conflicts.
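The canonical small example is the dangling else. Feed bison/yacc a grammar like this minimal sketch (mine, not taken from any real grammar; EXPR is simplified to a single token) and it reports a shift-reduce conflict:

    %token IF ELSE EXPR OTHER
    %%
    stmt
        : IF '(' EXPR ')' stmt              /* conflict: on ELSE, shift or reduce? */
        | IF '(' EXPR ')' stmt ELSE stmt
        | OTHER
        ;

After seeing IF '(' EXPR ')' stmt with ELSE as the lookahead, the parser cannot tell whether to reduce the if-without-else rule or shift the ELSE, so bison warns about one S-R conflict (and silently resolves it by shifting). Real-world grammars pile up many such cases.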
The technique of generating and dumping source code is awkward and the reasons that made that a necessity back then are no longer relevant. A good parser is simply a function call from a code library.
The technique of tokenising, then parsing in a second pass is awkward, introduces errors, and again the reasons that made it a necessity back then are no longer relevant. A good parser works "on-line" (a term of art, not meaning "over a computer network" here), tokenising and parsing at the same time, in a single pass.
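For what the on-line style looks like in practice, here is a minimal single-pass sketch in C for a hypothetical toy grammar of numbers joined by '+': the parser pulls each token on demand instead of lexing the whole input up front.

    #include <ctype.h>

    /* A toy "on-line" parser: the lexer is just a function the
       parser calls whenever it needs the next token; there is no
       token buffer and no second pass over the input. */
    typedef enum { TOK_NUM, TOK_PLUS, TOK_EOF, TOK_ERR } Tok;

    static const char *src;
    static long val;

    static Tok next_token(void) {
        while (isspace((unsigned char)*src)) src++;
        if (*src == '\0') return TOK_EOF;
        if (*src == '+') { src++; return TOK_PLUS; }
        if (!isdigit((unsigned char)*src)) return TOK_ERR;
        val = 0;
        while (isdigit((unsigned char)*src))
            val = val * 10 + (*src++ - '0');
        return TOK_NUM;
    }

    /* expr := NUM ('+' NUM)* ; parsed and evaluated in one pass. */
    int parse_sum(const char *input, long *total) {
        src = input;
        if (next_token() != TOK_NUM) return 0;
        *total = val;
        for (;;) {
            Tok t = next_token();
            if (t == TOK_EOF) return 1;
            if (t != TOK_PLUS || next_token() != TOK_NUM) return 0;
            *total += val;
        }
    }

E.g. parse_sum("1 + 2 + 40", &n) sets n to 43 without ever materialising a token list.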
The book precedes Unicode by a long time and you will not learn how to properly deal with text according to the rules laid out in its various relevant reports and annexes.
The book also does not take into consideration the syntactic and semantic niceties and features that regexes have gained since, which should definitely also be part of a grammar parser.
> recommend any other learning resources
Depends on what your goals are. For a broad and shallow theoretical introduction and to see what's out there, browse the slide decks of university lectures for this topic on the Web.
I think you conflate “learning to build a compiler for a toy language” with “being effective at working on a modern optimizing compiler suite like GCC/LLVM”
The book is perfectly fine for the first use case, and never claims to touch upon the latter.
The major problem is not finding the sophisticated things, but understanding how to do them in simple-ish ways.
Doing otherwise is a major waste of time!
P.S.: And yes, even once you get the basics and learn the jargon, it is still a problem to find the neat tricks, but you have probably already figured out that there is nothing like reading the source... (sadly that source is in C or, worse, C++, but lately, with Rust gaining traction, at least it makes more sense!)
What I wanted to say is that you do not need complex algorithms to implement a parser if you have a grammar that can be parsed with a single look-ahead lexical element.
The grammar of C is ambiguous. The statement "a * b;" can be parsed both as a declaration of the variable 'b' with type pointer-to-'a' and as an expression statement multiplying 'a' and 'b'. It all depends on whether 'a' is a type or not. In most cases it will be a declaration, because why multiply two variables and ignore the result? One trick for dealing with this is to give the declaration grammar rule precedence over the expression grammar rule. However, this is not something that can be done with many parser generating tools.
Yet the first C compilers were single-pass compilers with a single look-ahead lexical token, probably implemented as recursive descent parsers. That worked because the compiler kept a (reverse) list of all declarations, which let it determine, when 'a' was parsed, whether it started a declaration or a statement, based on whether 'a' had previously been defined as a type.
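This trick is nowadays usually called the "lexer hack": the symbol table feeds back into tokenisation, so 'a' already arrives at the parser as either a type name or a plain identifier. A rough sketch in C (is_typedef_name is a hypothetical symbol-table lookup):

    /* The "lexer hack" for C's "a * b;" ambiguity: the lexer asks
       the symbol table whether an identifier names a typedef and
       hands the parser two different token kinds accordingly. */
    typedef enum { TYPE_NAME, IDENTIFIER } TokKind;

    extern int is_typedef_name(const char *name);  /* declarations seen so far */

    TokKind classify_identifier(const char *name) {
        return is_typedef_name(name) ? TYPE_NAME : IDENTIFIER;
    }

With the tokens split this way, the grammar itself stays unambiguous: TYPE_NAME '*' IDENTIFIER ';' is a declaration and IDENTIFIER '*' IDENTIFIER ';' is an expression statement.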
[1] https://stackoverflow.com/questions/12971191/grammars-that-r...
I have been studying compiler design for several years and I have found that writing a simple parser by hand is the best way to go most of the time. There is a process to it: You start with a "Hello, world!" program and you parse it character by character with no separate lexer. You ensure that at each step in your parser, you make an unambiguous decision at each character and never backtrack. The decision may be that you need to enter a disambiguation function that also only moves forward. If the grammar gets in the way of conserving this property, change the grammar not the parser design.
If you follow that high-level algorithm, you will end up with a parser whose performance is linear in the number of characters, which is asymptotically as good as you can hope to do. It is both easy and simple to implement (provided you have solid programming fundamentals) and no caching is needed for efficiency.
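To make that concrete, here is a minimal sketch in C of a parser in that style, for a hypothetical tiny grammar of nested parentheses around a leaf: no separate lexer, every decision made on the current character, and the cursor never moves backward.

    /* expr := '(' expr ')' | 'x'                                    */
    /* At every character there is exactly one valid move, so the    */
    /* parser only ever goes forward: linear time, no backtracking.  */
    static const char *p;

    static int parse_expr(void) {
        if (*p == 'x') { p++; return 1; }  /* unambiguous: leaf */
        if (*p == '(') {                   /* unambiguous: nested expr */
            p++;
            if (!parse_expr()) return 0;
            if (*p != ')') return 0;
            p++;
            return 1;
        }
        return 0;                          /* anything else: syntax error */
    }

    int parse(const char *input) {
        p = input;
        return parse_expr() && *p == '\0'; /* succeed only on full input */
    }

So parse("((x))") succeeds and parse("(x") fails, each after a single forward scan. If a grammar ever forced a choice this scheme cannot make on the current character, the advice above applies: change the grammar, not the parser design.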
Deliberate backtracking in a compiler is an evil that should be avoided at all costs. It can inject exponentially poor performance into a developer's primary feedback loop, which is a theft of their time for which they have little to no recourse.
If you teach a course about compiler construction, I think it might be better to teach your students how to write a grammar for some language and to use an interactive parser that parses input according to the grammar (and visualizes the AST). See for example [1] and [2]. (Even if you feed it the C grammar, it succeeds in parsing thousands of lines of (preprocessed) C code at every keystroke. This interpreting parser is written in JavaScript and uses a simple caching strategy for performance.)
For the scripting language [3] in some of the Bizzdesigns modeling tools, a similar interactive parser was used (implemented in C++). This scripting language is also internally used for implementing the various meta-models. These scripts are parsed once, cached, and interpreted often.
I think this is also true for many domain-specific languages (DSLs).
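The "simple caching strategy" such interactive parsers rely on is typically packrat-style memoization: the result of trying each rule at each input position is cached, so re-parsing at every keystroke only redoes work near the edit. A rough sketch in C of just the memo table (the names and fixed sizes are mine, not how the tools above actually implement it):

    /* Memoize (rule, position) -> (success, end position) so that   */
    /* every rule is tried at most once at each input offset.        */
    typedef struct { int known; int ok; long end; } Memo;

    #define N_RULES   64
    #define MAX_INPUT 4096
    static Memo memo[N_RULES][MAX_INPUT];

    int apply_rule(int rule, long pos, long *end,
                   int (*parse)(long pos, long *end)) {
        Memo *m = &memo[rule][pos];
        if (!m->known) {              /* first visit: actually parse */
            m->ok = parse(pos, &m->end);
            m->known = 1;
        }
        *end = m->end;                /* later visits are O(1) */
        return m->ok;
    }

(An editor would additionally invalidate the entries at or after the edited offset.)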
[1] https://info.itemis.com/demo/agl/editor
[2] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu...
[3] https://help.bizzdesign.com/articles/#!horizzon-help/the-scr...
Backtracking parsers lead people into creating bad grammars. In principle people are perfectly capable of creating simple context-free grammars and writing any parser they want to read them. But in practice your tools guide your decisions to a huge extent, and the less experience people have, the more true that becomes; so it's a really dangerous tool, in particular for students.
Also, fully backtracking parsers have the most unpredictable and hardest-to-fix error conditions of all the possibilities. There exist middle grounds where parser execution is still predictable yet you get most of the benefit of backtracking, but it takes a lot of complex engineering decisions to reach that optimum and keep your project close to it.
Immediate edit: in a CS context there is one reason that is probably more important than any other. People use parsers as an application of the theory of automata and regular languages, and those two concepts are far more important than the practical implications of parsing.
My experience is that if a backtracking parser lists all the possible terminals it was expecting at the first location it failed to get past (with some additional information about the rules they occurred in), that usually gives enough information to understand what is wrong with the input or the grammar.
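One common way to implement exactly that is "farthest failure" tracking: every failed terminal match records the position and the expected terminal, and only the set collected at the farthest position reached is reported. A small sketch in C (names are mine; a real implementation would also deduplicate and record the active rules):

    /* Record which terminals were expected at the farthest input    */
    /* position any parse attempt reached; report them on failure.   */
    #define MAX_EXPECTED 32

    static long        farthest = -1;
    static const char *expected[MAX_EXPECTED];
    static int         n_expected;

    void expect_failed(long pos, const char *terminal) {
        if (pos > farthest) {         /* got farther: start a fresh set */
            farthest = pos;
            n_expected = 0;
        }
        if (pos == farthest && n_expected < MAX_EXPECTED)
            expected[n_expected++] = terminal;
    }

Every place the parser fails to match a terminal calls expect_failed(pos, "name-of-terminal"); once the overall parse fails, `farthest` and `expected` describe the most useful error location.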
I'll do you one better. The main compiler in my course uses only six characters to parse every single project: `(read)`.
A lesson I have to relearn every time: while you can always skip lexing (which is really just another parser), it almost always screws you over to do so.