Tree-sitter vs. Language Servers

https://lambdaland.org/posts/2026-01-21_tree-sitter_vs_lsp/

266•ashton314•2w ago

Comments

tetris11•2w ago

I love tree-sitter+eglot, but a few of the languages/schemes I work in simply don't have parsers:

    > pacman -Ssq tree-sitter
    tree-sitter
    tree-sitter-bash
    tree-sitter-c
    tree-sitter-cli
    tree-sitter-javascript
    tree-sitter-lua
    tree-sitter-markdown
    tree-sitter-python
    tree-sitter-query
    tree-sitter-rust
    tree-sitter-vim
    tree-sitter-vimdoc

Where's R, YAML, Golang, and several others?

taeric•2w ago

Odd, yaml-ts-mode exists? Did they change how it gets its parser?

codethief•2w ago

Uhh… The fact that there's no Archlinux package for a given language doesn't imply there's no tree-sitter support (official or 3rd-party) for that language? See e.g. the very long list of languages on https://github.com/Goldziher/tree-sitter-language-pack , which does include R, YAML, Golang, and many more.

tetris11•2w ago

it does suggest lack of mature support though, and it breaks your configs if you want to rely on system libraries

woodruffw•2w ago

tree-sitter-yaml definitely exists[1]. Presumably nobody has packaged it for Arch yet; that seems like a thing you could contribute.

[1]: https://github.com/tree-sitter-grammars/tree-sitter-yaml

_ache_•2w ago

It's in the AUR (aur/tree-sitter-yaml), a community-driven repository of Arch Linux packages. Not yet official.

Since it comes from `tree-sitter-grammars/tree-sitter-yaml`, it may be quick to integrate into the official repo.

matthew-craig•2w ago

In my emacs configuration, I have the following parsers installed:

awk bash bibtex blueprint c c-sharp clojure cmake commonlisp cpp css dart dockerfile elixir glsl gleam go gomod heex html janet java javascript json julia kotlin latex lua magik make markdown nix nu org perl proto python r ruby rust scala sql surface toml tsx typescript typst verilog vhdl vue wast wat wgsl yaml

johanvts•2w ago

Go is here: https://github.com/tree-sitter/tree-sitter-go Try google, the others are probably out there as well.

jasonjmcghee•2w ago

Most of them are in the language pack (https://github.com/Goldziher/tree-sitter-language-pack)

For others, this is a suboptimal answer, but I’ve played with generating grammars with the latest LLMs and they are surprisingly good at doing this (in a few shots).

That being said, if you’re doing something more serious than syntax highlighting or shipping it in a product, you’ll want to spend more time on it.

zokier•2w ago

https://tree-sitter.github.io/tree-sitter/#parsers

https://github.com/tree-sitter/tree-sitter/wiki/List-of-pars...

FjordWarden•2w ago

This is like the difference between an orange and fruit juice. You can squeeze an orange to extract its juices, but that is not the only thing you can do with it, nor is it the only way to make fruit juice.

I use tree-sitter for developing a custom programming language. You still need an extra step to get from CST to AST, but the overall DevEx is much quicker than hand-rolling the parser.

danielvaughn•2w ago

Every time I get to sing Tree-sitter's praises, I take the opportunity to. I love it so much. I've tried a bunch of parser generators, and the TS approach is so simple and so good that I'll probably never use anything else. The iteration speed lets me get into a zen-like state where I just think about syntax design, and I don't sweat the technical bits.

vrighter•1w ago

Whenever I need to write a parser, and I don't need the absolute best performance, I reach for the lua LPeg library. Sometimes I even embed the lua engine just so I can use that and then implement the rest in the original language.

lowbloodsugar•2w ago

N00b question: Language parsers give me concrete information, like “com.foo.bar.Baz is defined here”. Does tree-sitter do that, or does it say “this file has a symbol declaration for Baz” and, elsewhere for that file, “there is a package statement for ‘com.foo.bar’”, and then I have to figure that out?

FjordWarden•2w ago

You have to figure this out for yourself in most cases. Tree-sitter does have a query language based on s-expressions, but it is more for questions like "give me all the nodes that are literals", and then you can, for example, render those in a single draw call. Tree-sitter has incremental parsing, and queries can be restricted to a certain byte range.

lioeters•2w ago

> extra step to get from CST to AST

Could you elaborate on what this involves? I'm also looking at using tree-sitter as a parser for a new language, possibly to support multiple syntaxes. I'm thinking of converting its parse trees to a common schema, that's the target language.

I guess I don't quite get the difference between a concrete and abstract syntax tree. Is it just that the former includes information that's irrelevant to the semantics of the language, like whitespace?

direwolf20•2w ago

That's correct.

FjordWarden•2w ago

TS returns a tree with nodes, and you walk the nodes with a visitor pattern. I've experimented with using tree-sitter queries for this, but so far I haven't found that to be easier. Every syntax will have its own CST, but it can target a general AST if you will. In the end they can both be represented as s-expressions, but you need rules to go from one flavour of syntax tree to the other.

AST is just CST minus range info and simplified/generalised lexical info (in most cases).

WilcoKruijer•2w ago

In this context you could say that CST -> AST is a normalization process. A CST might contain whitespace and comments, an AST almost certainly won't.

An example: in a CST `1 + 0x1 ` might be represented differently than `1 + 1`, but they could be equivalent in the AST. The same could be true for syntax sugar: `let [x,y] = arr;` and `let x = arr[0]; let y = arr[1];` could be the same after AST normalization.

You can see why having just the AST might not be enough for syntax highlighting.
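As a rough illustration of the normalization described above, here is a minimal Python sketch. The `CstNode` type, the node kinds, and the `to_ast` helper are all hypothetical and are not tree-sitter's actual data model; the point is only that trivia is dropped and literals are reduced to their values, so `1 + 0x1 ` and `1 + 1` end up as the same AST.

```python
from dataclasses import dataclass

# Hypothetical CST node: keeps the exact source text plus trivia such as
# whitespace. This is an illustration, not tree-sitter's real node type.
@dataclass(frozen=True)
class CstNode:
    kind: str            # e.g. "binary", "number", "operator", "whitespace"
    text: str = ""
    children: tuple = ()

TRIVIA = {"whitespace", "comment"}

def to_ast(node):
    """Normalize a CST node into a nested tuple that ignores surface details."""
    if node.kind == "number":
        # int(text, 0) accepts both decimal and 0x-prefixed literals.
        return ("num", int(node.text, 0))
    if node.kind == "operator":
        return ("op", node.text)
    kids = tuple(to_ast(c) for c in node.children if c.kind not in TRIVIA)
    return (node.kind, *kids)

def binary(*children):
    return CstNode("binary", children=children)

ws = CstNode("whitespace", " ")
plus = CstNode("operator", "+")
a = binary(CstNode("number", "1"), ws, plus, ws, CstNode("number", "0x1"), ws)
b = binary(CstNode("number", "1"), ws, plus, ws, CstNode("number", "1"))

print(a == b)                  # False: the CSTs differ
print(to_ast(a) == to_ast(b))  # True: the ASTs are identical
```

A syntax highlighter needs the first form (it must know where the `0x1` token sits in the text), while semantic analysis is usually simpler on the second.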

As a side project I've been working on a simple programming language, where I use tree-sitter for the CST, but first normalize it to an AST before I do semantic analysis such as verifying references.

mattnewport•2w ago

Yeah, you can even use tree-sitter to implement a language server, I've done this for a custom scripting language we use at work.

storystarling•2w ago

I've been using it for semantic chunking in RAG pipelines. Naive splitting is pretty rough for code, but tree-sitter lets you grab full functions or classes. It seems to give much better context quality and keeps token costs down since you aren't retrieving broken fragments.
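A minimal sketch of the idea, under a loud assumption: it uses Python's standard-library `ast` module as a stand-in for a tree-sitter parser, so it only handles Python source. The `chunk_by_definition` name and the sample input are made up for illustration; a tree-sitter version would walk function and class nodes in the same way.

```python
import ast

def chunk_by_definition(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            # This sketch does not handle decorators.
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

SAMPLE = '''import os

def first():
    return 1

class Second:
    def method(self):
        return 2
'''

chunks = chunk_by_definition(SAMPLE)
for chunk in chunks:
    print(chunk)
    print("---")
```

Each chunk is a whole definition, which is the property that naive fixed-size splitting loses.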

KlayLay•2w ago

Side note, but thanks for the note about not using AI to write your articles. I'm tired of looking for information online, finding an article that may answer it, and not being sure about the author's integrity (this is so rampant on Medium).

mediaman•2w ago

Yes - I've been thinking about why this is. I'm guessing part of it is that writing forces us to think. I often find when I write something that I haven't thought it out fully, and articulating it makes me see a logical failure in my thinking, and gives me the ability to work that out.

So when we just have AI write it, it means we've avoided the thinking part, and so the written article will be much less useful to the reader because there's no actual distillation of thought.

Using voice to article is a little better, and I do find that talking out a thought helps me see its problems, but writing it seems to do better.

There's also the problem that while it's easy to detect AI writing, it's hard to tell the difference between someone who thought it out by talking and had AI write it versus someone who did little thinking and still had AI write it. So as soon as you smell the whiff of AI writing, the reasonable expectation is that there's less distillation of thought.

munificent•2w ago

I think a big part of it is that we're trying to decide if a piece of text is worth spending the time and effort to read it.

If we know the text is hand-authored, then we have a signal that at least one person believed the content was important enough to put meaningful effort into creating it. That's a sign it might be worth reading.

If it's LLM-authored, then it might still be useful, or it might be complete garbage. It's hard to tell because we don't know if even the "author" was willing to invest anything into it.

ashton314•2w ago

This exactly. Last year I got handed a big ball of work slop. Someone asked me to review this big ol' design document and I had the hardest time parsing it. It sounded right, but none of the pieces actually fit together. When I confronted the PM who gave it to me and asked if it was AI generated, they replied that "there were parts of it that were human-generated"! -_-

Anyway, I wrote a little more about that here: https://lambdaland.org/posts/2025-08-04_artifical_inanity/

Intent matters a ton when reading or writing something.

briaoeuidhtns•2w ago

I think the big reason to put syntax highlighting in the language server is that you have more info. For example, you can highlight symbols imported from a different file in one color for integers and a different one for functions.

LoganDark•2w ago

You can enrich highlighting using information from the language server, can't you? I think JetBrains does this

mickeyp•2w ago

Tree-sitter is great. It powers Combobulate in Emacs. Structured editing and movement would not have been easily done without it.

ashton314•2w ago

Hey Mickey! Thanks for all the stuff you've made in the Emacs space. Thanks for commenting here. :)

mickeyp•2w ago

Thanks for the kind remarks, Ashton :)

Fiveplus•2w ago

>It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

Hmm, the strong reason could be latency and layout stability. Tree-sitter parses on the main thread (or a close worker) typically in sub-ms timeframes, ensuring that syntax coloring is synchronous with keystrokes. LSP semantic tokens are asynchronous by design. If you rely solely on LSP for highlighting, you introduce a flash of unstyled content or color-shifting artifacts every time you type, because the round-trip to the server (even a local one) and the subsequent re-tokenization takes longer than the frame budget.

The ideal hygiene could be something like -> tree-sitter provides the high-speed lexical coloring (keywords, punctuation, basic structure) instantly and LSP paints the semantic modifiers (interfaces vs classes, mutable vs const) asynchronously like 200ms later. Relying on LSP for the base layer makes the editor feel sluggish.
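The layering above can be sketched as two paints over the same spans. This is a simplified Python illustration; the span keys, style names, and `merge_layers` helper are assumptions, not any editor's real API.

```python
def merge_layers(lexical, semantic):
    """Overlay semantic styles on top of lexical ones, keyed by (start, end) span."""
    merged = dict(lexical)
    merged.update(semantic)  # semantic wins where both layers style a span
    return merged

# Hypothetical spans for the text: const x = foo
lexical = {(0, 5): "keyword", (6, 7): "identifier", (10, 13): "identifier"}

# First paint: only the fast, synchronous layer is available.
first_paint = merge_layers(lexical, {})

# Later, the (hypothetical) server response arrives and refines some spans.
semantic = {(6, 7): "variable.readonly", (10, 13): "function"}
second_paint = merge_layers(lexical, semantic)

print(first_paint)
print(second_paint)
```

Because the base layer is never removed, text is always styled; the semantic response only refines colors that are already on screen.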

mickeyp•2w ago

That's generally how it works in most editors that support both.

Tree-sitter has okay error correction, and that, along with speed (as you mentioned) and its flexible query language, makes it a winner both for quickly iterating on a working parser and, obviously, for integration into an actual editor.

Oh, and some LSPs use tree-sitter to parse.

Metasyntactic•2w ago

> Hmm, the strong reason could be latency and layout stability. Tree-sitter parses on the main thread (or a close worker) typically in sub-ms timeframes

One of the designers/architects of 'Roslyn' here, the semantic analysis engine that powers the C#/VB compilers, VS IDE experiences, and our LSP server.

Note: For roslyn, we aim for microsecond (not millisecond) parsing. Even for very large files, even if the initial parse is milliseconds, we have an incremental parser design (https://github.com/dotnet/roslyn/blob/main/docs/compilers/De...) that makes 99.99+% of edits happen in microseconds, while reusing 99.99+% of syntax nodes, while also producing an independent, immutable tree (thus ensuring no threading concerns sharing these trees out to concurrent consumers).

> you introduce a flash of unstyled content or color-shifting artifacts every time you type, because the round-trip to the server (even a local one) and the subsequent re-tokenization takes longer than the frame budget.

This would indicate a serious problem somewhere.

It's also no different than any sort of modern UI stack. A modern UI stack would never want external code coming in that could ever block it. So all, potentially unbounded, processing work will be happening off the UI thread, ensuring that that thread is always responsive.

Note that "because the round-trip to the server (even a local one)" is no different from round-tripping to a processing thread. Indeed, in Visual Studio that is how it works as we have no need to run our server in a separate process space. Instead, the LSP server itself for roslyn simply runs in-process in VS as a normal library. No different than any other component that might have previously been doing this work.

> Relying on LSP for the base layer makes the editor feel sluggish.

It really should not. Note: this does take some amount of smart work. For example, in roslyn's classification systems we have a cascading set of classifying threads. One that classifies lexically, one for syntax, one for semantics, and finally, one for embedded languages (imagine embedded regex/json, or even C# nested in c#). And, of course, these embedded languages have cascading classification as well :D

Note that this concept is used in other places in LSP as well. For example, our diagnostics server computes compiler-syntax, vs compiler-semantics, versus 3rd-party analyzers, separately.

The approach of all of this has several benefits. First, we can scale up with the capabilities of the machine. So if there are free cores, we can put them to work computing less relevant data concurrently. Second, as results are computed on some operation, it can be displayed to the user without having to wait for the rest to finish. Being fine-grained means the UI can appear crisp and responsive, while potentially slower operations take longer but eventually appear.

For example, compiler syntax diagnostics generally take microseconds. While 3rd-party analyzer diagnostics might take seconds. No point in stalling the former while waiting for the latter to run. LSP makes multi-plexing this stuff easy

teo_zero•2w ago

> For roslyn, we aim for microsecond (not millisecond) parsing. Even for very large files, even if the initial parse is milliseconds, we have an incremental parser design [] that makes 99.99+% of edits happen in microseconds

I'm curious how you can make such statements involving absolute time values, without specifying what the minimum hardware requirements are.

I often write code on a 10-year-old Celeron, and I've opted for tree-sitter on the assumption that a language server would show unbearable latency, but I might have been wrong all this time. Do you claim your engine would give me sub-ms feedback on such hardware?

Metasyntactic•2w ago

> I'm curious how you can make such statements involving absolute time values, without specifying what the minimum hardware requirements are.

That's a very fair point. In this case, I'm using the minimum requirements for Visual Studio.

> Do you claim your engine would give me sub-ms feedback on such hardware?

I would expect yes, for nearly all edits. See the links I've provided in this discussion to our incremental parsing architecture.

Briefly, you can expect an edit to only cause a small handful of allocations. And the parser will be able to reuse almost the entirety of the other tree, skipping over vast swaths of it (before and after the edit) trivially.

Say you have 100 types, each with 100 members, each with 100 statements. An edit to a statement will trivially blow through 99 of the types, reusing them. Then in the type surrounding the edited statement, it will reuse 99 members. Then in the edited member, it will reuse 99 statements and just reparse the one affected one.

So basically it's just the computer walking 297 nodes (absolutely cheap on any machine), and reparsing a statement (also cheap).

So this should still be microseconds.
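The arithmetic above can be checked with a tiny Python sketch. It only models the counting argument (one edited leaf, every off-path sibling reused whole); it says nothing about how Roslyn is actually implemented, and `reuse_counts` is a made-up helper name.

```python
import math

def reuse_counts(fanouts):
    """For a tree with the given fan-out per level and a single edited leaf,
    return (siblings reused whole, leaves reparsed, total leaves)."""
    # At each level, every sibling off the path to the edit is reused as-is.
    reused_whole = sum(f - 1 for f in fanouts)
    total_leaves = math.prod(fanouts)
    return reused_whole, 1, total_leaves

reused, reparsed, total = reuse_counts([100, 100, 100])
print(reused, reparsed, total)  # 297 1 1000000
```

So out of a million statements, one is reparsed, and the work of finding the reusable subtrees grows with the sum of the fan-outs rather than their product.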

Now. That relates to parsing. But you did say: would give me sub-ms feedback on such hardware?

So it depends on what you mean by feedback. I don't make any claims here about layers higher up and how they operate. But I can make real, measured, claims about incremental parsing performance.

mellery451•2w ago

one topic not mentioned is creating refactoring tools. My sense is that LSPs generally have the advantage here because they have the full parsed tree, but I suspect it would be possible to build simple syntactic refactorings in TS with the potential to be both faster and less sensitive to broken syntax.

vivzkestrel•2w ago

- as a guy who is absolutely not familiar with the idea of how code editors work and has to make a browser based code editor, what are the things that you think I should know?

- i got a hint of language server and tree sitter thanks to this wonderfully written post but it is still missing a lot of details like what the protocol actually looks like, and what a standard language server or tree sitter implementation looks like

- what are the other building blocks?

ferguess_k•2w ago

I don't know why you got downvoted. This article doesn't provide much detail. I'd expect at least a series of posts for the comparison.

Let me be blunt: any article posted here should provide more information, or more in-depth analysis, than Wikipedia. Since I'm not a compiler person, I might be too harsh in suggesting that the article provides less in-depth analysis than the Wikipedia article (it is definitely shorter); I apologize if that's the case.

feznyng•2w ago

LSPs rely on a parser to generate an AST for a given language. This parser needs to be error-tolerant because it needs to return usable ASTs despite often parsing incomplete, incorrect code and fast enough to run on every keystroke so it can provide realtime feedback.

Most of the time they rely on their own hand-rolled recursive descent parser. Writing these isn't necessarily hard but time-consuming and tedious especially if you're parsing a large language like C++.

Parser generators like yacc, bison, chumsky, ANTLR etc. can generate a parser for you given a grammar. However these parsers usually don't have the best performance or error reporting characteristics because they are auto-generated. A recursive descent parser is usually faster and because you can customize syntax error messages, easier for an LSP to use to provide good diagnostics.
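To make "error-tolerant recursive descent" concrete, here is a minimal Python sketch for a toy grammar, `NUMBER ('+' NUMBER)*`. The grammar, the node shapes, and the recovery rules are assumptions chosen for brevity, not how any particular language server does it. The key property is that the parser never raises: it records a diagnostic, inserts a placeholder node, and keeps going.

```python
import re

def tokenize(text):
    # Numbers, '+', or any other single non-space character (kept so it can be reported).
    return re.findall(r"\d+|\+|\S", text)

def parse_sum(tokens):
    """Parse `NUMBER ('+' NUMBER)*`, returning (tree, errors) and never raising."""
    pos = 0
    errors = []

    def term():
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isdigit():
            pos += 1
            return ("num", int(tokens[pos - 1]))
        # Recovery: record the problem and insert a placeholder node without
        # consuming input, so the caller can keep going.
        errors.append(f"expected number at token {pos}")
        return ("error",)

    tree = term()
    while pos < len(tokens):
        if tokens[pos] == "+":
            pos += 1
            tree = ("add", tree, term())
        else:
            # Recovery: skip a token that cannot continue the expression.
            errors.append(f"unexpected {tokens[pos]!r} at token {pos}")
            pos += 1
    return tree, errors

print(parse_sum(tokenize("1 + 2")))
print(parse_sum(tokenize("1 + + 3")))
```

For the broken input `1 + + 3` the result still contains both valid numbers around an `("error",)` node, which is what lets an editor keep offering features while the user is mid-keystroke.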

Tree-sitter is also a parser generator but has better error tolerance properties (not quite as good as hand-written but generally better than prior implementations). Additionally, it's incremental, meaning it can reuse prior parses to more efficiently create a new AST. Most hand-written parsers are not incremental but are usually still fast enough to be usable in LSPs.

To use tree-sitter you define a grammar in JavaScript that tree-sitter will use to generate a parser in C, which you can then use as a dynamic or static library in your application.

In your case, this is useful because you can compile down those C libraries to WASM which can run right in the browser and will usually be faster than pure JS (the one catch is serialization overhead between JS and WASM). The problem is that you still need to implement all the language analysis features on top.

A good overview of different parsing techniques: https://tratt.net/laurie/blog/2020/which_parsing_approach.ht...

LSP spec: https://microsoft.github.io/language-server-protocol/overvie...

VSCode's guide on LSP features: https://code.visualstudio.com/api/language-extensions/progra...

Tutorial on creating hand-rolled error-tolerant (but NOT incremental) recursive descent parsers: https://matklad.github.io/2023/05/21/resilient-ll-parsing-tu...

Tree-sitter book: https://tree-sitter.github.io/tree-sitter/

thramp•2w ago

(Hi, I’m on the rust-analyzer team, but I’ve been less active for reasons that are clear in my bio.)

> Language servers are powerful because they can hook into the language’s runtime and compiler toolchain to get semantically correct answers to user queries. For example, suppose you have two versions of a pop function, one imported from a stack library, and another from a heap library. If you use a tool like the dumb-jump package in Emacs and you use it to jump to the definition for a call to pop, it might get confused as to where to go because it’s not sure what module is in scope at the point. A language server, on the other hand, should have access to this information and would not get confused.

You are correct that a language server will generally provide correct navigation/autocomplete, but a language server doesn’t necessarily need to hook into an existing compiler: a language server might be a latency-sensitive re-implementation of an existing compiler toolchain (rust-analyzer is the one I’m most familiar with, but the recent crop of new language servers tend to take this direction if the language’s compiler isn’t query-oriented).

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

Since I spend a lot of time writing Rust, I’ll use Rust as an example: you can highlight a binding if it’s mutable or style an enum/struct differently. It’s one of those small things that makes a big impact once you get used to it: editors without semantic syntax highlighting (as it is called in the LSP specification) feel like they’re naked to me.

ashton314•2w ago

> you can highlight a binding if it’s mutable or style an enum/struct differently

Wow! That is an incredibly good reason. Thank you very much for telling me something I didn’t know. :)

UPDATE: I've added a paragraph talking about the ability of rust-analyzer. Thank you again!

cfiggers•2w ago

Another pretty common application is to color unused bindings with a slightly faded-out color. So for e.g. with the TypeScript LSP, up at the top of the file you can instantly tell what imports are redundant because they're colored differently.
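A rough sketch of the analysis behind that kind of fading, with loud assumptions: it uses Python's standard-library `ast` module on Python source rather than the TypeScript language service, and it is purely name-based (no scopes, no `__all__`, no star imports). A real language server does this with full binding information.

```python
import ast

def unused_imports(source: str) -> list[str]:
    """Names bound by import statements that are never read anywhere in the module."""
    tree = ast.parse(source)
    imported = []
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a as x` binds `x`.
                imported.append(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.add(node.id)
    return [name for name in imported if name not in used]

SAMPLE = "import os\nimport sys\nfrom json import dumps\nprint(sys.argv, dumps({}))\n"
print(unused_imports(SAMPLE))  # ['os']
```

An editor would then map each reported name back to its import span and apply the faded style there.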

interactivecode•2w ago

I love this in Xcode / Swift, where classes and structs get different colors depending on whether they are local or external (from a lib).

It's surprisingly useful to know if you’re working with an entity that you made.

kibwen•2w ago

For another example of semantics-aware highlighting for Rust, see Flowistry, which allows you to select an expression in order to highlight all the code that either influences or is influenced by that expression: https://github.com/willcrichton/flowistry

kstrauser•2w ago

Whoa, that's slick! I wish it were available in Zed. Maybe someday!

kibwen•2w ago

Flowistry publishes their underlying rustc plugin as a crate, so all the analysis is already done for you, you'd just need to integrate the output with your editor of choice.

Timon3•2w ago

Thanks for sharing this project! That's a really neat idea and would help me a lot with understanding code written by others. It's unfortunate that it's only available for Rust, but it makes sense that the language design really lends itself to this.

Looking at this, I noticed how long it's been since I saw a new IDE feature that really made me more productive at understanding code. The last I can really remember was parameter inlay hints. It's a bummer - both the Jetbrains IDEs and VS Code seem to only focus on AI features I don't want, to the detriment of everything else.

k__•2w ago

I think it's funny that some languages, like TypeScript, use a different programming language to improve their compile times.

Then there are languages like Rust who are like, whelp, we already use the fastest language, but compilation is still slow, so they have to resort to solutions like the rust-analyzer.

dwattttt•2w ago

> they have to resort to solutions like the rust-analyzer.

It's not really a bad thing. IDEs want results ASAP, so a solution should focus on latency; query based compilers can compile just enough of the source to get the answer to a specific query, so they're a good answer.

Compiling a binary means compiling everything though, so "compiling just the smallest amount of source for a query" isn't specifically a goal, instead you want to optimise for throughput; stuff like batching is a win there.

These aren't language specific improvements, they're recognition that the two tasks are related, but have different goals.

imtringued•2w ago

Eclipse has its own Java compiler just for the purpose of IDE integration. rust-analyzer is a very lightweight solution.

jbreckmckye•2w ago

I'm doing a project with tree sitter right now

Any tips for keeping the grammar sizes under control? I'm distributing a CLI tool that needs to support several languages, and I can see the grammars gradually bloating the binary size

I could build some clever thing where language packs are opt-in and distributed as WASM, maybe. But that could be complex

williamcotton•2w ago

Tree-sitter does incremental parsing which will speed up the Language Server having to otherwise re-parse an entire file.

groundzeros2015•2w ago

Turbo Pascal, Borland C++, etc. did have these features and were somehow much faster.

avtar•2w ago

I thought that the blog's domain looked familiar. The author maintains an awesome and well maintained Emacs starter kit https://codeberg.org/ashton314/emacs-bedrock

ashton314•2w ago

Hey! Thanks for the kind words. :) I'm working on some updates for the upcoming release of Emacs 31. Should be good!

DanRosenwasser•2w ago

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this. The language server can be a more complicated program and so could surface particularly detailed information about the syntax; it might also be slower than tree-sitter.

We (TypeScript) used to do this for Visual Studio prior to tmLanguage. It was nice because we didn't have to write a second parser. Our parser was already error-tolerant and incremental, and syntax highlighting just involved descending into the syntax tree's tokens. So there was no room for divergence bugs in parsers, and there was also no need to figure out how to encode oddities and ambiguity-breaking logic in limited formats like tmLanguage.

This all predated TSServer (which predated LSP, though that's coming in TypeScript 7). The latency for syntax highlighting over JSON was too much, and other editors often didn't make syntax highlighting available outside of tmLanguage anyway. Eventually semantic highlighting became a thing, which is more latency-tolerant, and overlays colors on top of a syntactic highlighter in VS Code.

The other issue with this approach was that we still needed a dedicated thread just for fast syntax highlighting. That thread was a separate instance of the JS language service without anything shared, so that was a decent amount of memory overhead just for syntax highlighting.

ashton314•2w ago

Hey there! Thanks for reading my article, and thanks for sharing something cool about TS!

------

Real quick: I'm a PhD student and I'm looking for an internship this summer. I've done a little work with gradual typing—working with TypeScript would be super cool! I don't see a good way to contact you on your profile, hence this reply; if you've got an opening for an intern on your team, I would be very interested in applying. My email is on my blog.

Thanks again for reading my post!

ZeWaka•2w ago

Yep, we do the same in-LSP-highlighting in a game programming language I help develop. We have a lot of really niche rules, and nobody's gotten around to making a full separate tree-sitter spec for our small language.

geophph•2w ago

Funny timing. I’ve been building an LSP for a niche DSL I use at work. I’ve been using tree-sitter to build out the AST of sorts for the LSP functions, but just yesterday it dawned on me that the syntax highlighting my LSP does is all just TS queries and encoding them properly for the protocol. So I was looking into if that can be done in the vscode extension that provides the LSP hookup instead. Kinda nice that the same tree-sitter grammar can be used across the extension and LSP, even tho they’re in different languages.
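For readers wondering what "encoding them properly for the protocol" involves: LSP semantic tokens are sent as a flat integer array, five integers per token (line delta, start-character delta, length, token type index, modifier bitset), each position relative to the previous token. A minimal Python sketch follows; the input tuple shape and the legend order are assumptions, and modifiers are always left as 0 here.

```python
def encode_semantic_tokens(tokens, token_types):
    """Encode (line, start_char, length, type_name) tuples as the flat, relative
    integer array used by LSP semantic tokens. Modifiers are left as 0 here."""
    data = []
    prev_line = prev_start = 0
    for line, start, length, type_name in sorted(tokens):
        delta_line = line - prev_line
        # The start column is relative only when the token is on the same line
        # as the previous one; otherwise it is the absolute column.
        delta_start = start - prev_start if delta_line == 0 else start
        data += [delta_line, delta_start, length, token_types.index(type_name), 0]
        prev_line, prev_start = line, start
    return data

LEGEND = ["keyword", "variable", "function"]  # hypothetical legend order
tokens = [(0, 0, 3, "keyword"), (0, 4, 1, "variable"), (2, 2, 5, "function")]
print(encode_semantic_tokens(tokens, LEGEND))
# [0, 0, 3, 0, 0, 0, 4, 1, 1, 0, 2, 2, 5, 2, 0]
```

The tuples themselves could come from tree-sitter query captures; only this last encoding step is specific to the protocol.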

thepancake•2w ago

Damn the end of your blog post really touched me. I'm high AF, but I think this is still valid. Thanks for the lovely post!

Conscat•2w ago

I find neither LSP nor Tree Sitter sufficient for syntax highlighting, but I am extremely satisfied by my special combination of _both_ in Emacs. I love how easy it is to write Tree Sitter queries for very special patterns, like highlighting namespace declarations differently from scope resolution, or highlighting inline assembly differently from normal strings.

But I really want the semantic highlighting from a language server, such as highlighting constants or macros specially, and Emacs (among some other editors) makes it trivial to blend the strengths of both together.

Metasyntactic•2w ago

C# Language Designer here, and one of the designers/architects of 'Roslyn', the semantic analysis engine that powers the C#/VB compilers, VS IDE experiences, and our LSP server.

The original post conflates some concepts worth separating. LSP and language servers operate at an IDE/Editor feature level, whereas tree-sitter is a particular technological choice for parsing text and producing a syntax tree. They serve different purposes but can work together.

What does a language server actually do? LSP defines features like:

  1. Finding references (`textDocument/references`)

  2. Go-to-definition (`textDocument/definition`)

  3. Syntax highlighting (`textDocument/semanticTokens/...`)

  4. Code completion, diagnostics, refactorings

A language server for language X could use tree-sitter internally to implement these features. But it can use whatever technologies it wants. LSP is protocol-level; tree-sitter is an implementation detail.
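To show what "protocol-level" means in practice, here is a minimal Python sketch that builds one framed LSP request for `textDocument/definition`. LSP messages are JSON-RPC 2.0 bodies preceded by a `Content-Length` header and a blank line; the file URI, position, and `frame_request` helper below are made up for illustration.

```python
import json

def frame_request(request_id, method, params):
    """Build one LSP message: a Content-Length header, a blank line, then JSON-RPC."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    # Content-Length counts bytes of the body, not characters.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body

message = frame_request(
    1,
    "textDocument/definition",
    {
        "textDocument": {"uri": "file:///tmp/example.cs"},  # hypothetical file
        "position": {"line": 10, "character": 4},           # zero-based
    },
)
print(message.decode("utf-8"))
```

Nothing in the message says how the server finds the definition; whether it uses tree-sitter, a compiler front end, or something else is invisible to the client.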

The article talks about tree-sitter avoiding the problem of "maintaining two parsers" (one for the compiler, one for the editor). This misunderstands how production compiler/IDE systems actually work. In Roslyn, we don't have two parsers. We have one parser that powers both the compiler and the IDE. Same code, same behavior, same error recovery. This works better, not worse. You want your IDE to understand code exactly the way the compiler does, not approximately.

The article highlights tree-sitter being "error-tolerant" and "incremental" as key advantages. These are real concerns. If you're starting from scratch with no existing language infrastructure, tree-sitter's error tolerance is valuable. But this isn't unique to tree-sitter. Production compiler parsers are already extremely error-tolerant because they have to be. People are typing invalid code 99% of the time in an editor.

Roslyn was designed from day one for IDE scenarios. We do incremental parsing (https://github.com/dotnet/roslyn/blob/main/docs/compilers/De...), but more importantly, we do incremental semantic analysis. When you change a file, we recompute semantic information for just the parts that changed, not the entire project. Tree-sitter gives you incremental parsing. That's good. But if you want rich IDE features, you need incremental semantics too.

The article suggests language servers are inherently "heavy" while tree-sitter is "lightweight." This isn't quite right. An LSP server is as heavy or light as you make it. If all you need is parsing and there's no existing language library, fine, use tree-sitter and build a minimal LSP server on top. But if you want to do more, LSP is designed for that. The protocol supports everything from basic syntax highlighting to complex refactorings.

Now, as to syntax highlighting. Despite the name, it isn't just syntactic in modern IDEs. In C#, we call this "classification," and it's powered by the full semantic model. A reference to a symbol is classified by what that symbol is: local, parameter, field, property, class, struct, type parameter, method, etc. Symbol attributes affect presentation. Static members are italicized, unused variables are faded, overwritten values are underlined. We classify based on runtime behavior: `async` methods, `const` fields, extension methods.

This requires deep semantic understanding. Binding symbols, resolving types, understanding scope and lifetime. Tree-sitter gives you a parse tree. That's it. It's excellent at what it does, but it's fundamentally a syntactic tool.

Example: in C#, `var x = GetValue();` is syntactically ambiguous. Is `var` a keyword or a type name? Only semantic analysis can tell you definitively. Tree-sitter would have to guess or mark it generically.
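
To make the `var` case concrete, here is a minimal sketch in Python (not Roslyn code, and not how Roslyn is implemented). The `Scope` class and `classify_var` function are hypothetical names invented for illustration. The only fact it relies on is that C# treats `var` as a contextual keyword: if a type named `var` is in scope, the declaration binds to that type instead.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """Hypothetical scope: the type names declared at this nesting level."""
    type_names: set = field(default_factory=set)
    parent: Optional[Scope] = None

    def resolves_type(self, name: str) -> bool:
        # Walk outward through enclosing scopes, as a binder would.
        scope: Optional[Scope] = self
        while scope is not None:
            if name in scope.type_names:
                return True
            scope = scope.parent
        return False


def classify_var(scope: Scope) -> str:
    """Classify the token `var` in a declaration like `var x = GetValue();`."""
    # The token sequence is identical in both cases; only the set of
    # visible types (semantic information) decides the answer.
    return "type name" if scope.resolves_type("var") else "keyword"


ordinary_file = Scope()
file_with_var_class = Scope(parent=Scope(type_names={"var"}))
print(classify_var(ordinary_file))        # keyword
print(classify_var(file_with_var_class))  # type name
```

A purely syntactic highlighter sees the same tokens in both files, so it has to pick one answer for both.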

Tree-sitter is definitely a great technology though. Want to add basic syntax highlighting for a new language to your editor? Tree-sitter makes this trivial. Need structural editing or code folding? Perfect use case. However, for rich IDE experiences, the kind where clicking on a variable highlights all its uses, or where hovering shows documentation, or where renaming a method updates all call sites across your codebase, you need semantic analysis. That's a fundamentally different problem than parsing.

Tree-sitter definitely lowers the barrier to supporting new languages in editors. But it's not a replacement for language servers or semantic analysis engines. They're complementary technologies. For languages with mature compilers and semantic engines (C#, TypeScript, Rust, etc.), using the real compiler infrastructure for IDE features makes sense. For cases with simpler tooling needs, tree-sitter is an excellent foundation to build on.

conartist6•2w ago

Why did Typescript abandon Roslyn's red green tree design?

Metasyntactic•2w ago

I wrote several of typescript's initial compilers. We didn't use red/green for a few reasons:

• The js engines of the time were not efficient with that design. This was primarily testing v8 and chakra (IE/edge's prior engine).

• Red/green takes advantage of many things .net provides to be extremely efficient. For example structs. These are absent in js, making things much more costly. See the document on red-green trees I wrote here for more detail: https://github.com/dotnet/roslyn/blob/main/docs/compilers/De...

• The problem domains are a bit different. In Roslyn the design is a highly concurrent, multi-threaded feature set that wants to share immutable data. TS/JS, being single-threaded, doesn't have the same concerns, so there is less need to efficiently create an immutable data structure. Keeping it mutable meant working well with the engines of the time without sacrificing too much.

• The ts parser is incremental, and operates very similarly to what I describe for Roslyn in https://github.com/dotnet/roslyn/blob/main/docs/compilers/De.... However, because it operates on the equivalent of a red tree, it does need to do extra work to update positions and parent pointers.

Tldr, different engine performance and different consumption patterns pushed us to a different model.
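
For readers who haven't seen the red/green split, here is a rough sketch of the idea in Python. It is an illustration under simplifying assumptions, not Roslyn's or TypeScript's actual code: `GreenNode`, `RedNode`, and `replace_child` are made-up names, and widths are recomputed on demand here where a real implementation would cache them. Green nodes are immutable and know only their width, so they can be shared between tree versions. Red nodes are thin wrappers created on demand that add a parent pointer and an absolute position.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class GreenNode:
    """Immutable node: no parent pointer, no absolute position."""
    kind: str
    text: str = ""                      # only leaf nodes carry text
    children: tuple = ()                # tuple of GreenNode

    @property
    def width(self) -> int:
        if self.children:
            return sum(child.width for child in self.children)
        return len(self.text)


class RedNode:
    """Wrapper built lazily; adds parent and absolute position."""

    def __init__(self, green: GreenNode,
                 parent: Optional[RedNode] = None, position: int = 0):
        self.green, self.parent, self.position = green, parent, position

    def children(self) -> Iterator[RedNode]:
        offset = self.position
        for green_child in self.green.children:
            yield RedNode(green_child, self, offset)
            offset += green_child.width


def replace_child(node: GreenNode, index: int, new_child: GreenNode) -> GreenNode:
    """Return a new parent; untouched siblings are reused, not copied."""
    kids = node.children[:index] + (new_child,) + node.children[index + 1:]
    return GreenNode(node.kind, node.text, kids)


# `a+b` edited to `abc+b`: only the first leaf and the root are new.
old_root = GreenNode("binary", children=(
    GreenNode("id", "a"), GreenNode("op", "+"), GreenNode("id", "b")))
new_root = replace_child(old_root, 0, GreenNode("id", "abc"))

print(new_root.children[2] is old_root.children[2])        # True
print([c.position for c in RedNode(new_root).children()])  # [0, 3, 4]
```

Because positions live only in the red layer, the shared `b` leaf needs no update after the edit. A parser working directly on a mutable, red-style tree has to fix up positions and parent pointers itself, which is the extra work mentioned in the last bullet.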

conartist6•2w ago

I ask because I picked up where the TS and Roslyn teams left off. I actually brought red green trees into JS.

My finding is that the historical reasons against this no longer seem to apply today. With a monomorphic code style, JS gets close enough to structs. Multithreading is now essential for perf.

I don't even think multithreading is the strongest argument for immutability, because it's not only parallelization that immutability unlocks but also safe concurrency, and/or the ability to give trusted data to an untrusted plugin without risking corruption.

sakesun•2w ago

Notice that some design docs have been added since the beginning of this year.

torginus•2w ago

I just want to add that treesitter is a heuristic, incremental parser.

The difference between regular parsers and treesitter is that regular parsers start eating tokens from the start of the file and try to assemble an AST from that. The AST is built from the top down.

Treesitter works differently: it grabs tokens from an arbitrary point and assembles them into AST nodes, then tries to extend the AST until the whole file is parsed.

This method supports incremental edits (as you can throw away the AST for the modified part, and try to re-parse), but the problem is that most languages are designed to be unambiguous when parsed left to right, and parsing them like this might involve some retries and guesswork.

Also, unlike modern languages such as Go, which is designed to be parseable without any semantic analysis, a lot of older languages don't have this property; notably, C/C++ needs a symbol table. In this case, treesitter has to guess, and it might guess wrong.

As for what you can and can't do with an AST: you can tell if something is a function call, a variable reference, or any other piece of syntax, but if you write something like x = 2; then tree-sitter has no idea what x is. Is it a float or an int? Is it a local, a class variable, or a global? You can tell this with a symbol table, which the compiler uses to resolve symbols, but treesitter can't do this for you.
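
The same split between syntax and symbols can be seen with Python's own standard library, used here only as an analogy (this is not tree-sitter, and the sample source is invented). The `ast` module yields the syntax tree, which says that `counter` and `total` are assignment targets. The `symtable` module, which exposes the compiler's symbol tables, is what says whether each name is a local, a parameter, or a global.

```python
import ast
import symtable

SOURCE = """
counter = 0

def bump(step):
    global counter
    total = step + 1
    counter = total
"""

# Syntax only: which names appear as assignment targets?
tree = ast.parse(SOURCE)
assigned = sorted({
    node.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
})
print(assigned)  # ['counter', 'total']

# Semantics: ask the compiler's symbol table what each name refers to.
module_table = symtable.symtable(SOURCE, "<example>", "exec")
bump_table = next(t for t in module_table.get_children()
                  if t.get_name() == "bump")

print(bump_table.lookup("counter").is_global())   # True
print(bump_table.lookup("total").is_local())      # True
print(bump_table.lookup("step").is_parameter())   # True
```

The two assignments inside `bump` look identical in the tree; only the symbol table distinguishes the global from the local.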

_kb•2w ago

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

This is an area where tree-sitter (TS) excels. It also supports nesting of different languages, so a query can inject other languages [0] and compose different parsers.

As an example, this can be as straightforward as a simple comment parser [1], jsdoc [2], regex [3], etc., or in more complex cases various DSLs. Each of these can then define its own injections too. When working with CI pipelines in particular, it transforms an opaque wall of YAML into a slightly more manageable CST, which is incredibly useful for both humans (syntax highlighting) and any machine parsing you may want to do.

[0]: https://tree-sitter.github.io/tree-sitter/3-syntax-highlight...

[1]: https://github.com/stsewd/tree-sitter-comment

[2]: https://github.com/tree-sitter/tree-sitter-jsdoc

[3]: https://github.com/tree-sitter/tree-sitter-regex
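
As a toy illustration of the injection idea only (this is plain Python, not tree-sitter's query language or API), the sketch below has a host "parser" hand the value of certain keys to a second parser. The `INJECTIONS` table, the `PARSERS` registry, the key names `run` and `pattern`, and the line-based matching are all assumptions made up for the example; real injections are declared in query files and operate on syntax nodes.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical embedded-language "parsers"; here they are only tokenizers.
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "shell": lambda text: text.split(),
    "regex": lambda text: re.findall(r"\\.|\[[^\]]*\]|.", text),
}

# Hypothetical injection rules: host key -> language of its value.
INJECTIONS = {"run": "shell", "pattern": "regex"}

KEY_VALUE = re.compile(r"\s*(?:-\s*)?(\w+):\s*(.+)$")


def parse_with_injections(yaml_text: str) -> List[Tuple[str, List[str]]]:
    """Scan `key: value` lines and re-parse values that have an injection rule."""
    regions = []
    for line in yaml_text.splitlines():
        match = KEY_VALUE.match(line)
        if match is None:
            continue
        key, value = match.groups()
        language = INJECTIONS.get(key)
        if language is not None:
            # Delegate the embedded text to the other language's parser.
            regions.append((language, PARSERS[language](value)))
    return regions


PIPELINE = """\
steps:
  - run: make test
  - name: build
    pattern: a[bc]+
"""
for language, tokens in parse_with_injections(PIPELINE):
    print(language, tokens)
# shell ['make', 'test']
# regex ['a', '[bc]', '+']
```

The point is the composition: the host grammar does not need to understand shell or regex, it only needs a rule saying where they start and end.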

the_jizzler•1w ago

It's slightly embarrassing the author proudly declaims "no LLMs were abused" when in fact an LLM gives a much more useful answer by focusing on the syntactic versus semantic distinction. The author essentially parrots back the "what" as opposed to the "why."
