That inspired me — if AI can rewrite a whole language stack that fast, I wanted to try building a programming language from scratch with AI assistance.
I've also been noticing growing global interest in Korean language and culture, and I wondered: what would a programming language look like if every keyword was in Hangul (the Korean writing system)?
Han is the result. It's a statically-typed language written in Rust with a full compiler pipeline (lexer → parser → AST → interpreter + LLVM IR codegen).
It supports arrays, structs with impl blocks, closures, pattern matching, try/catch, file I/O, module imports, a REPL, and a basic LSP server.
This is a side project, not a "you should use this instead of Python" pitch. Feedback on language design, compiler architecture, or the Korean keyword choices is very welcome.
raaspazasu•1h ago
xodn348•1h ago
I actually tested this with GPT-4o's tokenizer, and the result was the opposite — Korean keywords average 2-3 tokens vs 1 for English. A fibonacci program in Han takes 88 tokens vs 54 in Python.
The reason comes down to how LLM tokenizers work. They use BPE (Byte Pair Encoding), which starts with raw bytes and repeatedly merges the most frequent pairs into single tokens. Since training data is predominantly English, words like `function` and `return` appear billions of times and get merged into single tokens.
Korean text appears far less frequently, so the tokenizer doesn't learn to merge Hangul syllables — it falls back to splitting each character into 2-3 byte-level tokens instead.
It's a tokenizer training bias, not a property of Hangul itself. If a tokenizer were trained on a Korean-heavy corpus, `함수` could absolutely become a single token too.
So no efficiency benefit today. But it was a fun exploration, and Korean speakers can read the code like natural language. It could also be a fun way for people learning Korean to practice reading Hangul in a different context — every keyword is a real Korean word with meaning.
topce•41m ago
I have similar idea to train LLM in Serbian, create even new encoding https://github.com/topce/YUTF-8 inspired by YUSCII. Did not have time and money ;-) Great that you succeed. Idea if train in Serbian text encoded in YUTF-8 (not UTF-8) it will have less token when prompt in Serbian then English, also Serbian Cyrillic characters are 1 byte in YUTF-8 instead of 2 in UTF.Serbian language is phonetic we never ask how you spell it.Have Latin and Cyrillic letters.
xodn348•34m ago
Even without retraining BPE from scratch, starting with YUTF-8 and measuring how existing tokenizers handle it would already be a worthwhile experiment.
Hope you find the time to build it, good luck!
ralferoo•20m ago
xodn348•12m ago
For Korean readers the character systems look quite different, but I can see how it's hard to parse visually without familiarity.
As you said, syntax highlighting helps a lot — there's a colored screenshot at the top of the README showing how it looks in practice.
bbrodriguez•35m ago
The main benefit of Korean actually comes from the fact that the language itself fits perfectly into a standard 27 alphabet keys and laid out in such a way that lets you type ridiculously fast. The consonant letters are always situated in the left half and the vowels are in the right half of the keyboard. This means it is extremely easy to train muscle memory because you’re mostly alternating keystrokes on your left hand and right hand.
Anecdotally I feel like when I’m typing in English, each half of my brain needs to coordinate more compared to when I’m typing in Korean, the right brain only need to remember the consonant positions for my left hand and my left brain only need to remember the vowel positions.
xodn348•28m ago
What you talked is mostly right and I did not know about typing in Korean, the left-hand side and right-hand side. Btw, Consonant(Jaeum) and vowel(Moeum).
In experience-wise, what you had would be precise.