EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

https://esolang-bench.vercel.app/

37•matt_d•2h ago

Comments

deklesen•1h ago

Hmm... my hunch is that part of this is tokenization: I assume all Python keywords are 1 token each. For those very weird languages, tokenization might make it harder to reason over the tokens.

Would love to see how the benchmark results change if the esoteric languages are changed a bit so that they only have 1-token keywords.

chychiu•1h ago

Considering that Brainfuck only has 8 characters and models are scoring 6.2%, I don't think tokenization is the issue.
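
For reference, the whole eight-command language fits in a short interpreter. This is a minimal sketch in Python, assuming the common conventions (30,000 byte cells that wrap at 256, and a read at end of input stores 0). The name `run_bf` and its parameters are made up for illustration and do not come from the benchmark.

```python
def run_bf(program: str, stdin: str = "", tape_len: int = 30000) -> str:
    """Interpret a Brainfuck program and return everything it printed."""
    # Pre-compute matching bracket positions so loops can jump directly.
    jumps, stack = {}, []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ]")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise ValueError("unmatched [")

    tape = [0] * tape_len   # assumption: fixed-size tape of byte cells
    ptr = pc = read = 0     # data pointer, program counter, input cursor
    out = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1        # note: the pointer is not bounds-checked here
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            # assumption: end of input is stored as 0
            tape[ptr] = ord(stdin[read]) % 256 if read < len(stdin) else 0
            read += 1
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # go back to the start of the loop
        # Any other character is treated as a comment and skipped.
        pc += 1
    return "".join(out)


# 8 * 8 = 64, plus 1 = 65, which is the character code for "A".
print(run_bf("++++++++[>++++++++<-]>+."))
```

The interpreter is small, but writing programs for it still means tracking cell contents and pointer position by hand, which is a separate difficulty from how the source text is tokenized.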

altruios•1h ago

The only issue. *

Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but that's not the core issue if your brain is not wired correctly to reason.

I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.

__alexs•1h ago

I had hoped we might finally be ushering in a bold new era of programming in Malbolge, but apparently that was too optimistic.

bwestergard•1h ago

I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.

Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.

But the model that did the best, Qwen-235B, got virtually every problem wrong.

__alexs•1h ago

They are also weirdly bad at Brainfuck which is basically just a subset of C.
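
The "subset of C" reading can be made concrete: each of the eight commands corresponds to one short C statement. Below is a rough sketch of a translator, written in Python, that emits C source. The names `BF_TO_C` and `bf_to_c` are hypothetical, and the 30,000-cell `unsigned char` tape is an assumed convention, as is leaving `getchar()`'s end-of-file value to be truncated into the cell.

```python
# One C statement per Brainfuck command (the standard informal mapping).
BF_TO_C = {
    ">": "++p;",
    "<": "--p;",
    "+": "++*p;",
    "-": "--*p;",
    ".": "putchar(*p);",
    ",": "*p = getchar();",
    "[": "while (*p) {",
    "]": "}",
}


def bf_to_c(program: str) -> str:
    """Translate a Brainfuck program into equivalent C source text."""
    lines = [
        "#include <stdio.h>",
        "",
        "int main(void) {",
        "    static unsigned char tape[30000];  /* assumed tape size */",
        "    unsigned char *p = tape;",
    ]
    depth = 1  # current indentation level inside main()
    for ch in program:
        stmt = BF_TO_C.get(ch)
        if stmt is None:
            continue  # non-command characters are comments
        if ch == "]":
            depth -= 1
            if depth < 1:
                raise ValueError("unmatched ]")
        lines.append("    " * depth + stmt)
        if ch == "[":
            depth += 1
    if depth != 1:
        raise ValueError("unmatched [")
    lines += ["    return 0;", "}"]
    return "\n".join(lines) + "\n"


# "+[-]." sets a cell to 1, loops it back down to 0, then prints it.
print(bf_to_c("+[-]."))
```

The mapping only covers syntax. The generated C has no variables, functions, or arithmetic beyond increment and decrement, so the hard part of writing Brainfuck stays just as hard.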

simianwords•1h ago

I bet I can do better by allowing this: the LLM can pull documentation of the language from the web to understand how it works.

If the LLM has “skills” for that language, it will definitely increase accuracy.

orthoxerox•31m ago

> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.

iloveoof•21m ago

Try MUMPS: widely used, but with little training data online. Probably less than some esolangs.

wavemode•19m ago

> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.


