EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

https://esolang-bench.vercel.app/

30•matt_d•1h ago

Comments

deklesen•1h ago

Hmm... my hunch is that part of this is tokenization: I assume every Python keyword is a single token, whereas for those very weird languages the tokenization might make it harder to reason over the program.

Would love to see how the benchmark results change if the esoteric languages are tweaked so that they have only 1-token keywords.

chychiu•1h ago

Considering that Brainfuck only has 8 characters and models are scoring at 6.2%, I don't think tokenization is the issue.
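For context on how small the language is, here is a minimal interpreter sketch covering all 8 commands. The function name `run_brainfuck`, the 30,000-cell tape, 8-bit wrapping cells, and "EOF reads as 0" are assumptions chosen for illustration (implementations differ on these), not anything specified by the benchmark.

```python
def run_brainfuck(program: str, input_data: str = "", tape_len: int = 30000) -> str:
    """Interpret a Brainfuck program and return everything it prints."""
    # Pre-compute matching bracket positions so loops can jump directly.
    jumps = {}
    stack = []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ]")
            start = stack.pop()
            jumps[start] = pos
            jumps[pos] = start
    if stack:
        raise ValueError("unmatched [")

    tape = [0] * tape_len      # assumed tape size
    ptr = 0                    # data pointer
    pc = 0                     # program counter
    inp = iter(input_data)
    out = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256   # assumed 8-bit wrapping cells
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # assumed: EOF reads as 0
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]     # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]     # jump back to the loop start
        pc += 1                # any other character is a comment
    return "".join(out)


# 8 * 8 = 64, plus 1 = 65, which is ASCII "A"
print(run_brainfuck("++++++++[>++++++++<-]>+."))
```

The interpreter is short, but writing programs for it means tracking the tape state by hand through every loop, which is a different kind of difficulty from tokenization.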

altruios•54m ago

*the only issue.

Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but not the core issue if your brain is not wired correctly to reason.

I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.

__alexs•1h ago

I had hoped we might finally be ushering in a bold new era of programming in Malbolge, but apparently that was too optimistic.

bwestergard•1h ago

I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.

Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.

But the model that did the best, Qwen-235B, got virtually every problem wrong.

__alexs•1h ago

They are also weirdly bad at Brainfuck, which is basically just a subset of C.
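The "subset of C" reading can be made concrete with the commonly cited one-to-one mapping from each Brainfuck command to a C statement. A minimal sketch of a translator, where `BF_TO_C` and `brainfuck_to_c` are hypothetical names and the 30,000-byte `unsigned char` tape is an assumed convention:

```python
# Each Brainfuck command corresponds to one C statement on a byte pointer `p`.
BF_TO_C = {
    ">": "++p;",
    "<": "--p;",
    "+": "++*p;",
    "-": "--*p;",
    ".": "putchar(*p);",
    ",": "*p = getchar();",   # note: C's EOF (-1) is stored as 255 here
    "[": "while (*p) {",
    "]": "}",
}


def brainfuck_to_c(program: str) -> str:
    """Translate a Brainfuck program into equivalent C source text."""
    lines = [
        "#include <stdio.h>",
        "",
        "int main(void) {",
        "    static unsigned char tape[30000];  /* assumed tape size */",
        "    unsigned char *p = tape;",
    ]
    depth = 1
    for ch in program:
        stmt = BF_TO_C.get(ch)
        if stmt is None:
            continue              # non-command characters are comments
        if ch == "]":
            depth -= 1            # closing brace aligns with its `while`
        lines.append("    " * depth + stmt)
        if ch == "[":
            depth += 1            # indent the loop body
    lines += ["    return 0;", "}"]
    return "\n".join(lines)


print(brainfuck_to_c(",[.,]"))
```

The sketch does not check that brackets are balanced, so unbalanced input yields C that will not compile.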

simianwords•1h ago

I bet I can do better by allowing this: the LLM can pull documentation of the language from the web to understand how it works.

If the LLM has “skills” for that language, it will definitely increase accuracy.

orthoxerox•16m ago

> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.

iloveoof•6m ago

Try MUMPS: widely used, but with little training data online, probably less than some esolangs.

wavemode•4m ago

> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.
