TOON – Token Oriented Object Notation

https://github.com/johannschopplich/toon
178•royosherove•3mo ago

Comments

anonymoushn•3mo ago
Hello, it's probably better to add leading spaces before all of the words rather than none of them
meander_water•3mo ago
I don't get it, can't you just use YAML instead of inventing another DSL?
mhosayny•3mo ago
It's more compact than YAML. More like a combination of YAML and CSV.
jscheel•3mo ago
For repeating objects of the same structure, yaml will still require each key on each object, whereas this is a hybrid with csv, so it defines the keys once.
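For example, for a small users table, YAML repeats the keys on every item:

    users:
      - id: 1
        name: Alice
        role: admin
      - id: 2
        name: Bob
        role: user
while TOON declares them once and then lists rows:

    users[2]{id,name,role}:
      1,Alice,admin
      2,Bob,user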
3cats-in-a-coat•3mo ago
No one forces us to use objects in JSON with repeated keys you know.
makapuf•3mo ago
Indeed a

    {"header": ["some","column","names"], "values": [[1,2,3],[4,5,6],...]}
could fit.
jscheel•3mo ago
For sure, but most people aren't thinking intentionally about what they are dumping into their context either ;)
inopinatus•3mo ago
Norway.
dragonwriter•3mo ago
YAML 1.2 has been out for 16 years now, so I would simply not assume that the suggestion to use YAML for a new purpose means “use YAML 1.1”.
inopinatus•3mo ago
I could agree that you would not make poor assumptions.

Your LLM, however, may experience cross-format feature superposition and consequential spurious activation.

flyer23•3mo ago
It is, also no one uses it :)
Too•3mo ago
This TOON is bound to have the same problem, because strings are not quoted. You can’t differentiate between the number 123 and the string “123”.

For LLM consumption this might not matter, but don’t use this for anything else.

vessenes•3mo ago
I’ll be interested to see benchmarks. My expectation is that accuracy will take a hit on mid or longer context prompts: I’d bet that the heavy use of JSON in fine tuning will end up impacting quality of a more terse (less reasoning space) novel encoding.

That said: I like the idea!

brian-bk•3mo ago
There are some very light benchmarks in the Readme, or are you looking for more?
Mumps•3mo ago
Do you mean the [0] Token Benchmarks section? I only see token count numbers.

Which doesn't address the question: do LLMs understand TOON the same as they would JSON? It's quite likely that most LLMs don't interpret this notation the same way they would JSON. So benchmarks on, say, data-processing tasks would be warranted.

[0] https://github.com/johannschopplich/toon?tab=readme-ov-file#...

tujux•3mo ago
I think they're talking about these sections:

1. Retrieval Accuracy - https://github.com/johannschopplich/toon?tab=readme-ov-file#...

2. Performance by dataset - https://github.com/johannschopplich/toon?tab=readme-ov-file#...

saretup•3mo ago
I would assume the next iterations/fine-tuned variants of current models would reach similar accuracy for TOON as they do for JSON.

The current models unfortunately do not have TOON in their training set, so they would probably require additional input tokens to grok the notation, and even then probably won’t have the same accuracy as they do for JSON.

mattcollins•3mo ago
FWIW, I ran a test comparing LLM accuracy with TOON versus JSON, CSV and a variety of other formats when using them to represent tabular data: https://www.improvingagents.com/blog/is-toon-good-for-table-...

I've only looked at one model (gpt-4.1-nano) so far. I'm hoping to run similar tests on some other models but it gets challenging to discern statistically significant differences with better models as their accuracy tends to be a lot better across the board.

mattcollins•3mo ago
Results from some further tests here: https://www.improvingagents.com/blog/toon-benchmarks
moralestapia•3mo ago
[flagged]
jayd16•3mo ago
I'm not sure which one would win, but it's a bit telling that compression isn't mentioned at all.

I guess it's about LLMs, so the idea is it has to be plaintext? But if you can train it on TOON, can't you train it on BSON?

inopinatus•3mo ago
JSON unmarshalling often has to consider separately whether an attribute is absent, false, zero, null, or the empty string, but this was never quite semantically ambiguous enough for my tastes, so adding that void-ish values may also now be serialised as a tuple of length [0] seems to me an excellent additional obfuscation.
joshribakoff•3mo ago
The use case here is to reduce token usage with LLMs, such as an agent that outputs a list of commands, e.g. tuples with files to write and their new contents.

Supporting this use case doesn’t require perfectly marshaling every data structure ever.

But to your point the tool could have wider use cases without the limitations.

inopinatus•3mo ago
If one trains a model to understand it then that model will inevitably emit it, which means in turn one shall have to parse it, and now the application supports TOON for anything, and good luck telling the users/customers any different.
ziofill•3mo ago
What if there’s a simple converter back to json after the model output? Is that possible?
anonymoushn•3mo ago
Arrays of length 0 also exist in json?
andrus•3mo ago
Yes, this is valid JSON: []
Pxtl•3mo ago
I'm sorry I don't see this adding value over various other formats. I don't really want a new object serialization format, I just want the existing ones to have the features I need. YAML but with static typing and schema. XML but without crazy internet features. TOML but with an object format that doesn't hurt my brain. JSON but with decent multiline strings and comments. NestedText but with a sub-standard that provides static-typing and schema and whatnot.
foxglacier•3mo ago
The benchmarks show it performs better than them, so that's the value - cost savings and improved accuracy. I suppose you could convert JSON to TOON just for the LLM and not actually read it with your own brain.
verdverm•3mo ago
https://cuelang.org | https://cuetorials.com

CUE can emit the other formats (minus XML, because it's a beast of ambiguity, but there are other tools for json->xml, for instance)

It also has modules and imports, a very underrated feature for config languages if you haven't experienced it before
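A minimal sketch of the users example in CUE (file and package names made up); `cue export` emits it as JSON, or YAML with `--out yaml`:

    // users.cue
    package example

    users: [
        {id: 1, name: "Alice", role: "admin"},
        {id: 2, name: "Bob", role: "user"}
    ]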

tptacek•3mo ago
This isn't really an interchange format so much as something you'd JIT compile down to when handing things off to an LLM, right?
furyofantares•3mo ago
And on the way out of the LLM. Token savings are nice on the way out too, and then also I have to imagine it's better for the LLM to see one format in all of its context instead of two.

It seems like a nice idea to me if restricted to that. Although I guess I am not sure if it's really intended that way - the array count for example is probably pretty bad for LLM output.

tptacek•3mo ago
I feel like on the output side you might be working against LLM training? But I don't know.
rs186•3mo ago
I don't think you even need to care about this as a format. It could exist only during communication and encoded/decoded by middleware, and everything still works.
hedgehog•3mo ago
It would be interesting to compare this to BAML and TOML.
toobulkeh•3mo ago
Definitely is a core feature of BAML. My main complaint with BAML is that it’s all or nothing. It’s very opinionated and we can’t get the benefits without the DX and vice versa. Separating this feature without requiring a DSL of model definition is a great add.
hedgehog•3mo ago
TOML has some readability and compactness benefits over JSON while still being both common enough for models to easily be able to process it relatively reliably and widely supported in most languages. I suspect BAML still performs better but likewise due to the tooling work I haven't integrated it.
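For the users example it comes out as an array of tables, something like:

    [[users]]
    id = 1
    name = "Alice"
    role = "admin"

    [[users]]
    id = 2
    name = "Bob"
    role = "user"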
3cats-in-a-coat•3mo ago
I'll say the obvious. A lot of this you can just do in JSON.

Let's take the example:

    {
      "users": [
        { "id": 1, "name": "Alice", "role": "admin" },
        { "id": 2, "name": "Bob", "role": "user" }
      ]
    }

    users[2]{id,name,role}:
      1,Alice,admin
      2,Bob,user
We can keep it JSON, but use more compact list expressions, as tuples when pragmatic:

    ["users",
       [1, "Alice", "admin"],
       [2, "Bob", "user"]
    ]
The thing is the game with LLMs is not what's shortest, but what's:

1. Mainstream, so they understand it.

2. What they're tuned for, and they're tuned for what's mainstream (JSON).

If you want to go extreme compression you can shove it all in JSON strings too and keep the larger structure JSON:

    ["users",
       "1:admin:Alice",
       "2:user:Bob",
    ]
You may say "how is this better". Well it's better because it's still JSON, there's less to explain to the LLM, and to your other devs. Even if we use a weird compact format like "id:role:name" this is still shorter to explain than a completely different syntax with its whole world of rules.
rc1•3mo ago
In fairness to TOON, the alternative JSON you’re giving doesn’t include hints on structure.

Not sure LLMs are more “tuned” to JSON.

That said, your general point holds that TOON may be unnecessary, especially in the examples given. But perhaps plain text would suffice. TOON could be useful when automating inputs with many different shapes.

copypaper•3mo ago
Yea exactly. The LLMs are tuned to natural language. I don't think anything will beat good ol' templating (a.k.a. plain text). In Go I do something like this:

  // mytemplate.tmpl
  Description="The following data is for the users in our application."
  Format="id,name,role"
  length=2
  Data:
  {{range .}}
  {{.ID}}, {{.Name}}, {{.Role}}
  {{end}}
This way you're able to change the formatting to something the LLM understands for each struct. The LLM might understand some structs better as JSON, others as YAML, and others in an arbitrary format. Templating gives you the most flexibility to choose which one will work best.
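To be concrete, a minimal sketch of rendering that template with Go's text/template (struct and variable names here are just illustrative):

    package main

    import (
        "os"
        "text/template"
    )

    // User mirrors the fields referenced in the template above.
    type User struct {
        ID   int
        Name string
        Role string
    }

    func main() {
        users := []User{
            {ID: 1, Name: "Alice", Role: "admin"},
            {ID: 2, Name: "Bob", Role: "user"},
        }
        // Parse mytemplate.tmpl and render it with the slice as the data,
        // writing the prompt text to stdout.
        tmpl := template.Must(template.ParseFiles("mytemplate.tmpl"))
        if err := tmpl.Execute(os.Stdout, users); err != nil {
            panic(err)
        }
    }
Each {{range .}} iteration emits one id/name/role line, so swapping the template is all it takes to try a different serialization for a given struct.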
mentalgear•3mo ago
Neat. I did a similar thing with CSV (instead of JSON) a year back. Great that there are measurements, but I think the really interesting measure would have it run against the actual "Structured Output Format" endpoints of LLM providers, e.g. those fine-tuned to return valid JSON.
andreygrehov•3mo ago
I don’t know what I’m talking about (pure fantasy), but what if you train a model on compressed data and then perform inference on compressed data as well? Could this work? With the output also being compressed and then decompressed by the client?
Loranubi•3mo ago
Since all input is run through a tokenizer, I would expect the tokenizer space doesn't change a lot between one trained on uncompressed vs one trained on compressed data.
WorldMaker•3mo ago
The tokenizer is already a form of (somewhat lossy) compression of a string of plaintext to a stream of token identifiers. You can reason about Tokenizers/"embedding spaces" as a sort of massive "Dictionary Table/Dictionary Function" like you might use in a zip/gzip stream.

Starting with already compressed data doesn't necessarily mean fewer tokens, you can probably assume similar entropy (or probably worse entropy) in expanding "Dictionary words" in a compressed stream versus "tokens" from a plaintext stream.

s1mon•3mo ago
Obligatory XKCD: https://xkcd.com/927/
chuckadams•3mo ago
indentation-based sounds pretty brittle for a serialization format. I imagine a tabular format that factors out repeating keys could be expressed fairly compactly in json itself.
metalliqaz•3mo ago
What is the font used on that README image?
drewlesueur•3mo ago
Looks like one of the variations of Iosevka. https://github.com/be5invis/Iosevka
metalliqaz•3mo ago
Well done, sir!
awaseem•3mo ago
This is awesome, I saw it on twitter and gave it a star
rs186•3mo ago
I wonder how many tokens will be saved compared to real JSON if we use a special version where property names don't require quotes, like in JavaScript.
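For the users example that'd be something like

    {users: [{id: 1, name: "Alice", role: "admin"}, {id: 2, name: "Bob", role: "user"}]}
versus the fully quoted

    {"users": [{"id": 1, "name": "Alice", "role": "admin"}, {"id": 2, "name": "Bob", "role": "user"}]}
so only the quotes around property names differ.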
neilv•3mo ago
If you instead put parentheses around the lexical sequences, then you wouldn't need syntax like `[3]` to denote length.

You also wouldn't need indentation levels to be syntactically meaningful.

You could also get rid of LLM tokens like square brackets, curly braces, colons, and commas.

And you could have objects nested to arbitrary depth.

In nearly the same character count as TOON (sometimes more, sometimes less).
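One possible rendering of the users example (keeping the field names once, like TOON does):

    (users (id name role)
      (1 Alice admin)
      (2 Bob user))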

(I was telling someone over the weekend that there are only a few small wins for Lisps in most AI work right now. I hadn't considered that the printed syntax itself might have a use with these LLM huge black boxes.)

hdjfjkremmr•3mo ago
have you tried it? models struggle keeping track of opening/closing braces, which is exactly why xml/csv (or toon) tends to work better than json
neilv•3mo ago
What is the reason that the current LLMs work better with XML than with JSON? Is it the names in the element tags?
AvAn12•3mo ago
Similar to YAML? Also do consider ancient formats like fixed width - in which case you don’t even need delimiter characters. Are LLMs clever enough to parse these if given a code book or old-school INPUT statement? Cheers
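For the users example, fixed width would be something like:

    id name  role
    1  Alice admin
    2  Bob   user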
pshirshov•3mo ago
I have this: https://github.com/7mind/sick , a binary deduplicating storage for JSON-like data structures with efficient direct access (no parsing required). It's even more efficient in terms of space and access speed (but not manually editable).
viggity•3mo ago
Since the whole point of this is to limit LLM token consumption, it'd be interesting to see the results of prompts that use it.

I've seen a ton of people who just paste a CSV into a prompt and expect it to work well because they don't know any better, but the results are typically hot garbage. It's too repetitive, it can't memorize and/or process such a big chunk of data. Asking an LLM to use pandas to iteratively analyze some CSV works great, though.

yohbho•3mo ago
LLMs read

> users[2]{id,name,role}: 1,Alice,admin 2,Bob,user

differently than me, I guess. I would read that as "at index value of two, i.e. the third element of an array, the values 1aliceadmin and 2bobuser are stored; or not, since we want to destructure these values and a pair value of a tuple of three is given", and would be confused and think: wtf is that, dear user, did you omit or misformat values?
