The problem: JSON wastes tokens. Curly braces, quotes, colons, commas - all eat into your context window.
ISON uses tabular patterns that LLMs already understand from training data:
JSON (87 tokens):

```json
{
  "users": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"}
  ]
}
```

ISON (34 tokens):

```
table.users
id:int name:string email
1 Alice alice@example.com
2 Bob bob@example.com
```
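To make the structure concrete, here is a minimal hand-rolled reader for the snippet above. This is a sketch based only on the example shown, not the ison-py API; it assumes whitespace-separated fields with no embedded spaces, and the real spec's quoting and escaping rules may differ.

```python
def parse_ison_table(text: str) -> dict:
    """Parse a single ISON table into {name: [row_dict, ...]}."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    name = lines[0].removeprefix("table.")          # "table.users" -> "users"
    headers, casts = [], []
    for col in lines[1].split():
        col_name, _, col_type = col.partition(":")  # "id:int" -> ("id", "int")
        headers.append(col_name)
        casts.append(int if col_type == "int" else str)
    rows = [
        {h: cast(v) for h, cast, v in zip(headers, casts, ln.split())}
        for ln in lines[2:]
    ]
    return {name: rows}

sample = """\
table.users
id:int name:string email
1 Alice alice@example.com
2 Bob bob@example.com
"""
print(parse_ison_table(sample))
# {'users': [{'id': 1, 'name': 'Alice', 'email': 'alice@example.com'},
#            {'id': 2, 'name': 'Bob', 'email': 'bob@example.com'}]}
```

The `:int` annotation is what lets a reader recover typed values that JSON would otherwise carry via literal syntax.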
Features:

- 30-70% token reduction
- Type annotations
- References between tables
- Schema validation (ISONantic)
- Streaming format (ISONL)
Implementations: Python, JavaScript, TypeScript, Rust, C++. 9 packages, 171+ tests passing.
```
pip install ison-py    # Parser
pip install isonantic  # Validation & schemas
```

```
npm install ison-parser   # JavaScript
npm install ison-ts       # TypeScript with full types
npm install isonantic-ts  # Validation & schemas
```

```toml
[dependencies]
ison-rs = "1.0"
isonantic-rs = "1.0"  # Validation & schemas
```
Looking for feedback on the format design.
dtagames•1mo ago
Any tokens you saved will be lost 3x over in that process, as well as introducing confusing new context information that's unrelated to your app.
maheshvaikri99•1mo ago
ISON isn't inventing new syntax. It's CSV/TSV with a header - which LLMs have seen billions of times. The table format:
```
table.users
id name email
1 Alice alice@example.com
```
...is structurally identical to markdown tables and CSVs that dominate training corpora.
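The CSV claim is easy to check: strip the `table.users` line and the body parses with Python's standard csv module as a space-delimited file (a sketch; real ISON may have its own escaping rules).

```python
import csv
import io

# Table body from above, minus the "table.users" name line.
body = "id name email\n1 Alice alice@example.com"

# DictReader treats the first row as the header, just like ISON does.
rows = list(csv.DictReader(io.StringIO(body), delimiter=" "))
print(rows)
# [{'id': '1', 'name': 'Alice', 'email': 'alice@example.com'}]
```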
On the "3x translation overhead" - ISON isn't meant for LLM-to-code interfaces where you need JSON for an API call. It's for context stuffing: RAG results, memory retrieval, multi-agent state passing.
If I'm injecting 50 user records into context for an LLM to reason over, I never convert back to JSON. The LLM reads ISON directly, reasons over it, and responds.
The benchmark: same data, same prompt, same task. ISON uses fewer tokens and gets equivalent accuracy. Happy to share the test cases if you want to verify.
dtagames•1mo ago
If your real data is in JSON (and in JS/TS apps it always is at runtime, since only JSON objects exist in that language), it makes no sense to ever convert it, period.
Besides, the corporate-report-style CSVs that appear in training materials don't have data shapes anything like JSON, or even like most business software. You're crippling an established and useful data carrier in order to save pennies on tokens. Tokens are getting cheaper, so it's the wrong optimization.
maheshvaikri99•1mo ago
ISON isn't meant to replace JSON in your application. Your JS/TS code still uses JSON objects internally. ISON is specifically for the LLM context window.
The flow: App (JSON) → serialize to ISON → inject into prompt → LLM reasons → response → your app
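The serialize-and-inject step in that flow can be sketched in a few lines. `to_ison` here is a hypothetical helper, not the ison-py API; it assumes flat records whose values contain no spaces, and it omits the `:int`-style type annotations for brevity.

```python
def to_ison(name: str, records: list[dict]) -> str:
    """Serialize a flat list of dicts into an ISON-style table (sketch)."""
    headers = list(records[0])
    lines = [f"table.{name}", " ".join(headers)]
    lines += [" ".join(str(r[h]) for h in headers) for r in records]
    return "\n".join(lines)

# In-app data stays as ordinary objects...
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# ...and only the prompt gets the compact tabular form.
prompt = (
    "Here are the user records:\n\n"
    f"{to_ison('users', users)}\n\n"
    "Which user has id 2?"
)
print(prompt)
```

The app never round-trips the model's answer back through ISON; the format exists only on the way into the context window.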
You're right that nesting is lost. But for LLM reasoning, flat structures often work better. LLMs struggle with deeply nested JSON - they lose track of parent-child relationships 4+ levels deep.
On "tokens are getting cheaper": True for API costs. But context windows are still limited. When you're stuffing RAG results, memory, agent state, and user history into 128K tokens, every byte matters. It's not about saving money - it's about fitting more context.
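A rough illustration of the fitting-more-context point, using character counts as a crude proxy for tokens (exact ratios depend on the tokenizer, but the direction is the same because JSON repeats every key per record):

```python
import json

# 50 synthetic flat records, the kind of thing you'd stuff into context.
users = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com"}
    for i in range(50)
]

as_json = json.dumps({"users": users})

# ISON-style table: keys appear once in the header instead of once per record.
header = "id name email"
rows = "\n".join(f'{u["id"]} {u["name"]} {u["email"]}' for u in users)
as_ison = f"table.users\n{header}\n{rows}"

print(len(as_json), len(as_ison))  # the ISON side is markedly smaller
```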
On "wrong optimization": I ran the benchmark. Same data, same task. ISON: 88.3% accuracy. JSON: 84.7%. The LLM actually performed better with the tabular format, not just "equivalent for fewer tokens."
## BENCHMARK STATS

```
TOKEN EFFICIENCY:
  ISON: 3,550 tokens
  JSON: 12,668 tokens

LLM ACCURACY (300 questions):
  ISON: 265/300 (88.3%)
  JSON: 254/300 (84.7%)

EFFICIENCY (accuracy per 1K tokens):
  ISON: 24.88
  JSON: 6.68
```

ISON is 272.3% more efficient than JSON by that measure.
But I hear you - if your data is deeply nested and that nesting carries semantic meaning the LLM needs, JSON might be the right choice. ISON works best for relational/tabular data going into context.