Hi HN, I built SDF (Structured Data Format), an open protocol that sits between web content and AI agents.
The problem: Every agent that consumes a web page independently fetches HTML, strips boilerplate, extracts entities, and classifies content. A typical page is ~89KB of HTML (~73K tokens). When 100 agents consume the same URL, this extraction happens 100 times with inconsistent results.
What SDF does: Convert a page once into a schema-validated JSON document (~750 tokens) containing entities, claims, relationships, summaries, and type-specific structured data. Agents then consume the pre-extracted representation directly.
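To make that concrete, here is roughly what a converted document could look like, written out as a Python dict. The field names are illustrative placeholders, not the actual schema, which is defined in the GitHub repo.

    # Illustrative SDF document as a Python dict; field names are
    # placeholders, not the real schema (see the JSON Schemas on GitHub).
    example_doc = {
        "sdf_version": "0.1",                       # assumed version field
        "source_url": "https://example.com/article",
        "content_type": "news_article",             # one of the 10 content types
        "summary": "One-paragraph gist of the page.",
        "entities": [
            {"name": "Example Corp", "type": "organization"},
            {"name": "Jane Doe", "type": "person"},
        ],
        "claims": [
            {"text": "Example Corp acquired Widget Inc. in 2024.", "confidence": 0.9},
        ],
        "relationships": [
            {"subject": "Example Corp", "predicate": "acquired", "object": "Widget Inc."},
        ],
        "structured": {                             # type-specific structured data
            "author": "Jane Doe",
            "published_at": "2024-06-01",
        },
    }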
Results from production deployment (2,335 documents, 10 content types):
99.2% token reduction from HTML
90% extraction accuracy with fine-tuned 1.5B + 3B model cascade
4.1x faster than monolithic 14B baseline
Downstream experiment: general-purpose 7B model scores 0.739 accuracy from SDF vs 0.352 from raw markdown (p < 0.05)
The pipeline runs locally on consumer hardware (dual RTX 3090 Ti). Fine-tuned models are open on HuggingFace (sdfprotocol/sdf-classify, sdfprotocol/sdf-extract). Protocol spec and JSON schemas are on GitHub.
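For a rough sense of how the cascade is wired, the sketch below runs the two HuggingFace models with the stock transformers pipeline. It assumes both are causal LM fine-tunes; the prompt formats and generation settings are simplified placeholders, not the pipeline's actual ones.

    # Two-stage cascade sketch: the 1.5B model classifies the content type,
    # the 3B model extracts the structured JSON. Prompts and generation
    # settings are placeholders; the real pipeline's formats may differ.
    from transformers import pipeline

    classify = pipeline("text-generation", model="sdfprotocol/sdf-classify")
    extract = pipeline("text-generation", model="sdfprotocol/sdf-extract")

    def page_to_sdf(page_text: str) -> str:
        # Stage 1: cheap classification into one of the content types.
        content_type = classify(
            f"Classify this page into a content type.\n\n{page_text}\n\nType:",
            max_new_tokens=8,
            return_full_text=False,
        )[0]["generated_text"].strip()

        # Stage 2: type-conditioned extraction of the SDF JSON document.
        return extract(
            f"Content type: {content_type}\n\nExtract the SDF JSON.\n\n{page_text}",
            max_new_tokens=1024,
            return_full_text=False,
        )[0]["generated_text"]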
I wonder if people will eventually surf ad-free by sniffing out these files. Easy to parse (maybe even easier than the actual article itself) and no ads or otherwise unrelated distractions.
spranab•1h ago
What do you mean? I just wanted to share something I'm working on. I'm trying to understand what you meant by ads.
ksaj•52m ago
Not your ads. I'm saying that if a site that has ads also has these files, you could get the gist of the article by reading these files instead of going to the ad-laden page itself.
spranab•51m ago
That's actually great; we can add ad detection and extract only the relevant information. Thanks @ksaj
ksaj•46m ago
That's a step further than I was thinking, but I most definitely like the direction.
spranab•1h ago
Protocol spec + schemas: https://github.com/sdfprotocol/sdf
Whitepaper: https://doi.org/10.5281/zenodo.18559223
Models: https://huggingface.co/sdfprotocol
Happy to answer questions about the design decisions, the type system, or the evaluation methodology.
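And on the consumer side, reading a document is only a few lines. The sketch below assumes a .sdf.json sidecar URL and a schema path in the repo purely for illustration; the spec defines the actual discovery mechanism and schema locations, and the field names match the placeholder example above.

    # Consumer-side sketch: fetch an SDF document, validate it against a
    # published JSON Schema, and use the pre-extracted fields directly.
    # The document URL and schema path below are illustrative assumptions.
    import requests
    from jsonschema import validate

    schema = requests.get(
        "https://raw.githubusercontent.com/sdfprotocol/sdf/main/schemas/sdf.schema.json"
    ).json()
    doc = requests.get("https://example.com/article.sdf.json").json()

    validate(instance=doc, schema=schema)  # raises ValidationError if malformed

    print(doc["summary"])
    for entity in doc.get("entities", []):
        print(entity["name"], "-", entity["type"])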