My health challenges limit how much I can work. I've come to think of Claude Code as an accommodation engine — not in the medical-paperwork sense, but in the literal one: it gives me the capacity to finish things that a normal work environment doesn't. Observatory was built in eight days because that kind of collaboration became possible for me. (I even used Claude Code to write this post, but I'm only posting what resonates with me.) Two companion posts: on the recursive methodology (https://blog.unratified.org/2026-03-03-recursive-methodology...) and what 806 evaluated stories reveal (https://blog.unratified.org/2026-03-03-what-806-stories-reve...).
The observation that shaped the design: rights violations rarely announce themselves. An article about a company's "privacy-first approach" might appear on a site running twelve trackers. The interesting signal isn't whether an article mentions privacy — it's whether the site's infrastructure matches its words.
Each evaluation runs two parallel channels. The editorial channel scores what the content says about rights: which provisions it touches, direction, evidence strength. The structural channel scores what the site infrastructure does: tracking, paywalls, accessibility, authorship disclosure, funding transparency. The divergence — SETL (Structural-Editorial Tension Level) — is often the most revealing number. "Says one thing, does another," quantified.
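In toy form, the two-channel split looks something like this. The field names and the absolute-difference divergence below are illustrative only, not the production SETL formula:

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    editorial: float   # what the content says about rights, in [-1, 1]
    structural: float  # what the site infrastructure does, in [-1, 1]

def divergence(ev: Evaluation) -> float:
    # Toy tension measure: absolute gap between the two channels.
    # The real SETL computation differs; this just shows the idea.
    return abs(ev.editorial - ev.structural)

# Hypothetical story: positive editorial framing, negative infrastructure.
print(divergence(Evaluation(editorial=0.5, structural=-0.4)))  # → 0.9
```

A high divergence flags exactly the "says one thing, does another" pattern, regardless of whether either channel is individually extreme.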
Every evaluation separates observable facts from interpretive conclusions (the Fair Witness layer, same concept as fairwitness.bot — https://news.ycombinator.com/item?id=44030394). You get a facts-to-inferences ratio and can read exactly what evidence the model cited. If a score looks wrong, follow the chain and tell me where the inference fails.
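The facts-to-inferences ratio reduces to a simple count over tagged claims. This is a toy schema for illustration; the real Fair Witness layer also stores the cited evidence alongside each inference:

```python
def facts_to_inferences_ratio(claims):
    """claims: list of (text, kind) pairs, kind in {"fact", "inference"}.
    Hypothetical schema, shown only to make the ratio concrete."""
    facts = sum(1 for _, kind in claims if kind == "fact")
    inferences = sum(1 for _, kind in claims if kind == "inference")
    return facts / inferences if inferences else float("inf")

# Hypothetical evidence chain for a single story.
chain = [
    ("Page loads 12 third-party trackers", "fact"),
    ("Article claims a privacy-first approach", "fact"),
    ("Site behavior contradicts its stated stance", "inference"),
]
print(facts_to_inferences_ratio(chain))  # → 2.0
```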
Per our evaluations across 805 stories: only 65% identify their author, meaning roughly one in three HN front-page stories has no named author. 18% disclose conflicts of interest. 44% assume expert knowledge (a structural note on Article 26). Tech coverage runs nearly 10× more retrospective than prospective: past harm is documented extensively; prevention is discussed rarely.
One story illustrates SETL best: "Half of Americans now believe that news organizations deliberately mislead them" (fortune.com, 652 HN points). Editorial: +0.30. Structural: −0.63 (paywall, tracking, no funding disclosure). SETL: 0.84. A story about why people don't trust media, from an outlet whose own infrastructure demonstrates the pattern.
The structural channel for free Llama models is noisy — 86% of scores cluster on two integers. The direction I'm exploring: TQ (Transparency Quotient) — binary, countable indicators that don't need LLM interpretation (author named? sources cited? funding disclosed?). Code is open source: https://github.com/safety-quotient-lab/observatory — the .claude/ directory has the cognitive architecture behind the build.
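The TQ idea in toy form — the indicator names below are my shorthand for the examples in the post, and the real indicator list and any weighting are still being designed:

```python
def transparency_quotient(page: dict) -> float:
    """Fraction of binary transparency indicators satisfied.
    Indicator set is illustrative; no LLM interpretation needed."""
    indicators = [
        bool(page.get("author_named")),      # is an author identified?
        bool(page.get("sources_cited")),     # are sources cited?
        bool(page.get("funding_disclosed")), # is funding disclosed?
    ]
    return sum(indicators) / len(indicators)

tq = transparency_quotient(
    {"author_named": True, "sources_cited": True, "funding_disclosed": False}
)
print(round(tq, 2))  # → 0.67
```

Because each indicator is a yes/no check, TQ sidesteps the integer-clustering noise entirely: two evaluators counting the same page should get the same number.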
Find a story whose score looks wrong, open the detail page, follow the evidence chain. The most useful feedback: where the chain reaches a defensible conclusion from defensible evidence and still gets the normative call wrong. That's the failure mode I haven't solved. My background is math and psychology (undergrad), a decade in software — enough to build this, not enough to be confident the methodology is sound. Expertise in psychometrics, NLP, or human rights scholarship especially welcome. Methodology, prompts, and a 15-story calibration set are on the About page.
Thanks!
9wzYQbTYsAIc•1h ago
The corpus is HN front-page stories only — self-selected, tech-heavy, English-language. The aggregate patterns (authorship rates, expert-knowledge assumptions, retrospective framing) describe this specific feed, not journalism or the web broadly. HN skews in ways that make some results unsurprising (high jargon density) and others more interesting (still only 18% conflict-of-interest disclosure in a technically sophisticated audience).
The structural channel for the free Llama models (Llama 4 Scout, Llama 3.3 70B on Workers AI) is genuinely noisy — 86% of their structural scores land on two integers. I say so in the post, but it's worth repeating: those model scores should be weighted accordingly. Claude Haiku 4.5 full evaluations are more reliable; the calibration set and baselines are on the About page.
The methodology itself is preliminary. I haven't done external validation — no convergent validity test against an independent rights-focused coding scheme, no discriminant validity test against plain sentiment. Phase 0 of the roadmap is construct validity work. If anyone has experience in psychometrics or NLP evaluation and wants to poke at this, I'd genuinely welcome it.