My health challenges limit how much I can work. I've come to think of Claude Code as an accommodation engine — not in the medical-paperwork sense, but in the literal one: it gives me the capacity to finish things that a normal work environment doesn't. Observatory was built in eight days because that kind of collaboration became possible for me. (I even used Claude Code to write this post, but I'm only posting what resonates with me.) Two companion posts: on the recursive methodology (https://blog.unratified.org/2026-03-03-recursive-methodology...) and what 806 evaluated stories reveal (https://blog.unratified.org/2026-03-03-what-806-stories-reve...).
The observation that shaped the design: rights violations rarely announce themselves. An article about a company's "privacy-first approach" might appear on a site running twelve trackers. The interesting signal isn't whether an article mentions privacy — it's whether the site's infrastructure matches its words.
Each evaluation runs two parallel channels. The editorial channel scores what the content says about rights: which provisions it touches, direction, evidence strength. The structural channel scores what the site infrastructure does: tracking, paywalls, accessibility, authorship disclosure, funding transparency. The divergence — SETL (Structural-Editorial Tension Level) — is often the most revealing number. "Says one thing, does another," quantified.
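In toy form, the two-channel split looks something like this. The field names and the absolute-difference divergence below are illustrative only, not the production SETL formula:

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    editorial: float   # what the content says about rights, in [-1, 1]
    structural: float  # what the site infrastructure does, in [-1, 1]

def divergence(ev: Evaluation) -> float:
    # Toy tension measure: absolute gap between the two channels.
    # The real SETL computation differs; this just shows the idea.
    return abs(ev.editorial - ev.structural)

# Hypothetical story: positive editorial framing, negative infrastructure.
print(divergence(Evaluation(editorial=0.5, structural=-0.4)))  # → 0.9
```

A high divergence flags exactly the "says one thing, does another" pattern, regardless of whether either channel is individually extreme.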
Every evaluation separates observable facts from interpretive conclusions (the Fair Witness layer, same concept as fairwitness.bot — https://news.ycombinator.com/item?id=44030394). You get a facts-to-inferences ratio and can read exactly what evidence the model cited. If a score looks wrong, follow the chain and tell me where the inference fails.
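The facts-to-inferences ratio reduces to a simple count over tagged claims. This is a toy schema for illustration; the real Fair Witness layer also stores the cited evidence alongside each inference:

```python
def facts_to_inferences_ratio(claims):
    """claims: list of (text, kind) pairs, kind in {"fact", "inference"}.
    Hypothetical schema, shown only to make the ratio concrete."""
    facts = sum(1 for _, kind in claims if kind == "fact")
    inferences = sum(1 for _, kind in claims if kind == "inference")
    return facts / inferences if inferences else float("inf")

# Hypothetical evidence chain for a single story.
chain = [
    ("Page loads 12 third-party trackers", "fact"),
    ("Article claims a privacy-first approach", "fact"),
    ("Site behavior contradicts its stated stance", "inference"),
]
print(facts_to_inferences_ratio(chain))  # → 2.0
```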
Per our evaluations across 805 stories: only 65% identify their author, meaning roughly one in three HN front-page stories has no named author. 18% disclose conflicts of interest. 44% assume expert knowledge (a structural note on Article 26). Tech coverage runs nearly 10× more retrospective than prospective: past harm is documented extensively; prevention is discussed rarely.
One story illustrates SETL best: "Half of Americans now believe that news organizations deliberately mislead them" (fortune.com, 652 HN points). Editorial: +0.30. Structural: −0.63 (paywall, tracking, no funding disclosure). SETL: 0.84. A story about why people don't trust media, from an outlet whose own infrastructure demonstrates the pattern.
The structural channel for free Llama models is noisy — 86% of scores cluster on two integers. The direction I'm exploring: TQ (Transparency Quotient) — binary, countable indicators that don't need LLM interpretation (author named? sources cited? funding disclosed?). Code is open source: https://github.com/safety-quotient-lab/observatory — the .claude/ directory has the cognitive architecture behind the build.
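The TQ idea in toy form — the indicator names below are my shorthand for the examples in the post, and the real indicator list and any weighting are still being designed:

```python
def transparency_quotient(page: dict) -> float:
    """Fraction of binary transparency indicators satisfied.
    Indicator set is illustrative; no LLM interpretation needed."""
    indicators = [
        bool(page.get("author_named")),      # is an author identified?
        bool(page.get("sources_cited")),     # are sources cited?
        bool(page.get("funding_disclosed")), # is funding disclosed?
    ]
    return sum(indicators) / len(indicators)

tq = transparency_quotient(
    {"author_named": True, "sources_cited": True, "funding_disclosed": False}
)
print(round(tq, 2))  # → 0.67
```

Because each indicator is a yes/no check, TQ sidesteps the integer-clustering noise entirely: two evaluators counting the same page should get the same number.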
Find a story whose score looks wrong, open the detail page, follow the evidence chain. The most useful feedback: where the chain reaches a defensible conclusion from defensible evidence and still gets the normative call wrong. That's the failure mode I haven't solved. My background is math and psychology (undergrad), a decade in software — enough to build this, not enough to be confident the methodology is sound. Expertise in psychometrics, NLP, or human rights scholarship especially welcome. Methodology, prompts, and a 15-story calibration set are on the About page.
Thanks!
9wzYQbTYsAIc•1h ago
The corpus is HN front-page stories only — self-selected, tech-heavy, English-language. The aggregate patterns (authorship rates, expert-knowledge assumptions, retrospective framing) describe this specific feed, not journalism or the web broadly. HN skews in ways that make some results unsurprising (high jargon density) and others more interesting (still only 18% conflict-of-interest disclosure in a technically sophisticated audience).
The structural channel for the free Llama models (Llama 4 Scout, Llama 3.3 70B on Workers AI) is genuinely noisy — 86% of their structural scores land on two integers. I say so in the post, but it's worth repeating: those model scores should be weighted accordingly. Claude Haiku 4.5 full evaluations are more reliable; the calibration set and baselines are on the About page.
The methodology itself is preliminary. I haven't done external validation — no convergent validity test against an independent rights-focused coding scheme, no discriminant validity test against plain sentiment. Phase 0 of the roadmap is construct validity work. If anyone has experience in psychometrics or NLP evaluation and wants to poke at this, I'd genuinely welcome it.