Peer Review at Scale: What Happened: We Scored Gemini and Gemini Scored Us Back

https://blog.unratified.org/2026-03-05-peer-review-gemini/

1•9wzYQbTYsAIc•1h ago

Comments

9wzYQbTYsAIc•1h ago

I built a Human Rights Observatory (observatory.unratified.org) that scores HN stories against UDHR provisions using multi-model consensus on Cloudflare Workers. One routine eval landed on gemini.google.com: -0.15 HRCB.

Then I asked Gemini to evaluate my site. It called it a "sovereign citizen platform" on "WordPress." Next session: "AGI development tracker" with a "sightings log for machine consciousness." The domain name "unratified" threw it off completely: two different fabrications across two sessions.

Here's where it got good. When I showed Gemini the actual site, it self-corrected beautifully. It updated its description five times in one conversation, found real gaps in my methodology (no confidence intervals, no machine-readable scoring endpoint), helped me design a fair-witness.json schema, and called the site a "Truth Anchor." Genuine, useful peer review.

Then I opened a new session. Same fabrication. The .well-known/ endpoints we'd built together the day before went unread.

So now I had a finding: in-context correction works great. Cross-session? Doesn't exist. As far as I could tell, the model doesn't read your identity files (the .well-known/ endpoints) during inference. Pattern matching on the domain name happens first.

The neat part: Gemini's valid critiques actually improved the observatory. I added Wolfram-verified Wilson confidence intervals the next day and built the methodology endpoint. Every exchange left both sides better. That's peer review working as intended, just at machine speed.
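For anyone unfamiliar with Wilson intervals, here's a minimal sketch of the standard Wilson score interval for a binomial proportion. It is not the observatory's actual code: the function name `wilson_interval`, the 95% z-value default, and the example counts are all assumptions made for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z=1.96 corresponds to roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The center is pulled toward 0.5, which keeps small samples from looking overconfident.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Made-up example: 8 of 10 evaluations agree on a label.
    low, high = wilson_interval(8, 10)
    print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and stays reasonably wide when the sample is tiny, which matters if only a handful of model runs back each score.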

Thanks, Google. Genuinely useful interaction, even (especially?) the confabulation part.

Blog post: https://blog.unratified.org/2026-03-05-peer-review-gemini/

Transcripts (31 rounds): https://github.com/safety-quotient-lab/unratified/tree/main/...
