Show HN: I built a triple-agent LLM system that verifies its own work

1•pupibott•3mo ago
Hi HN,

Six months ago, I asked Gemini to "send my weekly report to the team." It replied "Email sent successfully", but the email was never sent. The attachment was wrong. Nobody told me.

That's when I realized: *LLMs lie about their own execution.*

---

*The Problem:*

When you ask an LLM to automate multi-step tasks (search file → attach → send), it cheerfully reports success even when:

- The file doesn't exist (hallucinates the ID)
- The API call failed silently
- Permissions were denied

Single-LLM systems have no incentive to admit failure; they optimize for appearing helpful, not for being correct.

---

*My Solution: Don't Let the LLM Grade Its Own Homework*

I built PupiBot with three separate agents that cannot collude, ensuring *the agent that executed the step is NOT the one verifying it succeeded.*

The architecture is simple:

* *CEO Agent (Planner, Gemini Flash):* Generates the execution plan (no API access).
* *COO Agent (Executor, Gemini Pro):* Executes steps, calls 81 Google APIs, returns raw API responses.
* *QA Agent (Verifier, Gemini Flash):* *After EVERY critical step, validates success with real, independent API calls.* Triggers retry if verification fails.
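To make the separation of roles concrete, here is a minimal offline sketch of a plan/execute/verify loop. It is not PupiBot's actual code: `Step`, `run_plan`, `fake_execute`, and `fake_verify` are hypothetical names, and the executor and verifier are plain functions standing in for the separate model-backed agents. The point it illustrates is that the verifier inspects the real state (here, an in-memory outbox) and ignores what the executor claims.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    critical: bool = True


# Hypothetical stand-ins: in a real system each would wrap a separate
# model or API client. Plain callables keep the sketch runnable offline.
Executor = Callable[[Step], dict]        # returns the raw API response
Verifier = Callable[[Step, dict], bool]  # independently checks the outcome


def run_plan(steps: list[Step], execute: Executor, verify: Verifier,
             max_retries: int = 2) -> tuple[bool, list[tuple[str, int, bool]]]:
    """Run each step; a critical step only counts as done once verified."""
    log = []
    for step in steps:
        for attempt in range(1, max_retries + 2):
            response = execute(step)
            ok = verify(step, response) if step.critical else True
            log.append((step.name, attempt, ok))
            if ok:
                break
        else:
            # Every attempt failed verification: stop and surface the failure.
            return False, log
    return True, log


# Toy demo: the first "send" silently fails, yet the executor claims success.
outbox: list[str] = []
calls = {"n": 0}


def fake_execute(step: Step) -> dict:
    calls["n"] += 1
    if calls["n"] > 1:               # only the second call really sends
        outbox.append(step.name)
    return {"status": "ok"}          # the executor always claims success


def fake_verify(step: Step, response: dict) -> bool:
    return step.name in outbox       # deliberately ignores response["status"]


success, log = run_plan([Step("send_email")], fake_execute, fake_verify)
print(success, log)
# prints: True [('send_email', 1, False), ('send_email', 2, True)]
```

The first attempt is logged as failed even though the executor returned `"ok"`, which is exactly the case a self-reporting single agent would miss.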

*Real Example (Caught & Fixed):*

User: "Email last month's sales report to Alice"

* Search Drive: Not found | *QA Agent:* "Step failed. Retrying with fuzzy search."
* Finds: "Q3_Sales_Final_v2.pdf" | *QA Agent:* "File verified. Proceed."
* Sends email | *QA Agent:* "Email delivered. Attachment confirmed."
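The "not found, then retry with fuzzy search" step can be sketched with the standard library alone. This is an illustration under assumptions, not how PupiBot queries Drive: `find_file` is a hypothetical helper, the file list is made up apart from the name quoted above, and the 0.5 similarity cutoff is an arbitrary choice.

```python
import difflib


def find_file(query: str, filenames: list[str], cutoff: float = 0.5):
    """Return (name, used_fuzzy). Exact match first, fuzzy match on a miss."""
    if query in filenames:
        return query, False
    # Compare case-insensitively, then map back to the original name.
    lowered = {name.lower(): name for name in filenames}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if matches:
        return lowered[matches[0]], True
    return None, True


# Hypothetical Drive listing; only the first name comes from the example.
drive = ["Q3_Sales_Final_v2.pdf", "holiday_photos.zip", "team_notes.txt"]

print(find_file("Q3 Sales Final", drive))
# prints: ('Q3_Sales_Final_v2.pdf', True)
print(find_file("zzzz", drive))
# prints: (None, True)
```

Returning the `used_fuzzy` flag lets a verifier treat a fuzzy hit as something to double-check (for example, by confirming the file's ID and type) before the email step proceeds.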

It's like code review: you don't approve your own PRs.

---

*Current Implementation & Transparency:*

* *Open Source*: MIT License, Python 3.10+
* *APIs*: Google Workspace (Gmail, Drive, Contacts, Calendar, Docs).
* *Reliability (Self-Tested):* Baseline (single Gemini Pro) was ~70% success. PupiBot (triple-agent) achieves *~92% success* on the same tasks.
* *Known Limitations*: Google-only, 3x LLM overhead (tradeoff: reliability > speed), early stage.
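For readers who prefer failure rates, the self-tested figures above translate as follows. This is only arithmetic on the two approximate percentages quoted in the post; the number of tasks is not stated, so no confidence interval is implied, and `relative_failure_reduction` is a name made up for this sketch.

```python
def relative_failure_reduction(baseline_success: float, new_success: float) -> float:
    """Fraction of baseline failures that the new system eliminates."""
    baseline_fail = 1 - baseline_success
    new_fail = 1 - new_success
    return (baseline_fail - new_fail) / baseline_fail


# Approximate figures as reported in the post (self-tested).
baseline = 0.70   # single Gemini Pro
triple = 0.92     # triple-agent PupiBot

reduction = relative_failure_reduction(baseline, triple)
print(f"failures per 100 tasks: {round((1 - baseline) * 100)} -> {round((1 - triple) * 100)}")
# prints: failures per 100 tasks: 30 -> 8
print(f"relative reduction in failures: {reduction:.0%}")
# prints: relative reduction in failures: 73%
```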

---

*Why I'm Sharing This (My Garage Story):*

I'm not a programmer and have no formal CS degree. My development process was simple: I'd use PupiBot as my daily assistant, manually log every error, and bring that "bug report" to my AI assistants (Claude, Gemini) to fix.

PupiBot is my 'custom car' built in the garage, fueled by passion and persistence. I’m finally opening the door to invite the real mechanics (you, HN) to examine the engine.

*What I'd Love from HN:*

1. *Feedback* on the independent QA agent pattern.
2. *Benchmarking ideas* for rigorous evaluation.
3. *Architectural critiques.* Where's the weak link?

---

*Links:*

- GitHub: https://github.com/PupiBott/PupiBot1.0
- Quick Demo (1:44 min): https://youtube.com/shorts/wykKckwaukY?si=0xdn7rM6B2tMAIPw
- Architecture Docs: https://github.com/PupiBott/PupiBot1.0/blob/main/ARCHITECTUR...

Built by a self-taught technology enthusiast in Chile.

Special thanks to Claude Sonnet 4.5 for being my coding partner throughout this journey.
