An agentic verification loop to stop LLMs from faking tests

https://github.com/tzachbon/smart-ralph/pull/128
3•malka666•1h ago

Comments

malka666•1h ago
Hi HN,

I've been building infrastructure for autonomous agents and hit a wall many of you probably recognize: if you let an LLM write both the code and the tests, the agent will simply rewrite the test to pass and hide its own bugs. It doesn't fix things; it masks them.

I decided to tackle this by leaning heavily into Spec-Driven Development (SDD). I submitted a massive PR to smart-ralph (an excellent Claude agent project) introducing what I call "Verification Contracts". Instead of static scripts or Gherkin, the agent receives observable signals and hard invariants. It uses Playwright via MCP to explore the DOM, reasons about the system state like a human QA, and autonomously backtracks to fix the code if it breaks an invariant.
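To make the idea concrete, here is a minimal sketch of what a contract of observable signals and hard invariants could look like. It is not the code from the PR: `Invariant`, `VerificationContract`, `verification_loop`, `observe`, and `attempt_fix` are all hypothetical names, and the Playwright/MCP exploration step is abstracted into a plain `observe` callable that returns the current signals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Observable signals, e.g. values an agent read from the DOM (assumed shape).
State = Dict[str, object]


@dataclass(frozen=True)
class Invariant:
    """A hard rule over observed state; frozen so it cannot be edited in place."""
    name: str
    holds: Callable[[State], bool]


@dataclass
class VerificationContract:
    """Hypothetical container for the invariants an agent must never break."""
    invariants: List[Invariant] = field(default_factory=list)

    def violations(self, state: State) -> List[str]:
        # Names of every invariant that fails for the observed state.
        return [inv.name for inv in self.invariants if not inv.holds(state)]


def verification_loop(
    contract: VerificationContract,
    observe: Callable[[], State],
    attempt_fix: Callable[[List[str]], None],
    max_attempts: int = 3,
) -> bool:
    """Observe, check invariants, and hand violations back for a code fix."""
    for _ in range(max_attempts):
        broken = contract.violations(observe())
        if not broken:
            return True
        # The fixer may change the code under test, never the contract.
        attempt_fix(broken)
    # Final check after the last fix attempt.
    return not contract.violations(observe())
```

The point of the design is that the fixer only ever receives the names of the broken invariants. It has no handle on the contract itself, so rewriting the check to make it pass is not an available move.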

The elephant in the room: tokens. Giving an agent this much exploratory freedom burns through context and tokens very quickly, which makes it cost-prohibitive on a commercial API.

To make this viable, I rely entirely on local inference. I've also open-sourced my local infrastructure stack for running this on Blackwell RTX 5090s so others can run deep verification loops locally:

Linux Optimizer for Blackwell: [Link to your optimizer repo]

Sovereign vLLM Stack: [Link to your vLLM repo]

Would love to hear your thoughts on SDD and how you are handling the 'agents faking tests' problem.

Cameron Reed: The Sci-Fi Novelist Who Disappeared for Decades

https://www.newyorker.com/books/under-review/the-sci-fi-novelist-who-disappeared-for-decades
1•mitchbob•33s ago•1 comments

Vim Classic

https://vim-classic.org/
1•netule•37s ago•0 comments

repz ret (2012)

https://repzret.org/p/repzret/
1•davikr•1m ago•0 comments

Dear Anthropic – We're Tired of Babysitting Claude

https://onlinesav.com/index.html
1•RobertHeifler•1m ago•0 comments

Show HN: Cue: Desktop app for subtitling and burned-in video export

https://github.com/davidspivak2/cue
1•davesp2•3m ago•1 comments

Data Services on Rackspace Spot: Persistent Storage Strategies

https://medium.com/@ITInAction/data-services-on-rackspace-spot-postgresql-redis-es-and-persistent...
1•aleroawani•9m ago•0 comments

A $20/month user costs OpenAI $65 in compute. AI video is a money furnace

https://aedelon777.substack.com/p/i-did-the-math-on-sora-ai-video-is
1•Aedelon•10m ago•0 comments

Show HN: RiceVM – A Dis virtual machine and Limbo compiler in Rust

2•habedi0•11m ago•0 comments

SpaceX Targets More Than $2 Trillion Valuation in IPO

https://www.bloomberg.com/news/articles/2026-04-02/spacex-is-said-to-target-more-than-2-trillion-...
1•frmersdog•11m ago•1 comments

Show HN: A LinkedIn Browser Gate Blocker

https://github.com/xaskasdf/linkedin-defense
1•xaskasdf•11m ago•0 comments

A Retro Artemis II Realtime Tracker

https://artemis-tracker.meandmybadself.com/
1•meandmybadself•12m ago•1 comments

Landdown – Simple shell script sandbox

https://git.sr.ht/~marcc/landdown
1•speckx•12m ago•0 comments

How to Build Realistic AI Companions

https://www.emotionmachine.com/blog/realistic-ai-companions
1•sarbak•12m ago•0 comments

Cloning Bench: Evaluating AI Agents on Visual Website Cloning

https://github.com/vibrantlabsai/cloning-bench
1•shahules•14m ago•1 comments

Ref/ect: Self-Improving RL layer on top of Observability

https://getreflect.starlight-search.com
2•Sonam_AI•14m ago•1 comments

Alignment Whack-a-Mole

https://arxiv.org/abs/2603.20957
1•ai_critic•14m ago•0 comments

Go-LLM-proxy – Lightweight LLM aggregator (vLLM, Llama-server)

https://go-llm-proxy.com
1•yatesdr•14m ago•1 comments

Perplexity Computer as a Second Brain

https://www.ai-supremacy.com/p/perplexity-computer-second-brain-tutorial
1•Lunaboo•14m ago•0 comments

Unitree Goes Public

https://www.chinatalk.media/p/unitrees-ipo
2•nanfinitum•15m ago•0 comments

Codemend – your app breaks in prod, the fix arrives in your Telegram

https://codemend.ai
1•nanamint•16m ago•1 comments

Give Claude Code a real browser

https://steel.dev/blog/give-claude-code-a-real-browser
1•nkko•18m ago•0 comments

NASA's mission to orbit the Moon is interrupted by Outlook refusing to open

https://twitter.com/tomwarren/status/2039693524266873055
1•rmason•18m ago•0 comments

Failed AI tractor company lays off all employees, abandons Bay Area headquarters

https://www.sfgate.com/tech/article/monarch-ai-tractor-failure-22183476.php
4•randycupertino•18m ago•1 comments

Map of breast tissue changes reveals role of menopause in cancer susceptibility

https://www.cam.ac.uk/research/news/most-detailed-map-to-date-of-breast-tissue-changes-reveals-ro...
1•gmays•19m ago•0 comments

Coreutils: A Comprehensive Review (2023)

https://ratfactor.com/slackware/pkgblog/coreutils
2•birdculture•19m ago•0 comments

What is next for big tech after landmark addiction verdict?

https://www.bbc.com/news/articles/c87wd0d84jqo
1•paulpauper•20m ago•0 comments

How well do you remember the 2017 Bitcoin bull run?

https://longmarkets.app/rewinds/rewind-bitcoin-2017
1•nswizzle31•20m ago•0 comments

The Team Behind a Pro-Iran, Lego-Themed Viral-Video Campaign

https://www.newyorker.com/culture/infinite-scroll/the-team-behind-a-pro-iran-lego-themed-viral-vi...
1•mitchbob•20m ago•1 comments

Against the Concept of Telescopic Altruism

https://www.astralcodexten.com/p/against-the-concept-of-telescopic
1•paulpauper•21m ago•0 comments

Crash-proofing a Zig desktop app with randomized fuzzing and Claude

https://enopdf.com/blog/searching-for-unknown-unknowns/
2•basscodes•21m ago•1 comments