A simple test-time method that beats Claude Mythos on Terminal-Bench

https://llm-as-a-verifier.notion.site
1•jackykwok•2h ago

Comments

jackykwok•2h ago
Excited to share LLM-as-a-Verifier, a general-purpose verification framework that can be paired with any agent harness and model.

We show that scaling verification compute for a well-designed harness (e.g. ForgeCode + GPT 5.4) raises accuracy on Terminal-Bench from 81.8% to 86.4%, outperforming Claude Mythos (82%).

The key finding is that most agents already "know" how to solve the tasks. If you run them repeatedly (say, 100 times), they’ll often produce the correct solution at least once. But they can’t tell which attempt is the correct one, particularly on long-horizon tasks.
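
To make the "at least once" point concrete, here is a rough sketch in Python. It assumes every run succeeds independently with the same probability `p`, which is a simplification for illustration and not something the post claims; the function name is hypothetical.

```python
def prob_at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k runs succeeds.

    Assumes (for illustration only) that runs are independent and that
    each one succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # All k runs fail with probability (1 - p) ** k; take the complement.
    return 1.0 - (1.0 - p) ** k
```

Under these assumptions, even a task that is solved only 5% of the time per run would be solved at least once in 100 runs with probability of about 0.994 (`prob_at_least_one_success(0.05, 100)`). The hard part, as the comment says, is picking out that run.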

That’s where LLM-as-a-Verifier comes in. It leverages the probability distribution over scoring tokens to provide fine-grained feedback and scales verification through repeated evaluation and criteria decomposition.
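
One way to read that description is sketched below in Python. This is not the project's code. The 1-to-5 scoring scale, the criterion names, the plain averaging, and every function name are assumptions made for illustration; the log-probabilities would come from whatever verifier model is used.

```python
import math
from statistics import mean

# Hypothetical scoring scale: the verifier is asked to answer with one of these tokens.
SCORE_TOKENS = ("1", "2", "3", "4", "5")


def expected_score(logprobs: dict[str, float]) -> float:
    """Probability-weighted score over the scoring tokens.

    `logprobs` maps a token to its log-probability at the scoring position.
    Tokens outside SCORE_TOKENS are ignored and the rest are renormalized.
    """
    present = {t: lp for t, lp in logprobs.items() if t in SCORE_TOKENS}
    if not present:
        raise ValueError("no scoring tokens found in logprobs")
    top = max(present.values())  # subtract the max for numerical stability
    weights = {t: math.exp(lp - top) for t, lp in present.items()}
    total = sum(weights.values())
    return sum(int(t) * w for t, w in weights.items()) / total


def candidate_score(evals: dict[str, list[dict[str, float]]]) -> float:
    """Average over repeated evaluations, then over criteria.

    `evals` maps a criterion name to the logprob dicts collected from
    repeated verifier calls on that criterion.
    """
    return mean(mean(expected_score(lp) for lp in runs) for runs in evals.values())


def pick_best(candidates: dict[str, dict[str, list[dict[str, float]]]]) -> str:
    """Return the name of the candidate with the highest aggregate score."""
    return max(candidates, key=lambda name: candidate_score(candidates[name]))
```

Using the expected value rather than the single most likely token is what makes the feedback fine-grained in this sketch: two candidates that would both be rated "4" can still be separated by how much probability mass sits on "5" versus "3".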

Blog: https://llm-as-a-verifier.notion.site

Code: https://llm-as-a-verifier.github.io/

What Claude Code's Source Revealed About AI Engineering Culture

https://techtrenches.dev/p/the-snake-that-ate-itself-what-claude
1•lucketone•21s ago•0 comments

Why Affordability Isn't the Same as Falling Prices

https://www.urbanproxima.com/p/why-affordability-isnt-the-same-as
1•paulpauper•29s ago•0 comments

Show HN: We fine-tuned an AI model for log search – Accuracy 50% to 80%

https://thedex.run/blog/why-general-purpose-ai-fails-at-log-search
1•rkorlimarla•2m ago•0 comments

Show HN: GizmoSauce – no-code website widgets

https://demo.gizmosauce.com/demos/little-caesars/
1•endurant_dev•2m ago•0 comments

GitHub webhook secrets leaked in headers

https://gist.github.com/ltrgoddard/7abfc8e4123e403505dfbe767a2487ab
1•ltrg•2m ago•1 comment

Gemini Plugin for Claude Code

https://github.com/sakibsadmanshajib/gemini-plugin-cc
1•sakibss•3m ago•0 comments

Painful learnings from sponsoring a tech conference in SF

https://www.terezatizkova.com/writing/conference-booths
1•tizkovatereza•4m ago•0 comments

Civilization Is Not the Default. Violence Is

https://apropos.substack.com/p/civilization-is-a-public-good
2•paulpauper•4m ago•0 comments

MetaBrainz is looking for a new executive director

https://blog.metabrainz.org/2026/04/14/seeking-a-new-executive-director/
1•MrKomodoDragon1•7m ago•0 comments

Overcoming OSS Contribution Anxiety

https://ym2132.github.io/vllm_make_awq_models_work_batch_invariance.html
1•Two_hands•8m ago•0 comments

Dark matter could be black holes from a different universe

https://theconversation.com/could-dark-matter-be-made-of-black-holes-from-a-different-universe-27...
1•samizdis•8m ago•0 comments

H.R.8250 – To require operating system providers to verify the age of any user

https://www.congress.gov/bill/119th-congress/house-bill/8250/all-info
2•cft•9m ago•0 comments

ChatGPT, make me a corporate takeover strategy

https://twitter.com/_nathancalvin/status/2044071303968145806
1•yoyohello13•12m ago•0 comments

Cement firm Lafarge found guilty of financing terrorism in Syria

https://www.swissinfo.ch/eng/various/cement-firm-lafarge-found-guilty-of-financing-terrorism-in-s...
2•Teever•12m ago•0 comments

I asked Claude how it wants to browse the web. It built LAD (LLM-as-DOM)

https://github.com/menot-you/llm-as-dom
1•tiago-im•13m ago•0 comments

Why I'm selling all my real estate – by Graham Stephan

https://grahamstephan.substack.com/p/im-selling-everything
1•bilsbie•15m ago•1 comment

BridgeBase – A Quantum-Safe Gateway for AI Agents (ML-KEM-768)

https://pqc-gateway-production.up.railway.app/
1•huzaiiiiiiiii•16m ago•0 comments

Have attendees wear your startup's merch at YC Startup School India

https://www.surfacearea.shop/
4•demod6•17m ago•1 comment

Hodor: a simple knowledge base for security and trust and safety

https://github.com/bq33/HODOR
1•33bquinn•19m ago•1 comment

The Secret Language of Ships

https://hakaimagazine.com/videos-visuals/the-secret-language-of-ships/
2•bookofjoe•21m ago•0 comments

The $10k-a-year college education has arrived (1981)

https://www.nytimes.com/1981/02/19/nyregion/the-10000-a-year-college-education-has-arrived.html
1•downbad_•22m ago•1 comment

Show HN: WM Arena – Can you tell real Atari gameplay from AI predictions?

https://arena.worldflux.ai/quiz
1•Yoshi_Hyoda•24m ago•0 comments

Fuck the Cloud (2009)

https://ascii.textfiles.com/archives/1717
3•downbad_•28m ago•2 comments

TruffleRuby 34 Is Released

https://truffleruby.dev/blog/truffleruby-34-is-released
2•ksec•29m ago•0 comments

Show HN: Ernie-Image: AI Poster, Comic and Text-in-Image Generator

https://ernie-image.ai
1•sarkory•30m ago•0 comments

Personal Agent Rankings – OpenRouter

https://openrouter.ai/apps/category/productivity/personal-agent?period=week
2•obilgic•30m ago•0 comments

Your codebase doesn't care how it got written

https://robbyonrails.com/articles/2026/04/14/your-codebase-doesnt-care-how-it-got-written/
2•robbyrussell•31m ago•1 comment

Build a Developer Knowledge Graph from Claude Code Sessions

https://create-context-graph.dev/docs/tutorials/claude-code-sessions
1•johnymontana•31m ago•0 comments

Man wins €1M Picasso painting in €100 charity raffle

https://www.bbc.com/news/articles/cq8ww7d72wyo
1•geox•31m ago•0 comments

Stop Flock

https://stopflock.com
1•cdrnsf•32m ago•0 comments