How2Everything: Mining the web to evaluate and improve LLMs on real procedures

https://allenai.org/blog/how2everything
1•maxloh•1h ago

Comments

maxloh•1h ago
From Allen AI's Discord:

Introducing *How2Everything*—an open framework for benchmarking & improving how LLMs generate step-by-step procedures.

LLMs constantly produce instructions for everything from filing taxes to plans for AI agents, but improving this capability is challenging. Outputs can sound fluent while describing steps that don't actually work, surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions, and manual verification doesn't scale.

How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong order, omitted prerequisites) → use that signal to train better models.

It has three main components:

*How2Mine*—a pipeline that extracts & standardizes procedures from web pages covering 14 topics

*How2Bench*—a 7,000-procedure benchmark built from How2Mine

*How2Score*—an evaluation protocol powered by How2Judge, an open 8B judge model trained to flag critical failures
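
To make "critical failures" concrete, here is a toy sketch of the kind of check described above: comparing a generated procedure against a reference and flagging missing steps and wrong order. This is not How2Judge or How2Score; the announcement says the real judge is a trained 8B model. The names `Failure` and `find_failures` are hypothetical, and matching steps by normalized exact string is an assumption made only to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    kind: str    # "missing_step" or "wrong_order"
    detail: str


def _norm(step: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(step.lower().split())


def find_failures(reference: list[str], candidate: list[str]) -> list[Failure]:
    """Toy check (hypothetical, not the How2Judge method)."""
    ref = [_norm(s) for s in reference]
    cand = [_norm(s) for s in candidate]
    failures: list[Failure] = []

    # A reference step that never appears in the candidate is a missing step.
    for step in ref:
        if step not in cand:
            failures.append(Failure("missing_step", step))

    # Steps present in both should keep the reference's relative order.
    shared = [s for s in ref if s in cand]
    positions = [cand.index(s) for s in shared]
    for a, b, pa, pb in zip(shared, shared[1:], positions, positions[1:]):
        if pa > pb:
            failures.append(Failure("wrong_order", f"{b!r} appears before {a!r}"))

    return failures


reference = ["Preheat oven", "Mix batter", "Bake"]
print(find_failures(reference, ["mix batter", "Bake"]))
print(find_failures(reference, ["Bake", "Preheat oven", "Mix batter"]))
```

Exact matching would miss paraphrased steps, which is presumably why a learned judge is used instead of a rule like this one.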

How2Judge agrees with human judgments ~80% of the time and is cheap enough for large-scale eval, making it practical as both a benchmark scorer and an RL reward signal.
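
For readers wondering what "agrees ~80% of the time" means operationally, one common reading is simple label agreement: the fraction of items where the judge and a human give the same verdict. The sketch below assumes binary pass/fail labels and made-up data; the paper may define agreement differently, and `agreement_rate` is a hypothetical helper, not part of the released code.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of items where judge and human verdicts match."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Illustrative labels only (True = procedure has a critical failure).
judge = [True, False, True, True, False]
human = [True, False, False, True, False]
print(agreement_rate(judge, human))  # 4 of 5 verdicts match -> 0.8
```

Raw agreement does not correct for chance, so a skewed label distribution can inflate it.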

RL training with How2Score yields >10-point gains on Qwen3 4B, Qwen3 8B, and Olmo 3 7B Think, with no regressions across 12 standard benchmarks covering knowledge, reasoning, chat, math, and code. How2Bench also scales cleanly, remaining informative from early 1B pretraining checkpoints through frontier LLMs. And we stress-tested two shortcut explanations (format compliance and memorization); neither accounts for the improvements, pointing to real gains in procedure generation.

The full How2Everything framework, including How2Judge, is available now.

Blog: https://allenai.org/blog/how2everything

Paper: https://arxiv.org/pdf/2602.08808

Code: https://github.com/lilakk/how2everything

HF: https://huggingface.co/collections/how2everything/how2everyt...

Show HN: I made Seedance 2.0 accessible before the official API launches

https://seedance2-pro.com
1•samidatikakr•51s ago•0 comments

Deepfaking Orson Welles's Mangled Masterpiece

https://www.newyorker.com/magazine/2026/02/09/deepfaking-orson-welless-mangled-masterpiece
1•CharlesW•1m ago•0 comments

China's Data Center Boom: A View from Zhangjiakou (2025)

https://sinocities.substack.com/p/chinas-data-center-boom-a-view-from
1•fzliu•2m ago•0 comments

Video can be "recovered" from Nest cameras even without cloud subscription

https://www.nbcnews.com/news/us-news/authorities-release-surveillance-photo-potential-subject-nan...
1•mv4•2m ago•1 comment

ICE defies judges' orders to release detainees, step by step

https://www.politico.com/news/2026/02/10/ice-immigration-detention-court-orders-00771727
1•SilverElfin•3m ago•0 comments

Introducing winpulse

https://xenodium.com/introducing-winpulse
1•xenodium•4m ago•0 comments

'E-bike for your feet': How bionic sneakers could change human mobility

https://www.npr.org/2026/02/10/nx-s1-5698195/nike-amplify-bionic-sneakers
1•apparent•5m ago•0 comments

New ARIA research funding programme: nearly £50M to secure AI agents in the wild

https://www.aria.org.uk/opportunity-spaces/trust-everything-everywhere/scaling-trust/funding/
1•multiagent•5m ago•0 comments

Digital Sovereignty Won't Save Us from Internet Shutdowns

https://www.ictworks.org/digital-sovereignty-wont-save-us-from-internet-shutdowns/
1•laurex•7m ago•0 comments

Build an AI coding agent in less than 700 lines of Python code

https://leanpub.com/build-your-own-coding-agent
1•jingweno•8m ago•0 comments

The Day I Realized Recovery Affects VO₂ Max More Than Effort

https://vo2maxpro.com/blog/recovery-affects-vo2-max-more-than-effort
1•GoodluckH•9m ago•0 comments

A pattern for safe database access with AI coding agents

https://docs.getpochi.com/tutorials/secure-db-access-in-pochi/
2•gyxlucy•9m ago•0 comments

A Deep Dive into Ruby C Extension Memory Management: embedded vs. separate (2025)

https://medium.com/@m.mastrodonato/a-deep-dive-into-ruby-c-extension-memory-management-embedded-v...
1•ciconia•10m ago•0 comments

Show HN: AIOpt – local-only guardrail for LLM token/cost regressions

https://github.com/tkddlr0716-collab/aiopt
1•psi0716•12m ago•0 comments

Hard Work and Success

https://almowry.com/writing/hard-work-and-success/
1•amukbils•12m ago•1 comment

I want a phone I can fix, and Fairphone's growth shows the world does too

https://www.androidcentral.com/phones/i-want-a-phone-i-can-actually-fix-and-fairphones-record-gro...
2•NoboruWataya•13m ago•0 comments

FBI releases surveillance video in Guthrie case recovered from Nest cam back end

https://twitter.com/FBIDirectorKash/status/2021281103454072983/photo/1
1•tokyobreakfast•13m ago•0 comments

Jeffrey Epstein's digital cleanup crew

https://www.theverge.com/report/876081/jeffrey-epstein-files-seo-google-digital-footprint-emails
4•imartin2k•15m ago•0 comments

Real-time Reddit sentiment tracker for stock trading

https://www.wsbsentiment.com/
1•shawnmfarnum•15m ago•3 comments

Trump's War on History

https://www.motherjones.com/politics/2026/02/america-freedom-task-force-250-trump-anniversary-his...
3•leotravis10•15m ago•0 comments

Quitting .NET after 22 years

https://www.thatsoftwaredude.com/content/14253/quitting-dot-net-after-22-years
1•Waltz1•16m ago•0 comments

Is human collaboration the answer to the skill formation risks by AI?

https://www.gethopp.app/blog/pair-prompting
1•iparaskev•20m ago•2 comments

Microsoft Should Watch the Expanse

https://idiallo.com/blog/microsoft-should-watch-the-expanse
1•nomdep•20m ago•0 comments

Show HN: Cosmic CLI – Build, deploy, and manage apps from your terminal with AI

https://github.com/cosmicjs/cli
1•tonyspiro•20m ago•0 comments

AgentLogs: Open-source observability for AI coding agents

https://github.com/agentlogs/agentlogs
1•tosh•21m ago•0 comments

WordCatcher

https://wordwalker.ca/games/word-catcher/
1•petedrinnan•21m ago•0 comments

Breakthrough pancreatic cancer therapy blocks tumor resistance in mice

https://www.pnas.org/doi/10.1073/pnas.2523039122
1•DpdC•22m ago•0 comments

Show HN: Multimodal perception system for real-time conversation

https://raven.tavuslabs.org
2•mert_gerdan•23m ago•2 comments

Heuristics for lab robotics, and where its future may go

https://www.owlposting.com/p/heuristics-for-lab-robotics-and-where
1•abhishaike•24m ago•0 comments

Show HN: Traction – Security readiness framework for scaling SaaS teams

https://traction.fyi
1•ERROR_0x06•25m ago•0 comments