How2Everything: Mining the web to evaluate and improve LLMs on real procedures

1•maxloh•1h ago

Comments

maxloh•1h ago

From Allen AI's Discord:

Introducing *How2Everything*—an open framework for benchmarking & improving how LLMs generate step-by-step procedures.

LLMs constantly produce step-by-step procedures, from instructions for filing taxes to plans for AI agents, but improving this capability is challenging. Outputs can sound fluent while describing steps that don't actually work; surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions; and manual verification doesn't scale.

How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong order, omitted prerequisites) → use that signal to train better models.

It has three main components:

*How2Mine*—a pipeline that extracts & standardizes procedures from web pages covering 14 topics

*How2Bench*—a 7,000-procedure benchmark built from How2Mine

*How2Score*—an evaluation protocol powered by How2Judge, an open 8B judge model trained to flag critical failures
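To make the idea of scoring with a failure-flagging judge concrete, here is a minimal Python sketch of one way per-procedure verdicts could be rolled up into a single benchmark number. Everything in it is an assumption for illustration: the `Verdict` record, its `has_critical_failure` field, and the choice to report the share of outputs with no flagged failure are hypothetical and are not taken from the How2Everything code or paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical judge output for one generated procedure."""
    procedure_id: str
    has_critical_failure: bool


def failure_free_rate(verdicts):
    """Return the share of generated procedures with no flagged critical failure."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ok = sum(1 for v in verdicts if not v.has_critical_failure)
    return ok / len(verdicts)


# Made-up verdicts, only to show the aggregation.
verdicts = [
    Verdict("a", False),
    Verdict("b", True),
    Verdict("c", False),
    Verdict("d", False),
]
print(failure_free_rate(verdicts))  # 0.75
```

A binary per-output verdict keeps the aggregate easy to read, and the same per-output signal could plausibly double as a reward, though the actual protocol may differ.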

How2Judge agrees with human judgments ~80% of the time and is cheap enough for large-scale eval, making it practical as both a benchmark scorer and an RL reward signal.
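For readers unfamiliar with how an agreement figure like this is typically computed, here is a small Python sketch of plain percent agreement between judge labels and human labels on the same items. The function name, the boolean label encoding, and the sample data are assumptions for illustration; the announcement does not say which agreement measure or label format was used.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up labels: True means "critical failure flagged".
judge = [True, False, False, True, False]
human = [True, False, True, True, False]
print(agreement_rate(judge, human))  # 0.8
```

Raw agreement does not correct for chance, so a chance-adjusted statistic may tell a different story when one label dominates.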

RL training with How2Score yields >10-point gains on Qwen3 4B, Qwen3 8B, and Olmo 3 7B Think, with no regressions across 12 standard benchmarks covering knowledge, reasoning, chat, math, and code. How2Bench also scales cleanly, remaining informative from early 1B pretraining checkpoints through frontier LLMs. And we stress-tested two shortcut explanations (format compliance and memorization); neither accounts for the improvements, pointing to real gains in procedure generation.

The full How2Everything framework, including How2Judge, is available now.

Blog: https://allenai.org/blog/how2everything

Paper: https://arxiv.org/pdf/2602.08808

Code: https://github.com/lilakk/how2everything

HF: https://huggingface.co/collections/how2everything/how2everyt...
