
Sims with verifiable rewards for web agent benchmarking and RL

https://halluminate.ai/blog/westworld
1•wujerry2000•1h ago

Comments

wujerry2000•1h ago
Hi all! Sharing some of our recent work around building RL envs and sims for agent training.

There are a lot more technical details on building the benchmark in the post. If you are interested in more RL/Post-Training, I'd highly recommend reading this super in-depth blog from our partners at Yutori: https://yutori.com/blog/introducing-navigator

Some more casual thoughts and lessons:

1) The lack of a high volume of quality RL environments / sims remains one of the largest blockers to training frontier agents, especially as labs/enterprises shift towards creating increasingly specialized AI coworkers that can do real work.

2) Building an RL env is VERY different from building a high quality dataset. While the primary inputs for dataset creation are specialized human annotators and clear rubrics, the inputs to building a great RL env involve humans, engineers, product, data, and an orchestration of everything together. There are a lot of greenfield problems when you move from building singular environments to SCALING by 1-3 orders of magnitude.

3) There is a constant push/pull between building tasks that are easily verifiable and building tasks that are realistic. It's sort of like a 2x2 grid. The best (and most valuable) tasks are both realistic and verifiable. There are constant tradeoffs being made, and we often find ourselves limited in the realistic tasks we can build when they lack a clear verifier. I'm reminded of Jason Wei's post here: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

4) When it comes to building browser sims, we found the hardest challenges came NOT from mimicking the frontend components but rather from creating a realistic distribution of data for them to sit on top of. Although not immediately obvious, this makes a lot of sense. For example, when building Noodle Flights, the frontend UI was (although nontrivial) manageable to create, but modeling the distribution of complex flight data was far harder.

5) It's an iterative process. Building a perfect sim / verifier out of the gate is very difficult, and a large part of the RL process is shepherding / QA of specific tasks and verifiers. The best way to do this is by constantly reviewing trajectories and spotting false positives/negatives. This is tedious work, but often front-loaded - until you see smooth gains :)
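To make points 3 and 5 a bit more concrete, here is a minimal sketch of what a state-based verifier and a false positive/negative audit might look like. Everything in it is a hypothetical illustration, not how our envs actually work: the `Trajectory` fields, the `verify_booking` and `audit` helpers, the task ids, and the booking data are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One agent episode in a hypothetical flight-booking sim."""
    task_id: str
    final_state: dict   # sim state when the episode ended
    human_label: bool   # reviewer's judgment: did the agent really succeed?


def verify_booking(final_state: dict, expected: dict) -> bool:
    """State-based verifier: pass only if every expected field matches the booking."""
    booking = final_state.get("booking")
    if booking is None:
        return False
    return all(booking.get(key) == value for key, value in expected.items())


def audit(trajectories, expected_by_task):
    """Compare verifier verdicts with human labels and collect the disagreements."""
    false_pos, false_neg = [], []
    for t in trajectories:
        verdict = verify_booking(t.final_state, expected_by_task[t.task_id])
        if verdict and not t.human_label:
            false_pos.append(t.task_id)   # verifier passed, reviewer disagreed
        elif not verdict and t.human_label:
            false_neg.append(t.task_id)   # verifier failed, reviewer disagreed
    return false_pos, false_neg


# Hypothetical task specs: the fields each verifier checks.
expected_by_task = {
    "t1": {"origin": "SFO", "destination": "JFK"},
    "t2": {"origin": "SFO", "destination": "JFK", "cabin": "economy"},
    "t3": {"destination": "LHR"},  # under-specified: the date is never checked
}

trajectories = [
    Trajectory("t1", {"booking": {"origin": "SFO", "destination": "JFK"}}, True),
    # Correct booking, but the sim stores "Economy" with a capital E.
    Trajectory("t2", {"booking": {"origin": "SFO", "destination": "JFK",
                                  "cabin": "Economy"}}, True),
    # Right destination, wrong date according to the reviewer.
    Trajectory("t3", {"booking": {"origin": "SFO", "destination": "LHR",
                                  "date": "2025-12-02"}}, False),
]

false_pos, false_neg = audit(trajectories, expected_by_task)
print("false positives:", false_pos)
print("false negatives:", false_neg)
```

In this toy data, t3 is a false positive because the verifier is too loose (it never checks the date), and t2 is a false negative because the verifier is too strict about string casing. Each kind of disagreement points at a different fix: tighten the task spec, or normalize the state before comparing.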

Have lots more thoughts, but these were just top of mind today. If this work is interesting, I'm always happy to chat (we're also hiring)!

Datadog Is Down

https://status.datadoghq.com/incidents/cvdjtf81756n
2•markiannucci•3m ago•0 comments

DMA Collectives for Efficient ML Communication Offloads

https://arxiv.org/abs/2511.06605
1•matt_d•3m ago•0 comments

Taking prenatal supplements associated with 30% lower risk of autism

https://medicalxpress.com/news/2025-11-prenatal-supplements-autism.html
1•bikenaga•4m ago•0 comments

Calculated Risk: Trade Deficit Decreased to $59.6B in August

https://www.calculatedriskblog.com/2025/11/trade-deficit-decreased-to-596-billion.html
1•speckx•5m ago•0 comments

Our tools are failing us

https://blank.page/@mo/our-tools-are-failing-us
1•boudra•6m ago•0 comments

Pixar: The Early Days

https://stevejobsarchive.com/stories/pixar-early-days
2•tosh•6m ago•0 comments

Act-1: A Robot Foundation Model Trained on Zero Robot Data

https://www.sunday.ai/journal/no-robot-data
1•pr337h4m•7m ago•0 comments

Fine, Trade Labubu Futures

https://www.bloomberg.com/opinion/newsletters/2025-11-19/fine-trade-labubu-futures
1•ioblomov•8m ago•1 comments

Enumerating Three Billion WhatsApp Accounts for Security and Privacy

https://github.com/sbaresearch/whatsapp-census
2•filippofinke•9m ago•0 comments

Understanding neural networks through sparse circuits – OpenAI

https://openai.com/index/understanding-neural-networks-through-sparse-circuits/
1•JnBrymn•10m ago•0 comments

Gov. Abbott's office redacts pages of emails about Elon Musk

https://www.kut.org/politics/2025-11-19/texas-governor-abbott-elon-musk-emails-redacted
6•pavel_lishin•13m ago•0 comments

Nest Thermostats upload 50 megabytes to Google every day after being disabled [video]

https://www.youtube.com/watch?v=jC5wcJM8iuU
1•tartoran•14m ago•1 comments

Building with Distributed Actors: What and Why

https://withblue.ink/2025/11/19/distributed-actors-model.html
2•ItalyPaleAle•14m ago•0 comments

Europe wants to make space food out of thin air and astronaut pee

https://www.space.com/space-exploration/human-spaceflight/europe-wants-to-make-space-food-out-of-...
3•domofutu•16m ago•0 comments

What AI Is Really For

https://www.chrbutler.com/what-ai-is-really-for
4•delaugust•16m ago•0 comments

A simple UK self-employed tax calculator (instant monthly estimate)

https://selfemployedtaxcalculators.co.uk/
1•seo-punk•17m ago•1 comments

Was MCP a mistake? The internet weighs in

https://www.aiengineering.report/p/was-mcp-a-mistake-the-internet-weighs
3•waprin•18m ago•0 comments

Chinese EV makers accelerate robotics drive for 'game-changing' edge over US

https://www.scmp.com/business/china-evs/article/3333310/chinese-ev-makers-accelerate-robotics-dri...
2•Teever•20m ago•0 comments

OpenHands Software Agent SDK

https://github.com/OpenHands/software-agent-sdk
1•rbren•21m ago•0 comments

Show HN: Allein - Markdown editor with AI autocompletion, completely offline

https://github.com/szilarddoro/allein
1•szdoro•22m ago•0 comments

Firefox adds support for customizable keyboard shortcuts

https://bugzilla.mozilla.org/show_bug.cgi?id=1995889
3•spiros•23m ago•2 comments

Clinically ready magnetic microrobots for targeted therapies

https://www.science.org/doi/10.1126/science.adx1708
2•domofutu•25m ago•0 comments

Session Theft and DPoP

https://byo.propelauth.com/post/session-theft-and-dpop
4•aisrael•25m ago•0 comments

Pyrefly Beta (fast type checker and language server for Python) [video]

https://www.youtube.com/watch?v=4o0RLJJ-FAo
1•ocamoss•26m ago•0 comments

Show HN: Dia2, open-weights TTS model for realtime speech to speech

https://github.com/nari-labs/dia2
1•toebee•26m ago•0 comments

I'm 3.5 webapps deep into nothingness

1•thepra•26m ago•0 comments

Sunday's Memo Robot

https://twitter.com/sundayrobotics/status/1991196264772387261
1•kelguerin•27m ago•0 comments

Linus "my first, and hopefully last flamefest" Torvalds (1992)

http://groups.google.com/group/comp.os.minix/msg/6372404c547d7ab4
2•birdculture•27m ago•0 comments

Versatile gene-switch tool uses non-toxic molecule for safer research

https://phys.org/news/2025-11-versatile-gene-tool-toxic-molecule.html
1•PaulHoule•31m ago•0 comments

Show HN: Build AI chatbots and structured APIs easily with custom RAG knowledge

https://easyai.passiolife.com
1•aebranton•32m ago•1 comments