frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Show HN: I built a clawdbot that texts like your crush

https://14.israelfirew.co
1•IsruAlpha•48s ago•0 comments

Scientists reverse Alzheimer's in mice and restore memory (2025)

https://www.sciencedaily.com/releases/2025/12/251224032354.htm
1•walterbell•3m ago•0 comments

Compiling Prolog to Forth [pdf]

https://vfxforth.com/flag/jfar/vol4/no4/article4.pdf
1•todsacerdoti•5m ago•0 comments

Show HN: Cymatica – an experimental, meditative audiovisual app

https://apps.apple.com/us/app/cymatica-sounds-visualizer/id6748863721
1•_august•6m ago•0 comments

GitBlack: Tracing America's Foundation

https://gitblack.vercel.app/
1•martialg•6m ago•0 comments

Horizon-LM: A RAM-Centric Architecture for LLM Training

https://arxiv.org/abs/2602.04816
1•chrsw•7m ago•0 comments

We just ordered shawarma and fries from Cursor [video]

https://www.youtube.com/shorts/WALQOiugbWc
1•jeffreyjin•7m ago•1 comments

Correctio

https://rhetoric.byu.edu/Figures/C/correctio.htm
1•grantpitt•8m ago•0 comments

Trying to make an Automated Ecologist: A first pass through the Biotime dataset

https://chillphysicsenjoyer.substack.com/p/trying-to-make-an-automated-ecologist
1•crescit_eundo•12m ago•0 comments

Watch Ukraine's Minigun-Firing, Drone-Hunting Turboprop in Action

https://www.twz.com/air/watch-ukraines-minigun-firing-drone-hunting-turboprop-in-action
1•breve•13m ago•0 comments

Free Trial: AI Interviewer

https://ai-interviewer.nuvoice.ai/
1•sijain2•13m ago•0 comments

FDA Intends to Take Action Against Non-FDA-Approved GLP-1 Drugs

https://www.fda.gov/news-events/press-announcements/fda-intends-take-action-against-non-fda-appro...
8•randycupertino•14m ago•2 comments

Supernote e-ink devices for writing like paper

https://supernote.eu/choose-your-product/
3•janandonly•16m ago•0 comments

We are QA Engineers now

https://serce.me/posts/2026-02-05-we-are-qa-engineers-now
1•SerCe•17m ago•0 comments

Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

https://arxiv.org/abs/2602.01465
2•NBenkovich•17m ago•0 comments

Adversarial Reasoning: Multiagent World Models for Closing the Simulation Gap

https://www.latent.space/p/adversarial-reasoning
1•swyx•17m ago•0 comments

Show HN: Poddley.com – Follow people, not podcasts

https://poddley.com/guests/ana-kasparian/episodes
1•onesandofgrain•25m ago•0 comments

Layoffs Surge 118% in January – The Highest Since 2009

https://www.cnbc.com/2026/02/05/layoff-and-hiring-announcements-hit-their-worst-january-levels-si...
8•karakoram•25m ago•0 comments

Papyrus 114: Homer's Iliad

https://p114.homemade.systems/
1•mwenge•25m ago•1 comments

DicePit – Real-time multiplayer Knucklebones in the browser

https://dicepit.pages.dev/
1•r1z4•26m ago•1 comments

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

https://arxiv.org/abs/2601.14340
2•PaulHoule•27m ago•0 comments

Show HN: AI Agent Tool That Keeps You in the Loop

https://github.com/dshearer/misatay
2•dshearer•28m ago•0 comments

Why Every R Package Wrapping External Tools Needs a Sitrep() Function

https://drmowinckels.io/blog/2026/sitrep-functions/
1•todsacerdoti•29m ago•0 comments

Achieving Ultra-Fast AI Chat Widgets

https://www.cjroth.com/blog/2026-02-06-chat-widgets
1•thoughtfulchris•31m ago•0 comments

Show HN: Runtime Fence – Kill switch for AI agents

https://github.com/RunTimeAdmin/ai-agent-killswitch
1•ccie14019•33m ago•1 comments

Researchers surprised by the brain benefits of cannabis usage in adults over 40

https://nypost.com/2026/02/07/health/cannabis-may-benefit-aging-brains-study-finds/
2•SirLJ•35m ago•0 comments

Peter Thiel warns the Antichrist, apocalypse linked to the 'end of modernity'

https://fortune.com/2026/02/04/peter-thiel-antichrist-greta-thunberg-end-of-modernity-billionaires/
4•randycupertino•36m ago•2 comments

USS Preble Used Helios Laser to Zap Four Drones in Expanding Testing

https://www.twz.com/sea/uss-preble-used-helios-laser-to-zap-four-drones-in-expanding-testing
3•breve•41m ago•0 comments

Show HN: Animated beach scene, made with CSS

https://ahmed-machine.github.io/beach-scene/
1•ahmedoo•42m ago•0 comments

An update on unredacting select Epstein files – DBC12.pdf liberated

https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/
3•ks2048•42m ago•0 comments
Open in hackernews

Evaluations for Testing Agentic AI

1•stichers•1w ago
What do you think about measuring agentic AI in practice. A few weeks ago I read something on Anthropic’s blog on evals for AI agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, and then yesterday saw this on Medium https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a Feels like this is becoming a thing. Anthropic talk about how to structure agent evals and what they’ve learned from running these internally. The QuantumBlack post clooks more programmatic or lifecyle focussed. How evals need to change once agents are combined and using tools. What to do whenthey're deployed, and how to factor them in early.

Curious what peeple are doing in the real world. Are you rolling your own task suites? Using offline + online evals? Or mostly vibes and logs for now?

Comments

alexhans•6d ago
> Curious what peeple are doing in the real world

It's hard to answer this as the right thing to do might depend a lot on the environment.

My pitch would be that automation is not a new problem and the fundamentals still apply:

- We can’t compare what we can’t measure

- I need to ask myself, can I trust this to run on its own?

- I need to be able to describe what I want

- I need to understand my risk profile

Once you're there, you can use evals-first in a TDD ish style, and keep control throughout your journey, not over-investing, not losing control and allowing people to join in understanding those fundamentals, regardless of whether they're coming from a very solid software engineering background or not.

To be specific, for cross functional teams with > 90% non software engineers, we invested in programmatic first tools (Think claude/codex/continue.dev cli) over "hype" UI solutions because we knew integration would come through MCPs (and then skills), we designed in such a way that the agents/clis/black boxes were decoupled and no particular change/resource limitation would destroy our ability to continue. Focusing a lot on "not having to maintain" (ZeroOps) whatever we do and not having to worry [1] about manually monitoring.

For evals I suggest you explore 2 different tools, such as promptfoo (Technically no programming required for an eval creator, just yaml) and deepevals/pydanticai-evals (python based) and see how you feel about using either based on your context/team composition needs. Once you start, some things will start to click such as the value of bounding non-determinism [2] as much as possible and you might even have proof for them. Other challenges will come later (such as cost, system tests, sandboxing and more).

- [1] https://alexhans.github.io/posts/series/zeroops/no-news-is-g...

- [2] https://alexhans.github.io/posts/series/evals/error-compound...