frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Show HN: Empusa – Visual debugger to catch and resume AI agent retry loops

https://github.com/justin55afdfdsf5ds45f4ds5f45ds4/EmpusaAI
1•justinlord•2m ago•0 comments

Show HN: Bitcoin wallet on NXP SE050 secure element, Tor-only open source

https://github.com/0xdeadbeefnetwork/sigil-web
2•sickthecat•4m ago•0 comments

White House Explores Opening Antitrust Probe on Homebuilders

https://www.bloomberg.com/news/articles/2026-02-06/white-house-explores-opening-antitrust-probe-i...
1•petethomas•5m ago•0 comments

Show HN: MindDraft – AI task app with smart actions and auto expense tracking

https://minddraft.ai
2•imthepk•10m ago•0 comments

How do you estimate AI app development costs accurately?

1•insights123•11m ago•0 comments

Going Through Snowden Documents, Part 5

https://libroot.org/posts/going-through-snowden-documents-part-5/
1•goto1•11m ago•0 comments

Show HN: MCP Server for TradeStation

https://github.com/theelderwand/tradestation-mcp
1•theelderwand•14m ago•0 comments

Canada unveils auto industry plan in latest pivot away from US

https://www.bbc.com/news/articles/cvgd2j80klmo
2•breve•15m ago•0 comments

The essential Reinhold Niebuhr: selected essays and addresses

https://archive.org/details/essentialreinhol0000nieb
1•baxtr•18m ago•0 comments

Rentahuman.ai Turns Humans into On-Demand Labor for AI Agents

https://www.forbes.com/sites/ronschmelzer/2026/02/05/when-ai-agents-start-hiring-humans-rentahuma...
1•tempodox•19m ago•0 comments

StovexGlobal – Compliance Gaps to Note

1•ReviewShield•22m ago•1 comments

Show HN: Afelyon – Turns Jira tickets into production-ready PRs (multi-repo)

https://afelyon.com/
1•AbduNebu•23m ago•0 comments

Trump says America should move on from Epstein – it may not be that easy

https://www.bbc.com/news/articles/cy4gj71z0m0o
5•tempodox•24m ago•2 comments

Tiny Clippy – A native Office Assistant built in Rust and egui

https://github.com/salva-imm/tiny-clippy
1•salvadorda656•28m ago•0 comments

LegalArgumentException: From Courtrooms to Clojure – Sen [video]

https://www.youtube.com/watch?v=cmMQbsOTX-o
1•adityaathalye•31m ago•0 comments

US moves to deport 5-year-old detained in Minnesota

https://www.reuters.com/legal/government/us-moves-deport-5-year-old-detained-minnesota-2026-02-06/
5•petethomas•34m ago•2 comments

If you lose your passport in Austria, head for McDonald's Golden Arches

https://www.cbsnews.com/news/us-embassy-mcdonalds-restaurants-austria-hotline-americans-consular-...
1•thunderbong•39m ago•0 comments

Show HN: Mermaid Formatter – CLI and library to auto-format Mermaid diagrams

https://github.com/chenyanchen/mermaid-formatter
1•astm•55m ago•0 comments

RFCs vs. READMEs: The Evolution of Protocols

https://h3manth.com/scribe/rfcs-vs-readmes/
2•init0•1h ago•1 comments

Kanchipuram Saris and Thinking Machines

https://altermag.com/articles/kanchipuram-saris-and-thinking-machines
1•trojanalert•1h ago•0 comments

Chinese chemical supplier causes global baby formula recall

https://www.reuters.com/business/healthcare-pharmaceuticals/nestle-widens-french-infant-formula-r...
2•fkdk•1h ago•0 comments

I've used AI to write 100% of my code for a year as an engineer

https://old.reddit.com/r/ClaudeCode/comments/1qxvobt/ive_used_ai_to_write_100_of_my_code_for_1_ye...
2•ukuina•1h ago•1 comments

Looking for 4 Autistic Co-Founders for AI Startup (Equity-Based)

1•au-ai-aisl•1h ago•1 comments

AI-native capabilities, a new API Catalog, and updated plans and pricing

https://blog.postman.com/new-capabilities-march-2026/
1•thunderbong•1h ago•0 comments

What changed in tech from 2010 to 2020?

https://www.tedsanders.com/what-changed-in-tech-from-2010-to-2020/
3•endorphine•1h ago•0 comments

From Human Ergonomics to Agent Ergonomics

https://wesmckinney.com/blog/agent-ergonomics/
1•Anon84•1h ago•0 comments

Advanced Inertial Reference Sphere

https://en.wikipedia.org/wiki/Advanced_Inertial_Reference_Sphere
1•cyanf•1h ago•0 comments

Toyota Developing a Console-Grade, Open-Source Game Engine with Flutter and Dart

https://www.phoronix.com/news/Fluorite-Toyota-Game-Engine
2•computer23•1h ago•0 comments

Typing for Love or Money: The Hidden Labor Behind Modern Literary Masterpieces

https://publicdomainreview.org/essay/typing-for-love-or-money/
1•prismatic•1h ago•0 comments

Show HN: A longitudinal health record built from fragmented medical data

https://myaether.live
1•takmak007•1h ago•0 comments
Open in hackernews

Evaluations for Testing Agentic AI

1•stichers•1w ago
What do you think about measuring agentic AI in practice. A few weeks ago I read something on Anthropic’s blog on evals for AI agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, and then yesterday saw this on Medium https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a Feels like this is becoming a thing. Anthropic talk about how to structure agent evals and what they’ve learned from running these internally. The QuantumBlack post clooks more programmatic or lifecyle focussed. How evals need to change once agents are combined and using tools. What to do whenthey're deployed, and how to factor them in early.

Curious what peeple are doing in the real world. Are you rolling your own task suites? Using offline + online evals? Or mostly vibes and logs for now?

Comments

alexhans•6d ago
> Curious what peeple are doing in the real world

It's hard to answer this as the right thing to do might depend a lot on the environment.

My pitch would be that automation is not a new problem and the fundamentals still apply:

- We can’t compare what we can’t measure

- I need to ask myself, can I trust this to run on its own?

- I need to be able to describe what I want

- I need to understand my risk profile

Once you're there, you can use evals-first in a TDD ish style, and keep control throughout your journey, not over-investing, not losing control and allowing people to join in understanding those fundamentals, regardless of whether they're coming from a very solid software engineering background or not.

To be specific, for cross functional teams with > 90% non software engineers, we invested in programmatic first tools (Think claude/codex/continue.dev cli) over "hype" UI solutions because we knew integration would come through MCPs (and then skills), we designed in such a way that the agents/clis/black boxes were decoupled and no particular change/resource limitation would destroy our ability to continue. Focusing a lot on "not having to maintain" (ZeroOps) whatever we do and not having to worry [1] about manually monitoring.

For evals I suggest you explore 2 different tools, such as promptfoo (Technically no programming required for an eval creator, just yaml) and deepevals/pydanticai-evals (python based) and see how you feel about using either based on your context/team composition needs. Once you start, some things will start to click such as the value of bounding non-determinism [2] as much as possible and you might even have proof for them. Other challenges will come later (such as cost, system tests, sandboxing and more).

- [1] https://alexhans.github.io/posts/series/zeroops/no-news-is-g...

- [2] https://alexhans.github.io/posts/series/evals/error-compound...