AI Agent Testing

1•dinasorous•1h ago
As I work on AI agents, I find myself constantly thinking about how to test them effectively. As we integrate more knowledge sources and expand our agents' capabilities, testing becomes increasingly complex. As a standard practice, we use evals to ensure quality is maintained. But honestly, I feel like something is missing. The issue I'm seeing is that we, as engineers, sometimes lack sufficient domain knowledge to assess an agent's response accurately. At the same time, current tooling makes it hard to collaborate with domain experts and test together. For example, current tooling gives priority to dashboards over the readability of actual outcomes. This has been my experience so far, and I would love to hear your thoughts on this.

Comments

falcor84•1h ago
Your question is really abstract. Maybe give an explanation of your domain, the tooling you're currently using and its particular limitations?

It's also worth saying that the priority given to metrics and dashboards over actual outcomes is a fundamental issue for any structured activity (see e.g. Goodhart's law and Campbell's law), and has little to do with AI.

Song banned from Swedish charts for being AI creation

https://www.bbc.com/news/articles/cp829jey9z7o
1•breve•54s ago•0 comments

Free Your Music

https://avc.xyz/free-your-music
1•wslh•1m ago•0 comments

Was Renee Good Obligated to Comply with an ICE Agent's Orders? (Paywall)

https://www.nytimes.com/2026/01/15/us/renee-good-ice-agent-comply-legal.html
1•connor11528•3m ago•0 comments

Glasgow Interface Explorer Code of Conduct

https://glasgow-embedded.org/latest/conduct.html
1•todsacerdoti•6m ago•0 comments

Implementing Co, a Small Language with Coroutines #5: Adding Sleep

https://abhinavsarkar.net/posts/implementing-co-5/
1•todsacerdoti•6m ago•0 comments

Apache Celeborn: elastic high-performance service for shuffle and spilled data

https://github.com/apache/celeborn
1•tosh•8m ago•0 comments

GoogleSQL

https://docs.cloud.google.com/bigquery/docs/introduction-sql
1•tosh•10m ago•0 comments

Digitization of the Old Town Astronomical Clock: One of Prague's Monuments

https://connect.geant.org/2025/12/17/digitization-of-the-old-town-astronomical-clock-cesnet-revea...
1•taubek•17m ago•0 comments

Claude Tool Search Tool

https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool
1•tin7in•18m ago•0 comments

The Successor to Research Unix Was Plan 9 from Bell Labs

https://www.theregister.com/2024/02/21/successor_to_unix_plan_9/
1•bmacho•18m ago•0 comments

AI Token Usage Leaderboard

https://jtpck.com/leaderboard
1•piratebroadcast•18m ago•0 comments

Show HN: Open Royalties – Fund projects with revenue sharing, not equity

https://openroyalties.org
1•aroussi•18m ago•0 comments

An early look at the Graphite 2D graphics editor

https://lwn.net/Articles/1051242/
1•sohkamyung•18m ago•0 comments

A Systems‑Level Architecture for Integrative Rejuvenation

https://forum.effectivealtruism.org/posts/aNmqSbya5FrG88eAf/a-systems-level-architecture-for-inte...
1•k_n_gk•22m ago•0 comments

Chinese Universities Surge in Global Rankings as U.S. Schools Slip

https://www.nytimes.com/2026/01/15/us/harvard-global-ranking-chinese-universities-trump-cuts.html
3•janpot•23m ago•0 comments

MapQuest: The Brief, Glorious Era of Printed Directions

https://multiverseemployeehandbook.com/blog/mapquest-the-brief-glorious-era-of-printed-directions/
2•TMEHpodcast•25m ago•0 comments

Ask HN: Does GitHub Copilot now leave unsolicited PR review comments?

1•blenderob•25m ago•2 comments

Why Kant and Sontag Cannot Speak Otherwise

https://jimiwen.substack.com/p/some-matters-on-taste
2•jimiwen•32m ago•0 comments

Vertical Solar Panels Survive Storms by 'Swaying' Like Trees

https://www.scientificamerican.com/article/vertical-solar-panels-wind-resistant-trackers-for-high...
1•sohkamyung•34m ago•0 comments

Cracking DXP and SXD

https://www.os2museum.com/wp/cracking-dxp-and-sxd/
1•ingve•35m ago•0 comments

Show HN: US Bank Statement Converter to Excel Ready for LLMs

https://usstatementconverter.com/
2•aleks5678•35m ago•1 comments

Show HN: Glot – Find internationalization issues in Next.js app

https://github.com/Sukitly/glotctl
1•sukit•35m ago•0 comments

Machado Presents Trump with Her Nobel Peace Prize Medal

https://www.nytimes.com/2026/01/15/world/americas/machado-trump-meeting-nobel-peace-prize.html
2•rootlocus•36m ago•0 comments

AI is just starting to change the legal profession

https://www.understandingai.org/p/ai-is-just-starting-to-change-the
3•s-macke•38m ago•1 comments

Bucketing optimization in SQL to deal with skewed data (BigQuery example)

https://smallbigdata.substack.com/p/bucketing-optimization-in-sql-to
1•tosh•41m ago•0 comments

What Was the Metaverse?

https://www.fastcompany.com/91467599/metaverse-zuckerberg-facebook-ai
1•agluszak•42m ago•0 comments

Building an agentic memory system for GitHub Copilot

https://github.blog/ai-and-ml/github-copilot/building-an-agentic-memory-system-for-github-copilot/
2•agluszak•42m ago•0 comments

World's Safest Airline Rankings for 2026

https://www.airlineratings.com/articles/worlds-safest-airlines-for-2026
2•austinallegro•44m ago•0 comments

Show HN: Native PyAnnote (speaker diarizer) in Rust

https://github.com/RustedBytes/pyannote-rs
1•yehors•45m ago•0 comments

Show HN: Automated tech news site with custom multi-LLM agent pipelines

https://wayr.today/how-it-works/
2•siddkgn•46m ago•2 comments