frontpage.

Show HN: EvalView, pytest-style tests for AI agents (budgets, hallucinations)

https://github.com/hidai25/eval-view
1•hidai25•1h ago

Comments

hidai25•1h ago
Hi HN, I built EvalView after an agent that worked fine in dev started inventing numbers in prod. Tracing showed me what happened after the fact, but I wanted CI to fail the deploy the moment the agent drifted. EvalView is basically pytest for agents: you write a YAML test, run it N times (to catch flakiness), and fail the build if behavior regresses. Example:

```yaml
name: "Refund policy doesn't hallucinate"
runs: 10
pass_rate: 0.8
input:
  query: "What's our refund policy?"
assert:
  - tool_called: "kb_search"
  - no_unsupported_claims: true
  - max_cost_usd: 0.05
```
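For readers wondering how `runs` and `pass_rate` might combine, here is a minimal Python sketch of a pass-rate gate. It is not EvalView's implementation or API: `RunResult`, `run_passes`, and `pass_rate_gate` are hypothetical names, the agent is assumed to be a plain callable that returns the tools it called and its cost, and the `no_unsupported_claims` check is left out because it would need a judge model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunResult:
    """Hypothetical record of one agent run (not EvalView's real schema)."""
    tools_called: List[str] = field(default_factory=list)
    cost_usd: float = 0.0


def run_passes(result: RunResult, required_tool: str, max_cost_usd: float) -> bool:
    """A run passes if the required tool was called and the cost budget held."""
    return required_tool in result.tools_called and result.cost_usd <= max_cost_usd


def pass_rate_gate(
    agent: Callable[[str], RunResult],
    query: str,
    runs: int,
    pass_rate: float,
    required_tool: str,
    max_cost_usd: float,
) -> bool:
    """Run the agent `runs` times; return True if enough runs pass."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(
        run_passes(agent(query), required_tool, max_cost_usd) for _ in range(runs)
    )
    # Fraction of passing runs must reach the configured threshold.
    return passed / runs >= pass_rate
```

In CI, a wrapper could exit non-zero whenever `pass_rate_gate` returns `False`, which is what would fail the build. Requiring a fraction of runs rather than all of them tolerates occasional nondeterminism while still catching a consistent regression.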

Instead of exact text matching, the checks focus on constraints: did it call the right tools, did it make claims not supported by tool results/context, and did it stay within cost/latency budgets. I also added optional local LLM-as-judge via Ollama so evals don’t burn API credits on every run.

If you’re shipping agents to prod, what’s been your worst failure mode: tool misuse, budget blowups, or confident nonsense? Happy to answer questions.

Hidai

The Difference Between the Alarm and the Panic

https://fafi25.substack.com/p/the-difference-between-the-alarm
1•andrewstetsenko•49s ago•0 comments

Developers can now submit apps to ChatGPT

https://openai.com/index/developers-can-now-submit-apps-to-chatgpt/
1•tananaev•55s ago•0 comments

Show HN: CCS – A Multi-Account Switcher and Model Manager for Claude Code

https://github.com/kaitranntt/ccs
1•dhiyaan•2m ago•0 comments

In-progress Call causes Screen Flickering

https://github.com/anthropics/claude-code/issues/769
1•ximeng•3m ago•1 comments

How Cal.com shipped an iOS/Android App in 3 weeks

https://cal.com/blog/how-cal.com-shipped-an-ios-android-app-using-expo-and-chrome-firefox-using-w...
1•sdko•3m ago•0 comments

Railway Incident December 16th, 2025

https://blog.railway.com/p/incident-report-december-16-2025
1•sdko•4m ago•0 comments

Building ChatGPT Apps with Supabase Edge Functions and MCP-Use

https://supabase.com/blog/building-chatgpt-apps-with-supabase
1•luigipederzani•5m ago•0 comments

How America's Education System Became a Weapon Against Itself

https://sleuthfox.substack.com/p/the-trojan-horse-how-americas-education
1•mhb•7m ago•0 comments

A look back: LANPAR, the first spreadsheet

https://technicallywewrite.com/2025/12/16/lanpar
1•rbanffy•8m ago•0 comments

China's Big AI Diffusion Plan Is Here. Will It Work?

https://mattsheehan.substack.com/p/chinas-big-ai-diffusion-plan-is-here
1•toomuchtodo•9m ago•0 comments

Backchanneling Is Becoming a Crutch

https://www.cristina.com/blog/backchanneling
1•cristinacordova•9m ago•0 comments

Saturn's biggest moon might not have an ocean after all

https://phys.org/news/2025-12-saturn-biggest-moon-ocean.html
2•bikenaga•11m ago•0 comments

Gemini 3 Flash Rivals Frontier Models at a Fraction of the Cost

https://thenewstack.io/googles-new-gemini-3-flash-rivals-frontier-models-at-a-fraction-of-the-cost/
2•coloneltcb•11m ago•1 comments

Billionaire Jared Isaacman, confirmed as NASA chief

https://www.bbc.com/news/articles/c5ydvlx28kwo
1•belter•11m ago•0 comments

Ask HN: Can upside down faces solve face recognition while wearing N95 masks?

2•amichail•11m ago•0 comments

Detrans AI: The collective consciousness of detransitioners

https://detrans.ai/
2•nettol•16m ago•0 comments

Making GitHub Actions Fast(er) & Cheaper with Dedicated Runners

https://ali-dev.medium.com/making-github-ci-cd-fast-er-cheaper-with-dedicated-runners-55612586afd7
1•stringtoint•19m ago•0 comments

Fei-Fei Li of World Labs: AI is incomplete without spatial intelligence

https://www.ft.com/content/d8fec7b5-f64a-4c5b-8439-6b8fe557be95
1•bookofjoe•20m ago•1 comments

From pr0n to playlists and paperclips, trio of breaches spills data of millions

https://www.theregister.com/2025/12/16/trio_of_breaches/
2•Bender•21m ago•0 comments

Show HN: A browser game about the AI alignment problem

https://thechoicebeforeus.com/
1•NickSharp•22m ago•0 comments

From Georgia to Essex, AI datacenters are testing public goodwill

https://www.theregister.com/2025/12/16/datacenter_development_controversy/
1•Bender•23m ago•0 comments

Cisco says Chinese hackers are exploiting its customers with a new zero-day

https://techcrunch.com/2025/12/17/cisco-says-chinese-hackers-are-exploiting-its-customers-with-a-...
6•fortran77•23m ago•0 comments

Browser 'privacy' extensions have eye on your AI, log all your chats

https://www.theregister.com/2025/12/16/chrome_edge_privacy_extensions_quietly/
2•Bender•24m ago•0 comments

Boom Supersonic raises $300M to build natural gas turbines for data centers

https://techcrunch.com/2025/12/09/boom-supersonic-raises-300m-to-build-natural-gas-turbines-for-c...
1•CGMthrowaway•27m ago•1 comments

Not as intelligent as they are thought to be

1•wef•27m ago•0 comments

OpenAI's the State of Enterprise AI

https://newsletter.eng-leadership.com/p/openais-report-the-state-of-enterprise
1•rbanffy•27m ago•0 comments

StorageReview Sets New Pi Record: 314T Digits on a Dell PowerEdge R7725

https://www.storagereview.com/review/storagereview-sets-new-pi-record-314-trillion-digits-on-a-de...
2•rbanffy•28m ago•0 comments

Skills vs. Dynamic MCP Loadouts

https://lucumr.pocoo.org/2025/12/13/skills-vs-mcp/
1•gmays•28m ago•0 comments

Linux Patches Begin Adapting Raid Code to Use Folios

https://www.phoronix.com/news/Linux-RAID-MD-Folios
2•doener•29m ago•0 comments

Bad CSS-Dad Jokes

https://alvaromontoro.com/blog/68087/bad-css-dad-jokes-v
1•ulrischa•29m ago•0 comments