Hi HN,
I built EvalView after an agent that worked fine in dev started inventing numbers in prod. Tracing showed me what happened after the fact, but I wanted CI to fail the deploy the moment the agent drifted.
EvalView is basically pytest for agents: you write a YAML test, run it N times (to catch flakiness), and fail the build if behavior regresses. Example:
```yaml
name: "Refund policy doesn't hallucinate"
runs: 10
pass_rate: 0.8
input:
  query: "What's our refund policy?"
assert:
  - tool_called: "kb_search"
  - no_unsupported_claims: true
  - max_cost_usd: 0.05
```
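The CI wiring is intentionally boring: a step that runs the suite and exits non-zero if any test misses its pass rate, so the PR check goes red. Roughly like this (the package and command names here are simplified for illustration, not a copy-paste config):

```yaml
# .github/workflows/evals.yml -- illustrative wiring
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # install and command names simplified for illustration
      - run: pip install evalview
      # runs each YAML test its configured number of times;
      # a non-zero exit (pass rate below threshold) fails the PR
      - run: evalview run evals/
```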
Instead of exact text matching, the checks focus on constraints: did it call the right tools, did it make claims not supported by tool results/context, and did it stay within cost/latency budgets.

I also added an optional local LLM-as-judge via Ollama so evals don’t burn API credits on every run.

If you’re shipping agents to prod, what’s been your worst failure mode: tool misuse, budget blowups, or confident nonsense?

Happy to answer questions.

Hidai