news newest ask show jobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

https://cds.cern.ch/record/405662/files/PhysRev.47.777.pdf

1•northlondoner•34s ago•1 comments

Kessler Syndrome Has Started [video]

https://www.tiktok.com/@cjtrowbridge/video/7602634355160206623

1•pbradv•3m ago•0 comments

Complex Heterodynes Explained

https://tomverbeure.github.io/2026/02/07/Complex-Heterodyne.html

1•hasheddan•3m ago•0 comments

EVs Are a Failed Experiment

https://spectator.org/evs-are-a-failed-experiment/

1•ArtemZ•15m ago•3 comments

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

https://www.databricks.com/blog/memalign-building-better-llm-judges-human-feedback-scalable-memory

1•superchink•16m ago•0 comments

CCC (Claude's C Compiler) on Compiler Explorer

https://godbolt.org/z/asjc13sa6

1•LiamPowell•17m ago•0 comments

Homeland Security Spying on Reddit Users

https://www.kenklippenstein.com/p/homeland-security-spies-on-reddit

2•duxup•20m ago•0 comments

Actors with Tokio (2021)

https://ryhl.io/blog/actors-with-tokio/

1•vinhnx•21m ago•0 comments

Can graph neural networks for biology realistically run on edge devices?

https://doi.org/10.21203/rs.3.rs-8645211/v1

1•swapinvidya•34m ago•1 comments

Deeper into the shareing of one air conditioner for 2 rooms

1•ozzysnaps•35m ago•0 comments

Weatherman introduces fruit-based authentication system to combat deep fakes

https://www.youtube.com/watch?v=5HVbZwJ9gPE

2•savrajsingh•36m ago•0 comments

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

http://www.effacermonexistence.com/rcc-hn-1-1

1•formerOpenAI•38m ago•2 comments

A Curated List of ML System Design Case Studies

https://github.com/Engineer1999/A-Curated-List-of-ML-System-Design-Case-Studies

3•tejonutella•42m ago•0 comments

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

https://ponyalpha.pro

1•qzcanoe•46m ago•1 comments

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

https://github.com/Goofygiraffe06/tunbot

1•g1raffe•49m ago•0 comments

Open Problems in Mechanistic Interpretability

https://arxiv.org/abs/2501.16496

2•vinhnx•55m ago•0 comments

Bye Bye Humanity: The Potential AMOC Collapse

https://thatjoescott.com/2026/02/03/bye-bye-humanity-the-potential-amoc-collapse/

2•rolph•59m ago•0 comments

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

https://github.com/virattt/dexter

1•Lwrless•1h ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE

1•vermilingua•1h ago•0 comments

Essential CDN: The CDN that lets you do more than JavaScript

https://essentialcdn.fluidity.workers.dev/

1•telui•1h ago•1 comments

They Hijacked Our Tech [video]

https://www.youtube.com/watch?v=-nJM5HvnT5k

1•cedel2k1•1h ago•0 comments

Vouch

https://twitter.com/mitchellh/status/2020252149117313349

36•chwtutha•1h ago•6 comments

HRL Labs in Malibu laying off 1/3 of their workforce

https://www.dailynews.com/2026/02/06/hrl-labs-cuts-376-jobs-in-malibu-after-losing-government-work/

4•osnium123•1h ago•1 comments

Show HN: High-performance bidirectional list for React, React Native, and Vue

https://suhaotian.github.io/broad-infinite-list/

2•jeremy_su•1h ago•0 comments

Show HN: I built a Mac screen recorder Recap.Studio

https://recap.studio/

1•fx31xo•1h ago•1 comments

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

1•kachapopopow•1h ago•0 comments

Vectors and HNSW for Dummies

https://anvitra.ai/blog/vectors-and-hnsw/

1•melvinodsa•1h ago•0 comments

Sanskrit AI beats CleanRL SOTA by 125%

https://huggingface.co/ParamTatva/sanskrit-ppo-hopper-v5/blob/main/docs/blog.md

1•prabhatkr•1h ago•1 comments

'Washington Post' CEO resigns after going AWOL during job cuts

https://www.npr.org/2026/02/07/nx-s1-5705413/washington-post-ceo-resigns-will-lewis

4•thread_id•1h ago•1 comments

Claude Opus 4.6 Fast Mode: 2.5× faster, ~6× more expensive

https://twitter.com/claudeai/status/2020207322124132504

1•geeknews•1h ago•0 comments

Open in hackernews

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

https://arxiv.org/abs/2510.11977

1•adidoit•3mo ago

Comments

adidoit•3mo ago

https://x.com/sayashk/status/1978565190057869344

AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.