Ask HN: How are people doing AI evals these days?

7•yelmahallawy•19h ago
With all the buzz around new AI models being released (what feels like every other week), how are companies running internal AI evals to determine which model is best for their use case?

Comments

alexhans•16h ago
It's a very heterogeneous and fast-moving space.

Depending on how they're made up, different teams do vastly different things.

Some run no evals at all, some run integration tests with no tooling, and some use observability tools like Langfuse in their CI/CD. Others use tools like Arize Phoenix, DeepEval, Braintrust, promptfoo, or Pydantic AI throughout their development.

It's definitely an afterthought for most teams although we are starting to see increased interest.

My hope is that we can start thinking about evals as a common language for "product" across role families, so I'm trying some advocacy [1]. I'm keeping it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" is still quite a hard concept, but the UX is getting better with time.

What are you doing for yourself/teams? If not much yet, I'd recommend just starting and figuring out where the friction and value are for you.

- [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)

bisonbear•5h ago
Assuming you're referring to coding agents: I don't think people are. If they are, they're likely using:

- AI to evaluate itself (e.g. asking Claude to test out its own skill)
- a custom-built platform (I see interest in this space)

I've actually been thinking about this problem a lot and am working on making a custom eval runner for your codebase. What would your use case be for this?

celestialcheese•1h ago
A mix of promptfoo and ad-hoc Python scripts, with Langfuse for observability.

Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.

kelseyfrog•1h ago
Automated benchmarking.

We were lucky enough to have PMs create a set of questions. We did a round of generation and labeled each response with pass/fail annotations.

From there we bootstrapped an AI-as-a-judge and approximately replicated the results. Now we can plug in new models and change prompts or pipelines while still approximating the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings.

We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.
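The "approximately replicated the results" step above can be checked with a small script. What follows is only a sketch under stated assumptions, not the setup described in the comment: `LabeledExample`, `judge_agreement`, and `toy_judge` are hypothetical names, and `toy_judge` is a keyword check standing in for a real LLM judge call. It reports raw agreement with the human pass/fail labels plus Cohen's kappa, which corrects for agreement that would happen by chance.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class LabeledExample:
    question: str
    response: str
    human_pass: bool  # pass/fail annotation from the manual labeling round


def judge_agreement(
    examples: Sequence[LabeledExample],
    judge: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) between a judge and human labels."""
    if not examples:
        raise ValueError("need at least one labeled example")
    n = len(examples)
    verdicts = [judge(ex.question, ex.response) for ex in examples]

    # Fraction of examples where the judge matches the human label.
    observed = sum(v == ex.human_pass for v, ex in zip(verdicts, examples)) / n

    # Agreement expected by chance, given each rater's pass rate.
    human_rate = sum(ex.human_pass for ex in examples) / n
    judge_rate = sum(verdicts) / n
    expected = human_rate * judge_rate + (1 - human_rate) * (1 - judge_rate)

    if expected == 1.0:
        # Both raters only ever used one label; kappa is undefined, report 0.0.
        return observed, 0.0
    return observed, (observed - expected) / (1 - expected)


def toy_judge(question: str, response: str) -> bool:
    # Hypothetical stand-in: a real judge would call a model with a rubric.
    return "30 days" in response


# Made-up examples purely for illustration.
EXAMPLES = [
    LabeledExample("Refund window?", "The refund window is 30 days.", True),
    LabeledExample("Refund window?", "I don't know.", False),
    LabeledExample("Refund window?", "You have 30 days to ask for a refund.", True),
    LabeledExample("Refund window?", "30 days, probably?", False),
]

if __name__ == "__main__":
    agreement, kappa = judge_agreement(EXAMPLES, toy_judge)
    print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Once the judge tracks the human labels closely enough for your purposes, the same function can be pointed at responses from a new model or prompt to spot regressions.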

maxalbarello•1h ago
Also wondering how to eval agentic pipelines. For instance, I generated memories from my ChatGPT conversation history. How do I know whether they are accurate or not?

I would like a single number to optimize the pipeline against, but I find it hard to figure out what that number should measure.
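One candidate for such a number, assuming you can hand-label the memories that should come out of a small sample of conversations, is F1 between extracted and hand-labeled memories. The sketch below is illustrative only: `memory_f1` is a hypothetical helper, and it matches memories by normalized exact string, which is a deliberate simplification (in practice a human or a judge model would decide whether two memories say the same thing).

```python
from typing import Iterable, Tuple


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences don't count.
    return " ".join(text.lower().split())


def memory_f1(extracted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of extracted memories against a gold set."""
    pred = {_normalize(m) for m in extracted}
    ref = {_normalize(m) for m in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0

    true_positives = len(pred & ref)
    precision = true_positives / len(pred)  # how many extracted memories are right
    recall = true_positives / len(ref)      # how many true memories were found
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up data purely for illustration.
    extracted = ["Lives in Berlin", "Has a dog", "Likes tennis"]
    gold = ["lives in berlin", "has a dog", "works as a nurse", "is vegetarian"]
    p, r, f1 = memory_f1(extracted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision penalizes invented memories and recall penalizes missed ones, so tracking both alongside the combined score shows which failure mode a pipeline change affects.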

dkoy•1h ago
Curious who’s used OpenAI Evals