Show HN: Agent-evals – Claude skill to build your own evals

https://github.com/fsilavong/agent-eval
7•sauercrowd•12h ago
I’ve spent the past 10 years working on AI in finance, with much of that time focused on building evaluation systems for production environments.

As agents become more widely adopted, more software engineers and product people have started building them. But I’ve noticed that many teams are not yet fluent in systematic evaluation, or in the processes needed to keep agent quality high over time.

In large organizations, that gap is rarely the bottleneck, because dedicated teams handle evaluation. But after speaking with a number of startups, it became clear that building strong, up-to-date evals is much harder in a fast-moving startup, especially when the team does not have a data science background.

So I tried to condense as much of my experience as possible into a Claude Skill: a practical starting point for evaluating your agent.

The idea is simple: tell Claude you need evals, and it will set up a solid baseline directly in your codebase - that's it! The evals will follow patterns I've seen many times before, and will give you a summary of what your agent does well and what it doesn't.
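
To make the idea concrete, here is a minimal sketch of what a baseline eval harness might look like. It is an illustration only, not the skill's actual output: `EvalCase`, `run_evals`, and `toy_agent` are hypothetical names, and the checks are simple string matches chosen to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test case: a prompt plus a check on the agent's output (hypothetical schema)."""
    name: str
    category: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(
    agent: Callable[[str], str], cases: list[EvalCase]
) -> dict[str, tuple[int, int]]:
    """Run every case and return {category: (passed, total)}."""
    results: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for case in cases:
        try:
            ok = case.check(agent(case.prompt))
        except Exception:
            ok = False  # an exception from the agent or the check counts as a failure
        results[case.category][0] += int(ok)
        results[case.category][1] += 1
    return {category: (passed, total) for category, (passed, total) in results.items()}


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent: it only knows how to uppercase text.
    return prompt.upper()


CASES = [
    EvalCase("upper-1", "formatting", "hello", lambda out: out == "HELLO"),
    EvalCase("upper-2", "formatting", "abc", lambda out: out == "ABC"),
    EvalCase("math-1", "arithmetic", "2+2", lambda out: out.strip() == "4"),
]

if __name__ == "__main__":
    # Print a per-category summary of what the agent does well and what it doesn't.
    for category, (passed, total) in sorted(run_evals(toy_agent, CASES).items()):
        print(f"{category}: {passed}/{total} passed")
```

Grouping results by category is one simple way to get that kind of summary; a real setup would swap `toy_agent` for a call into your own agent and would likely need richer checks when output quality is subjective.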

Looking forward to your feedback!

Comments

johnjudeh•10h ago
Thanks for sharing! It’s way easier to build an agent that can complete a task than to make sure it works across all the cases you care about, especially when the output quality is really subjective.

Show HN: Brainio – Markdown notepad that turns notes into visual mind maps

https://brainio.com/
4•havlenao•34m ago•2 comments

Show HN: nfsdiag – A NFS diagnostic application

https://github.com/lsferreira42/nfsdiag
68•lsferreira42•2d ago•4 comments

Show HN: I Built a Museum Exhibit

https://knhash.in/built-an-exhibit/
28•kn81198•2d ago•1 comments

Show HN: SongShift, an advanced, AI-powered song conversion service

https://songshift.reachnick.co
4•lobf•2h ago•0 comments

Show HN: Retroguard – Verifiably secure AI guardrails

https://retroguard.ai
4•ttttonyhe•2h ago•0 comments

Show HN: I built a native macOS audio player and it changed my life

https://github.com/chrisallick/light-crime-audio-player
10•chrisallick•7h ago•1 comments

Show HN: I indexed 8,643 BSides talks across 227 chapters and 6 continents

https://allbsides.com/
9•Parkado•9h ago•3 comments

Show HN: A tiny C program where an LLM rewires its DAG while running

https://github.com/kouhxp/liteflow
8•mrkn1•6h ago•0 comments

Show HN: Apple's SHARP running in the browser via ONNX runtime web

https://github.com/bring-shrubbery/ml-sharp-web
182•bring-shrubbery•1d ago•44 comments

Show HN: Ableton Live MCP

https://github.com/bschoepke/ableton-live-mcp
115•bschoepke•1d ago•78 comments

Show HN: Kanban-CLI – a web UI for local Markdown todo lists

https://github.com/Vochsel/kanban-cli
6•vochsel•7h ago•0 comments

Show HN: Yames – A distraction-free desktop metronome built with Rust and Tauri

https://turutupa.github.io/yames/
4•turutupa•8h ago•0 comments

Show HN: Node-Vmm – Linux MicroVMs in Pure Node.js for Mac/Windows/Linux in ~1s

https://github.com/misaelzapata/node-vmm
6•misaelzapata•9h ago•0 comments

Show HN: State of the Art of Coding Models, According to Hacker News Commenters

https://hnup.date/hn-sota
157•yunusabd•2d ago•86 comments

Show HN: Pollen – distributed WASM runtime, no control plane, single binary

https://github.com/sambigeara/pollen
133•sambigeara•4d ago•59 comments

Show HN: DAC – open-source dashboard as code tool for agents and humans

https://github.com/bruin-data/dac
114•karakanb•5d ago•35 comments

Show HN: I built a RISC-V emulator that runs DOOM

https://github.com/lalitshankarch/rvcore
48•Flex247A•1d ago•2 comments

Show HN: NeuralScript – A pure-Rust AOT compiler

https://github.com/bwiemz/NSL
3•AkaiNa•11h ago•0 comments

Show HN: Genosyn – Run Autonomous Companies

https://genosyn.com
4•ndhandala•12h ago•0 comments

Show HN: NoReporter – AI-only newsroom, $1/year

https://noreporter.ai
4•egberjustin•13h ago•0 comments

Show HN: Software Engineer to Novelist: Writing a Book Like Coding

https://frequal.com/forwriters/
22•TeaVMFan•1d ago•5 comments

Show HN: WhatCable, a tiny menu bar app for inspecting USB-C cables

https://github.com/darrylmorley/whatcable
557•sleepingNomad•3d ago•166 comments

Show HN: Parrot – a fun, skeuomorphic audio recorder to hear yourself

https://www.zkhrv.com/parrot
18•zkhrv•1d ago•2 comments

Show HN: Muesli – If Granola and Wisprflow had an open source on device baby

https://freedspeech.xyz
14•pHequals7•15h ago•9 comments

Show HN: AI CAD Harness

https://fusion.adam.new/install
98•zachdive•3d ago•95 comments

Show HN: Mljar Studio – local AI data analyst that saves analysis as notebooks

https://mljar.com/
70•pplonski86•2d ago•18 comments

Show HN: Bonsai 1.7B ternary model at 442T/s on M4 Max

https://agents2agents.ai/bonsai
13•hhuytho•16h ago•3 comments

Show HN: Visual SSL TLS Handshake Visualizer

https://www.sitesecurityscore.com/tools/ssl-tls-handshake-checker
3•lemax2•16h ago•0 comments

Show HN: Browser-based light pollution simulator using real photometric data

https://iesna.eu/?wasm=skyglow_demo
43•holg•2d ago•16 comments