frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

https://app.writtte.com/read/kZ8Kj6R
1•lasgawe•18s ago•1 comments

The Story of Heroku (2022)

https://leerob.com/heroku
1•tosh•37s ago•0 comments

Obey the Testing Goat

https://www.obeythetestinggoat.com/
1•mkl95•1m ago•0 comments

Claude Opus 4.6 extends LLM pareto frontier

https://michaelshi.me/pareto/
1•mikeshi42•1m ago•0 comments

Brute Force Colors (2022)

https://arnaud-carre.github.io/2022-12-30-amiga-ham/
1•erickhill•4m ago•0 comments

Google Translate apparently vulnerable to prompt injection

https://www.lesswrong.com/posts/tAh2keDNEEHMXvLvz/prompt-injection-in-google-translate-reveals-ba...
1•julkali•4m ago•0 comments

(Bsky thread) "This turns the maintainer into an unwitting vibe coder"

https://bsky.app/profile/fullmoon.id/post/3meadfaulhk2s
1•todsacerdoti•5m ago•0 comments

Software development is undergoing a Renaissance in front of our eyes

https://twitter.com/gdb/status/2019566641491963946
1•tosh•6m ago•0 comments

Can you beat ensloppification? I made a quiz for Wikipedia's Signs of AI Writing

https://tryward.app/aiquiz
1•bennydog224•7m ago•1 comments

Spec-Driven Design with Kiro: Lessons from Seddle

https://medium.com/@dustin_44710/spec-driven-design-with-kiro-lessons-from-seddle-9320ef18a61f
1•nslog•7m ago•0 comments

Agents need good developer experience too

https://modal.com/blog/agents-devex
1•birdculture•8m ago•0 comments

The Dark Factory

https://twitter.com/i/status/2020161285376082326
1•Ozzie_osman•8m ago•0 comments

Free data transfer out to internet when moving out of AWS (2024)

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/
1•tosh•9m ago•0 comments

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•alwillis•11m ago•0 comments

Prejudice Against Leprosy

https://text.npr.org/g-s1-108321
1•hi41•12m ago•0 comments

Slint: Cross Platform UI Library

https://slint.dev/
1•Palmik•16m ago•0 comments

AI and Education: Generative AI and the Future of Critical Thinking

https://www.youtube.com/watch?v=k7PvscqGD24
1•nyc111•16m ago•0 comments

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•17m ago•0 comments

Moltbook isn't real but it can still hurt you

https://12gramsofcarbon.com/p/tech-things-moltbook-isnt-real-but
1•theahura•20m ago•0 comments

Take Back the Em Dash–and Your Voice

https://spin.atomicobject.com/take-back-em-dash/
1•ingve•21m ago•0 comments

Show HN: 289x speedup over MLP using Spectral Graphs

https://zenodo.org/login/?next=%2Fme%2Fuploads%3Fq%3D%26f%3Dshared_with_me%25253Afalse%26l%3Dlist...
1•andrespi•22m ago•0 comments

Teaching Mathematics

https://www.karlin.mff.cuni.cz/~spurny/doc/articles/arnold.htm
2•samuel246•24m ago•0 comments

3D Printed Microfluidic Multiplexing [video]

https://www.youtube.com/watch?v=VZ2ZcOzLnGg
2•downboots•24m ago•0 comments

Abstractions Are in the Eye of the Beholder

https://software.rajivprab.com/2019/08/29/abstractions-are-in-the-eye-of-the-beholder/
2•whack•25m ago•0 comments

Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

https://zenodo.org/records/18518956
1•MikeBee•25m ago•0 comments

We didn't ask for this internet – Ezra Klein show [video]

https://www.youtube.com/shorts/ve02F0gyfjY
1•softwaredoug•26m ago•0 comments

The Real AI Talent War Is for Plumbers and Electricians

https://www.wired.com/story/why-there-arent-enough-electricians-and-plumbers-to-build-ai-data-cen...
2•geox•29m ago•0 comments

Show HN: MimiClaw, OpenClaw(Clawdbot)on $5 Chips

https://github.com/memovai/mimiclaw
1•ssslvky1•29m ago•0 comments

I Maintain My Blog in the Age of Agents

https://www.jerpint.io/blog/2026-02-07-how-i-maintain-my-blog-in-the-age-of-agents/
3•jerpint•29m ago•0 comments

The Fall of the Nerds

https://www.noahpinion.blog/p/the-fall-of-the-nerds
1•otoolep•31m ago•0 comments
Open in hackernews

Evals in 2025: going beyond simple benchmarks to build models people can use

https://github.com/huggingface/evaluation-guidebook/blob/main/yearly_dives/2025-evaluations-for-useful-models.md
80•jxmorris12•4mo ago

Comments

aplassard•4mo ago
I think cost should also be a direct consideration. Model performance varies wildly on benchmarks when given a budget. https://substack.com/@andrewplassard/note/p-173487568?r=2fqo...
elemeno•4mo ago
I’ve been building a tool to help with this - Safety Evals In-a-Box [https://github.com/elemeno/seibox]. It’s a work in progress and not quite ready for public release, but its a multi-model eval runner (primarily for safety oriented evals, but no reason why it can run other types as well!) and includes cost and latency in it reporting.
andy99•4mo ago
These can be useful for labs training models. I don't see them as particularly valuable for building AI systems. Real performance depends on how the system is built, much more so than the underlying LLM.

Evaluating the system you build on relevant inputs is most important. Beyond that it would be nice to see benchmarks that give guidance on how and LLM should be used as a system component, not just which is "better" at something.

axpy906•4mo ago
My thoughts were this. The moment it is public it’s probably in the training data set. The real evals are the ones that you have to make an a problem you’re trying to solve and the data you are using.
gdiamos•4mo ago
How can the community tell if models overfit to these benchmarks?
kovezd•4mo ago
By the composition of evals. Plus secondary metrics like parameter size, and token cost.

Not perfect, but useful.

dustrider•4mo ago
Move beyond benchmarks… proceed to list a bunch of benchmarks.

The problem for me is that it’s not worth running these myself, yeah I may pay attention to which model is better at tool calling. But what matters is how well it does at my use case.

6Az4Mj4D•4mo ago
I see there are lots of courses being sold for Evals in Maven. Some are as costly as USD 3500. Are they worth it? https://maven.com/parlance-labs/evals