
Scientists reverse Alzheimer's in mice and restore memory (2025)

https://www.sciencedaily.com/releases/2025/12/251224032354.htm
1•walterbell•2m ago•0 comments

Compiling Prolog to Forth [pdf]

https://vfxforth.com/flag/jfar/vol4/no4/article4.pdf
1•todsacerdoti•4m ago•0 comments

Show HN: Cymatica – an experimental, meditative audiovisual app

https://apps.apple.com/us/app/cymatica-sounds-visualizer/id6748863721
1•_august•5m ago•0 comments

GitBlack: Tracing America's Foundation

https://gitblack.vercel.app/
1•martialg•5m ago•0 comments

Horizon-LM: A RAM-Centric Architecture for LLM Training

https://arxiv.org/abs/2602.04816
1•chrsw•6m ago•0 comments

We just ordered shawarma and fries from Cursor [video]

https://www.youtube.com/shorts/WALQOiugbWc
1•jeffreyjin•7m ago•1 comments

Correctio

https://rhetoric.byu.edu/Figures/C/correctio.htm
1•grantpitt•7m ago•0 comments

Trying to make an Automated Ecologist: A first pass through the Biotime dataset

https://chillphysicsenjoyer.substack.com/p/trying-to-make-an-automated-ecologist
1•crescit_eundo•11m ago•0 comments

Watch Ukraine's Minigun-Firing, Drone-Hunting Turboprop in Action

https://www.twz.com/air/watch-ukraines-minigun-firing-drone-hunting-turboprop-in-action
1•breve•12m ago•0 comments

Free Trial: AI Interviewer

https://ai-interviewer.nuvoice.ai/
1•sijain2•12m ago•0 comments

FDA Intends to Take Action Against Non-FDA-Approved GLP-1 Drugs

https://www.fda.gov/news-events/press-announcements/fda-intends-take-action-against-non-fda-appro...
7•randycupertino•13m ago•2 comments

Supernote e-ink devices for writing like paper

https://supernote.eu/choose-your-product/
3•janandonly•15m ago•0 comments

We are QA Engineers now

https://serce.me/posts/2026-02-05-we-are-qa-engineers-now
1•SerCe•16m ago•0 comments

Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

https://arxiv.org/abs/2602.01465
2•NBenkovich•16m ago•0 comments

Adversarial Reasoning: Multiagent World Models for Closing the Simulation Gap

https://www.latent.space/p/adversarial-reasoning
1•swyx•16m ago•0 comments

Show HN: Poddley.com – Follow people, not podcasts

https://poddley.com/guests/ana-kasparian/episodes
1•onesandofgrain•24m ago•0 comments

Layoffs Surge 118% in January – The Highest Since 2009

https://www.cnbc.com/2026/02/05/layoff-and-hiring-announcements-hit-their-worst-january-levels-si...
7•karakoram•24m ago•0 comments

Papyrus 114: Homer's Iliad

https://p114.homemade.systems/
1•mwenge•25m ago•1 comments

DicePit – Real-time multiplayer Knucklebones in the browser

https://dicepit.pages.dev/
1•r1z4•25m ago•1 comments

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

https://arxiv.org/abs/2601.14340
2•PaulHoule•26m ago•0 comments

Show HN: AI Agent Tool That Keeps You in the Loop

https://github.com/dshearer/misatay
2•dshearer•28m ago•0 comments

Why Every R Package Wrapping External Tools Needs a Sitrep() Function

https://drmowinckels.io/blog/2026/sitrep-functions/
1•todsacerdoti•28m ago•0 comments

Achieving Ultra-Fast AI Chat Widgets

https://www.cjroth.com/blog/2026-02-06-chat-widgets
1•thoughtfulchris•30m ago•0 comments

Show HN: Runtime Fence – Kill switch for AI agents

https://github.com/RunTimeAdmin/ai-agent-killswitch
1•ccie14019•32m ago•1 comments

Researchers surprised by the brain benefits of cannabis usage in adults over 40

https://nypost.com/2026/02/07/health/cannabis-may-benefit-aging-brains-study-finds/
2•SirLJ•34m ago•0 comments

Peter Thiel warns the Antichrist, apocalypse linked to the 'end of modernity'

https://fortune.com/2026/02/04/peter-thiel-antichrist-greta-thunberg-end-of-modernity-billionaires/
4•randycupertino•35m ago•2 comments

USS Preble Used Helios Laser to Zap Four Drones in Expanding Testing

https://www.twz.com/sea/uss-preble-used-helios-laser-to-zap-four-drones-in-expanding-testing
3•breve•40m ago•0 comments

Show HN: Animated beach scene, made with CSS

https://ahmed-machine.github.io/beach-scene/
1•ahmedoo•41m ago•0 comments

An update on unredacting select Epstein files – DBC12.pdf liberated

https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/
3•ks2048•41m ago•0 comments

Was going to share my work

1•hiddenarchitect•44m ago•0 comments

Principles for production AI agents

https://www.app.build/blog/six-principles-production-ai-agents
128•carlotasoto•6mo ago

Comments

carlotasoto•6mo ago
Practical lessons from building production agentic systems
roadside_picnic•6mo ago
Did we just give up on evaluations these days?

Over and over again, my experience building production AI tools/systems has been that evaluations are vital for improving performance.

I've also seen a lot of people proposing some variation of "LLM as critic" as a solution to this, but I've never seen empirical evidence that this works. Furthermore, I've worked with a pretty well respected researcher in this space, and in our internal experiment we found that LLMs were not good critics.

Results are always changing, so I'm very open to the possibility that someone has successfully figured out how to use "LLM as critic", but without the foundation of some basic evals to compare against, I remain skeptical.
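
For concreteness, a minimal sketch of what "basic evals to compare by" could look like: a small, fixed case set scored the same way on every run, so two prompts or models can be compared on equal footing. The cases, the run_agent stub, and the substring scoring are all hypothetical stand-ins.

    # Hypothetical sketch: a fixed, repeatable baseline eval for comparing changes.
    EVAL_CASES = [
        {"prompt": "Which SQL keyword removes duplicate rows from a SELECT?", "expected": "DISTINCT"},
        {"prompt": "Which HTTP status code means 'Not Found'?", "expected": "404"},
    ]

    def run_agent(prompt: str) -> str:
        # Stand-in for the system under test; replace with a real model/agent call.
        return "DISTINCT"

    def run_baseline_evals() -> float:
        passed = sum(
            1 for case in EVAL_CASES
            if case["expected"].lower() in run_agent(case["prompt"]).lower()
        )
        score = passed / len(EVAL_CASES)
        print(f"baseline eval score: {score:.0%} ({passed}/{len(EVAL_CASES)})")
        return score

    if __name__ == "__main__":
        run_baseline_evals()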

Aurornis•6mo ago
Evals are a core part of any up-to-date LLM team. If a team is just winging it without robust eval practices, they’re not to be trusted.

> Further more, I've worked with a pretty well respected researcher in this space and in our internal experiment we found that LLMs where not good critics

This is an idea that seems so obvious in retrospect, after using LLMs and getting so many flattering responses telling us we’re right and complimenting our inputs.

For what it’s worth, I’ve heard from some people who said they were getting better results by intentionally using different LLM models for the eval portion. Feels like having a model in the same family evaluate its own output triggers too many false positives.
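
A rough sketch of that cross-family setup, with the generator and the judge passed in as plain prompt-in/text-out callables so any provider SDK could be plugged in; all names here are illustrative, not a specific API.

    # Hypothetical sketch of cross-family judging: the generator and the judge
    # come from different model families to cut down on self-evaluation flattery.
    JUDGE_PROMPT = (
        "You are grading another model's answer.\n"
        "Question: {question}\n"
        "Answer: {answer}\n"
        "Reply with exactly PASS or FAIL, then one sentence of justification."
    )

    def judge_answer(question: str, answer: str, call_judge_model) -> bool:
        verdict = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
        return verdict.strip().upper().startswith("PASS")

    def evaluate(question: str, call_generator_model, call_judge_model) -> dict:
        answer = call_generator_model(question)
        return {
            "question": question,
            "answer": answer,
            # The judge callable should wrap a different model family than the generator.
            "passed": judge_answer(question, answer, call_judge_model),
        }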

Uehreka•6mo ago
I once asked Claude Code (Opus 4) to review a codebase I’d built, and threw in at the end of my prompt something like “No need to be nice about it.”

Now granted, you could say it was “flattering that instruction”, but it sure didn’t flatter me. It absolutely eviscerated my code, calling out numerous security issues (which were real), all manner of code smells and bad architectural decisions, and ended by saying that the codebase appeared to have been thrown together in a rush with no mind toward future maintenance (which was… half true… maybe more true than I’d like to admit).

All this to say that it is far from obvious that LLMs are intrinsically bad critics.

Herring•6mo ago
I have an idea. What if we used a third LLM to evaluate how good the secondary LLM is at critiquing the primary LLM.
colonCapitalDee•6mo ago
The problem isn't that LLMs can't be critical, it's that LLMs don't have taste. It's easy to get an LLM to give praise, and it's easy to get an LLM to give criticism, but getting an LLM to praise good things and criticize bad things is currently impossible for non-trivial inputs. That's not to say that prompting your LLM to generate criticism is useless, it's just that any LLM prompted to generate criticism is going to criticize things that are actually fine, just like how an LLM prompted to generate praise (which is effectively the default behavior) is going to praise things that are deeply not fine.
bubblyworld•6mo ago
Absolutely matches my experience - it can still be super helpful, but AIs have an extreme version of anchoring bias.
jauhar_•6mo ago
Another issue is that the behaviour of the LLMs is not very consistent.
sudhirb•6mo ago
For coding agents, evaluations are tricky - thorough evaluation tasks tend to be slow and/or expensive and/or display a high degree of variance over N attempts. You could run a whole benchmark like SWE-bench or Terminal-Bench against a coding agent on every change, but it quickly becomes infeasible.
roadside_picnic•6mo ago
I used to own the eval suite for a coding agent; it's certainly doable, even when it requires SQL + tables, etc. We even had support for a wide range of data options, ranging from canned CSV data to plugging into prod to simulate the user experience, all easily configurable at eval run time. It also supported agentic flows where the results from one eval could be chained to the next (with a known correct answer being an optional send to check the framework end to end in the case of node failure).

Interestingly enough, we started with hundreds of evals, but after that experience my advice has become: fewer evals, tied more closely to specific features and product ambitions.

By that I mean: some evals should serve as a warning ("uh oh, that eval failed, don't push to prod"), others as a milestone ("woohoo! we got it to work!"), and all should be informed by the product roadmap. You should basically be able to understand where the product is going just by looking over the eval suite.

And, if you don't have evals, you really don't know if you're moving the needle at all. There were multiple situations where a tweak to a prompt passed an initial vibe check, but when run against the full eval suite, clearly performed worse.

The other piece of advice would be: evals don't have to be sophisticated, just repeatable and agnostic to who's running them. Heck, even "vibe checks" can be good evals, if they're written down and multiple people have to reach consensus on whether they passed or not.
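
As a sketch of that warning/milestone split (hypothetical names; the checks are stubs standing in for real agent runs):

    # Hypothetical sketch: warning evals gate a deploy, milestone evals track
    # roadmap progress, and every check is a plain repeatable function.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Eval:
        name: str
        check: Callable[[], bool]  # repeatable, agnostic to who runs it
        severity: str              # "warning" (blocks prod) or "milestone" (progress marker)

    def run_suite(evals: list[Eval]) -> bool:
        ok_to_ship = True
        for e in evals:
            passed = e.check()
            print(f"[{e.severity}] {e.name}: {'PASS' if passed else 'FAIL'}")
            if e.severity == "warning" and not passed:
                ok_to_ship = False  # "uh oh, that eval failed, don't push to prod"
        return ok_to_ship

    # Stub checks standing in for real agent runs against canned data or prod.
    suite = [
        Eval("answers SQL questions over the canned CSV dataset", lambda: True, "warning"),
        Eval("chains two eval nodes end to end", lambda: False, "milestone"),
    ]

    if __name__ == "__main__":
        print("safe to ship:", run_suite(suite))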

criemen•6mo ago
Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset.

https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and, by extension, anything that's based on public source code) is most likely overestimating your agent's actual capabilities.

simonw•6mo ago
This is the best guide I've seen to the LLM-as-judge pattern: https://hamel.dev/blog/posts/llm-judge/index.html
glial•6mo ago
This is fantastic, thank you for sharing.
edmundsauto•6mo ago
Hamel has a ton of great and free content on YouTube. He and Shreya Shankar are a breath of fresh air.
abhgh•6mo ago
Evals somehow seem to be very, very underrated, which is concerning in a world where we are moving (or trying to move) towards systems with more autonomy.

Your skepticism of "llm-as-a-judge" setups is spot on. If your LLM can make mistakes/hallucinate, then of course your judge LLM can too. In practice, you need to validate your judges and possibly adapt them to your task based on sample annotated data. You might adapt them by trial and error, or prompt optimization, e.g., using DSPy [1], or by learning a small correction model on top of their outputs, e.g., LLM-Rubric [2] or Prediction-Powered Inference [3].

In the end, using the LLM as a judge confers just these benefits:

1. It is easy to express complex evaluation criteria. This does not guarantee correctness.

2. Seen as a model, it is easy to "train", i.e., you get all the benefits of in-context learning, e.g., prompt based, few-shot.

But you still need to evaluate and adapt them. I have notes from a NeurIPS workshop from last year [4]. Btw, love your username!

[1] https://dspy.ai/

[2] https://aclanthology.org/2024.acl-long.745/

[3] https://www.youtube.com/watch?v=TlFpVpFx7JY

[4] https://blog.quipu-strands.com/eval-llms
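
A minimal sketch of that judge-validation step, assuming you have a handful of human PASS/FAIL labels to compare against; the helper and the example numbers are illustrative.

    # Hypothetical sketch: compare the judge's verdicts to human labels and look
    # at the error breakdown, not just raw agreement, before trusting it at scale.
    def judge_agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> dict:
        assert len(judge_verdicts) == len(human_labels) > 0
        pairs = list(zip(judge_verdicts, human_labels))
        tp = sum(j and h for j, h in pairs)
        tn = sum(not j and not h for j, h in pairs)
        fp = sum(j and not h for j, h in pairs)
        fn = sum(not j and h for j, h in pairs)
        return {
            "agreement": (tp + tn) / len(pairs),
            # False positives are the flattery failure mode: the judge passes bad outputs.
            "false_positive_rate": fp / max(fp + tn, 1),
            "false_negative_rate": fn / max(fn + tp, 1),
        }

    # e.g. judge_agreement([True, True, False, True], [True, False, False, True])
    # -> {'agreement': 0.75, 'false_positive_rate': 0.5, 'false_negative_rate': 0.0}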

prats226•6mo ago
I see that in tool calling, we usually specify just the inputs to functions and not what typed output is expected from the function.

In DSL-style agents, giving LLMs info about what structured inputs are needed to call functions, as well as what outputs are expected, would probably result in better planning?
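
A rough sketch of that idea as a JSON-Schema-style tool definition; the "returns" block is the hypothetical addition, not part of any provider's actual tool-calling API.

    # Hypothetical sketch: declare the expected output type alongside the usual
    # input schema, so a planning LLM can reason about what each call will give it.
    get_weather_tool = {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {  # inputs, as in ordinary tool calling
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "returns": {  # the proposed addition: typed output the planner can rely on
            "type": "object",
            "properties": {
                "temperature_c": {"type": "number"},
                "conditions": {"type": "string"},
            },
        },
    }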

SrslyJosh•6mo ago
"Don't."
lacoolj•6mo ago
Always hard to take an article seriously when it has typos, some of which are repeated ("promt" in the graphic on Principle 2)
henriquegodoy•6mo ago
I've been tinkering with agentic systems for a while now, and this post nails some key pain points that hit close to home. The emphasis on splitting context and designing tight feedback loops feels spot on—I've seen agents go off the rails without them, hallucinating solutions because the prompt was too bloated or the validation was half-baked. It's like building a machine where every part needs to click just right, or else you're debugging forever.
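
A minimal sketch of the kind of tight feedback loop being described, with the model call and the validator left abstract (all names are hypothetical):

    # Hypothetical sketch: generate, validate, and feed only the latest validation
    # error back into the next attempt, instead of accepting the first output or
    # letting the context balloon.
    def generate_with_feedback(task: str, call_model, validate, max_attempts: int = 3) -> str:
        feedback = ""
        for _ in range(max_attempts):
            output = call_model(task + feedback)
            error = validate(output)  # e.g. run tests, lint, or a schema check; None if OK
            if error is None:
                return output
            feedback = f"\n\nYour previous attempt failed validation: {error}\nPlease fix it."
        raise RuntimeError(f"no valid output after {max_attempts} attempts")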

What really resonates is the bit about frustrating behaviors signaling deeper system issues, not just model quirks. In my own experiments, I've had agents stubbornly ignore tools because I forgot to expose the right APIs, and it made me rethink how we treat these as "intelligent" when they're really just following our flawed setups. It pushes us toward more robust orchestration, where humans handle the high-level intentions and AI fills in the execution gaps seamlessly.

This ties into broader ideas on how AI interfaces will evolve as models get smarter. I extrapolate more of this thinking and dive deeper into human–AI interfaces on my blog if anyone’s interested in checking it out: https://henriquegodoy.com/blog/stream-of-consciousness