frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Ask HN: How much of your token use is fixing the bugs Claude Code causes?

1•laurex•2m ago•0 comments

Show HN: Agents – Sync MCP Configs Across Claude, Cursor, Codex Automatically

https://github.com/amtiYo/agents
1•amtiyo•2m ago•0 comments

Hello

1•otrebladih•4m ago•0 comments

FSD helped save my father's life during a heart attack

https://twitter.com/JJackBrandt/status/2019852423980875794
1•blacktulip•6m ago•0 comments

Show HN: Writtte – Draft and publish articles without reformatting, anywhere

https://writtte.xyz
1•lasgawe•8m ago•0 comments

Portuguese icon (FROM A CAN) makes a simple meal (Canned Fish Files) [video]

https://www.youtube.com/watch?v=e9FUdOfp8ME
1•zeristor•10m ago•0 comments

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

https://www.hpcwire.com/off-the-wire/brookhaven-labs-rhic-concludes-25-year-run-with-final-collis...
2•gnufx•12m ago•0 comments

Transcribe your aunts post cards with Gemini 3 Pro

https://leserli.ch/ocr/
1•nielstron•16m ago•0 comments

.72% Variance Lance

1•mav5431•17m ago•0 comments

ReKindle – web-based operating system designed specifically for E-ink devices

https://rekindle.ink
1•JSLegendDev•19m ago•0 comments

Encrypt It

https://encryptitalready.org/
1•u1hcw9nx•19m ago•1 comments

NextMatch – 5-minute video speed dating to reduce ghosting

https://nextmatchdating.netlify.app/
1•Halinani8•20m ago•1 comments

Personalizing esketamine treatment in TRD and TRBD

https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1736114
1•PaulHoule•21m ago•0 comments

SpaceKit.xyz – a browser‑native VM for decentralized compute

https://spacekit.xyz
1•astorrivera•22m ago•0 comments

NotebookLM: The AI that only learns from you

https://byandrev.dev/en/blog/what-is-notebooklm
1•byandrev•22m ago•1 comments

Show HN: An open-source starter kit for developing with Postgres and ClickHouse

https://github.com/ClickHouse/postgres-clickhouse-stack
1•saisrirampur•23m ago•0 comments

Game Boy Advance d-pad capacitor measurements

https://gekkio.fi/blog/2026/game-boy-advance-d-pad-capacitor-measurements/
1•todsacerdoti•23m ago•0 comments

South Korean crypto firm accidentally sends $44B in bitcoins to users

https://www.reuters.com/world/asia-pacific/crypto-firm-accidentally-sends-44-billion-bitcoins-use...
2•layer8•24m ago•0 comments

Apache Poison Fountain

https://gist.github.com/jwakely/a511a5cab5eb36d088ecd1659fcee1d5
1•atomic128•26m ago•2 comments

Web.whatsapp.com appears to be having issues syncing and sending messages

http://web.whatsapp.com
1•sabujp•26m ago•2 comments

Google in Your Terminal

https://gogcli.sh/
1•johlo•27m ago•0 comments

Shannon: Claude Code for Pen Testing: #1 on Github today

https://github.com/KeygraphHQ/shannon
1•hendler•28m ago•0 comments

Anthropic: Latest Claude model finds more than 500 vulnerabilities

https://www.scworld.com/news/anthropic-latest-claude-model-finds-more-than-500-vulnerabilities
2•Bender•32m ago•0 comments

Brooklyn cemetery plans human composting option, stirring interest and debate

https://www.cbsnews.com/newyork/news/brooklyn-green-wood-cemetery-human-composting/
1•geox•32m ago•0 comments

Why the 'Strivers' Are Right

https://greyenlightenment.com/2026/02/03/the-strivers-were-right-all-along/
1•paulpauper•34m ago•0 comments

Brain Dumps as a Literary Form

https://davegriffith.substack.com/p/brain-dumps-as-a-literary-form
1•gmays•34m ago•0 comments

Agentic Coding and the Problem of Oracles

https://epkconsulting.substack.com/p/agentic-coding-and-the-problem-of
1•qingsworkshop•35m ago•0 comments

Malicious packages for dYdX cryptocurrency exchange empties user wallets

https://arstechnica.com/security/2026/02/malicious-packages-for-dydx-cryptocurrency-exchange-empt...
1•Bender•35m ago•0 comments

Show HN: I built a <400ms latency voice agent that runs on a 4gb vram GTX 1650"

https://github.com/pheonix-delta/axiom-voice-agent
1•shubham-coder•36m ago•0 comments

Penisgate erupts at Olympics; scandal exposes risks of bulking your bulge

https://arstechnica.com/health/2026/02/penisgate-erupts-at-olympics-scandal-exposes-risks-of-bulk...
4•Bender•36m ago•0 comments
Open in hackernews

Ask HN: Who is honestly evaluating AI outputs and how?

2•toddmorey•1mo ago
Especially with multimodal AI conversations, evaluating and benchmarking these models is an increasingly complex topic, but a frustrating interaction with AI can really leave customers feeling sour about your whole product / service.

For an in-product AI assistant (with grounding, doc retrieval, and tool calling) I'm having a hard time wrapping my head around how to evaluate and monitor its success with customer interactions, prompt adherence, correctness and appropriateness, etc.

Any tips or resources that have been helpful to folks investing this challenge? Would love to learn. What does your stack / process look like?

Comments

helain•1mo ago
Before everything i want to tell you that i am working on a RAG project and you can check https://www.ailog.fr and our app https://app.ailog.fr/ . You can check it out if you want a production-ready RAG ( we have an API and we can scale to enterprise level if necessary ).

Next for the feedback part :

Evaluate LLM systems as three separate layers: model, retrieval or grounding, and tools. Measure each with automated tests plus continuous human sampling. A single accuracy metric hides user frustration. Instrument failures, not just averages.

Practical framework you can implement quickly:

Human in the loop: Review 1 to 5 percent of production sessions for correctness, safety, and helpfulness. Train a lightweight risk flagger.

Synthetic tests: 100 to 500 canned conversations covering happy paths, edge cases, adversarial prompts, and multimodal failures. Run on every change.

Retrieval and hallucinations: Track precision at k, MRR, and grounding coverage. Use entailment checks against retrieved documents.

Tools and integrations: Validate schemas, assert idempotency, run end to end failure simulations. Track tool call and rollback rates.

Telemetry and drift: Log embeddings, latency, feedback, and escalations. Alert on drift, hallucination spikes, and tool failures.

Weekly metrics: correctness, hallucination rate, retrieval precision at 5 and MRR, tool success rate, CSAT, latency, escalation rate. Pilot plan: one week to wire logging, two weeks to build a 100 scenario suite, then nightly synthetic tests and daily human review.

You can check out https://app.ailog.fr/en/tools to get some insight on way to evaluate your RAG, we have free tools here for you to check and use :)