frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Introduce the Vouch/Denouncement Contribution Model

https://github.com/ghostty-org/ghostty/pull/10559
1•DustinEchoes•41s ago•0 comments

Show HN: SSHcode – Always-On Claude Code/OpenCode over Tailscale and Hetzner

https://github.com/sultanvaliyev/sshcode
1•sultanvaliyev•55s ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/microsoft-appointed-a-quality-czar-he-has-no-direct-reports-and-no-b...
1•RickJWagner•2m ago•0 comments

Multi-agent coordination on Claude Code: 8 production pain points and patterns

https://gist.github.com/sigalovskinick/6cc1cef061f76b7edd198e0ebc863397
1•nikolasi•3m ago•0 comments

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

https://www.nytimes.com/2026/02/07/technology/washington-post-will-lewis.html
1•jbegley•3m ago•0 comments

DevXT – Building the Future with AI That Acts

https://devxt.com
2•superpecmuscles•4m ago•2 comments

A Minimal OpenClaw Built with the OpenCode SDK

https://github.com/CefBoud/MonClaw
1•cefboud•4m ago•0 comments

The silent death of Good Code

https://amit.prasad.me/blog/rip-good-code
2•amitprasad•5m ago•0 comments

The Internal Negotiation You Have When Your Heart Rate Gets Uncomfortable

https://www.vo2maxpro.com/blog/internal-negotiation-heart-rate
1•GoodluckH•6m ago•0 comments

Show HN: Glance – Fast CSV inspection for the terminal (SIMD-accelerated)

https://github.com/AveryClapp/glance
2•AveryClapp•7m ago•0 comments

Busy for the Next Fifty to Sixty Bud

https://pestlemortar.substack.com/p/busy-for-the-next-fifty-to-sixty-had-all-my-money-in-bitcoin-...
1•mithradiumn•8m ago•0 comments

Imperative

https://pestlemortar.substack.com/p/imperative
1•mithradiumn•9m ago•0 comments

Show HN: I decomposed 87 tasks to find where AI agents structurally collapse

https://github.com/XxCotHGxX/Instruction_Entropy
1•XxCotHGxX•13m ago•1 comments

I went back to Linux and it was a mistake

https://www.theverge.com/report/875077/linux-was-a-mistake
1•timpera•14m ago•1 comments

Octrafic – open-source AI-assisted API testing from the CLI

https://github.com/Octrafic/octrafic-cli
1•mbadyl•15m ago•1 comments

US Accuses China of Secret Nuclear Testing

https://www.reuters.com/world/china/trump-has-been-clear-wanting-new-nuclear-arms-control-treaty-...
2•jandrewrogers•16m ago•1 comments

Peacock. A New Programming Language

1•hashhooshy•21m ago•1 comments

A postcard arrived: 'If you're reading this I'm dead, and I really liked you'

https://www.washingtonpost.com/lifestyle/2026/02/07/postcard-death-teacher-glickman/
2•bookofjoe•22m ago•1 comments

What to know about the software selloff

https://www.morningstar.com/markets/what-know-about-software-stock-selloff
2•RickJWagner•26m ago•0 comments

Show HN: Syntux – generative UI for websites, not agents

https://www.getsyntux.com/
3•Goose78•26m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/ab75cef97954
2•birdculture•27m ago•0 comments

AI overlay that reads anything on your screen (invisible to screen capture)

https://lowlighter.app/
1•andylytic•28m ago•1 comments

Show HN: Seafloor, be up and running with OpenClaw in 20 seconds

https://seafloor.bot/
1•k0mplex•28m ago•0 comments

Tesla turbine-inspired structure generates electricity using compressed air

https://techxplore.com/news/2026-01-tesla-turbine-generates-electricity-compressed.html
2•PaulHoule•30m ago•0 comments

State Department deleting 17 years of tweets (2009-2025); preservation needed

https://www.npr.org/2026/02/07/nx-s1-5704785/state-department-trump-posts-x
3•sleazylice•30m ago•1 comments

Learning to code, or building side projects with AI help, this one's for you

https://codeslick.dev/learn
1•vitorlourenco•30m ago•0 comments

Effulgence RPG Engine [video]

https://www.youtube.com/watch?v=xFQOUe9S7dU
1•msuniverse2026•32m ago•0 comments

Five disciplines discovered the same math independently – none of them knew

https://freethemath.org
4•energyscholar•32m ago•1 comments

We Scanned an AI Assistant for Security Issues: 12,465 Vulnerabilities

https://codeslick.dev/blog/openclaw-security-audit
1•vitorlourenco•33m ago•0 comments

Amazon no longer defend cloud customers against video patent infringement claims

https://ipfray.com/amazon-no-longer-defends-cloud-customers-against-video-patent-infringement-cla...
2•ffworld•34m ago•0 comments
Open in hackernews

Ask HN: Who is honestly evaluating AI outputs and how?

2•toddmorey•1mo ago
Especially with multimodal AI conversations, evaluating and benchmarking these models is an increasingly complex topic, but a frustrating interaction with AI can really leave customers feeling sour about your whole product / service.

For an in-product AI assistant (with grounding, doc retrieval, and tool calling) I'm having a hard time wrapping my head around how to evaluate and monitor its success with customer interactions, prompt adherence, correctness and appropriateness, etc.

Any tips or resources that have been helpful to folks investing this challenge? Would love to learn. What does your stack / process look like?

Comments

helain•1mo ago
Before everything i want to tell you that i am working on a RAG project and you can check https://www.ailog.fr and our app https://app.ailog.fr/ . You can check it out if you want a production-ready RAG ( we have an API and we can scale to enterprise level if necessary ).

Next for the feedback part :

Evaluate LLM systems as three separate layers: model, retrieval or grounding, and tools. Measure each with automated tests plus continuous human sampling. A single accuracy metric hides user frustration. Instrument failures, not just averages.

Practical framework you can implement quickly:

Human in the loop: Review 1 to 5 percent of production sessions for correctness, safety, and helpfulness. Train a lightweight risk flagger.

Synthetic tests: 100 to 500 canned conversations covering happy paths, edge cases, adversarial prompts, and multimodal failures. Run on every change.

Retrieval and hallucinations: Track precision at k, MRR, and grounding coverage. Use entailment checks against retrieved documents.

Tools and integrations: Validate schemas, assert idempotency, run end to end failure simulations. Track tool call and rollback rates.

Telemetry and drift: Log embeddings, latency, feedback, and escalations. Alert on drift, hallucination spikes, and tool failures.

Weekly metrics: correctness, hallucination rate, retrieval precision at 5 and MRR, tool success rate, CSAT, latency, escalation rate. Pilot plan: one week to wire logging, two weeks to build a 100 scenario suite, then nightly synthetic tests and daily human review.

You can check out https://app.ailog.fr/en/tools to get some insight on way to evaluate your RAG, we have free tools here for you to check and use :)