frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

It's time for the world to boycott the US

https://www.aljazeera.com/opinions/2026/2/5/its-time-for-the-world-to-boycott-the-us
1•HotGarbage•34s ago•0 comments

Show HN: Semantic Search for terminal commands in the Browser (No Back end)

https://jslambda.github.io/tldr-vsearch/
1•jslambda•38s ago•0 comments

The AI CEO Experiment

https://yukicapital.com/blog/the-ai-ceo-experiment/
2•romainsimon•2m ago•0 comments

Speed up responses with fast mode

https://code.claude.com/docs/en/fast-mode
2•surprisetalk•5m ago•0 comments

MS-DOS game copy protection and cracks

https://www.dosdays.co.uk/topics/game_cracks.php
2•TheCraiggers•6m ago•0 comments

Updates on GNU/Hurd progress [video]

https://fosdem.org/2026/schedule/event/7FZXHF-updates_on_gnuhurd_progress_rump_drivers_64bit_smp_...
2•birdculture•7m ago•0 comments

Epstein took a photo of his 2015 dinner with Zuckerberg and Musk

https://xcancel.com/search?f=tweets&q=davenewworld_2%2Fstatus%2F2020128223850316274
5•doener•7m ago•2 comments

MyFlames: Visualize MySQL query execution plans as interactive FlameGraphs

https://github.com/vgrippa/myflames
1•tanelpoder•9m ago•0 comments

Show HN: LLM of Babel

https://clairefro.github.io/llm-of-babel/
1•marjipan200•9m ago•0 comments

A modern iperf3 alternative with a live TUI, multi-client server, QUIC support

https://github.com/lance0/xfr
3•tanelpoder•10m ago•0 comments

Famfamfam Silk icons – also with CSS spritesheet

https://github.com/legacy-icons/famfamfam-silk
1•thunderbong•11m ago•0 comments

Apple is the only Big Tech company whose capex declined last quarter

https://sherwood.news/tech/apple-is-the-only-big-tech-company-whose-capex-declined-last-quarter/
2•elsewhen•14m ago•0 comments

Reverse-Engineering Raiders of the Lost Ark for the Atari 2600

https://github.com/joshuanwalker/Raiders2600
2•todsacerdoti•15m ago•0 comments

Show HN: Deterministic NDJSON audit logs – v1.2 update (structural gaps)

https://github.com/yupme-bot/kernel-ndjson-proofs
1•Slaine•19m ago•0 comments

The Greater Copenhagen Region could be your friend's next career move

https://www.greatercphregion.com/friend-recruiter-program
2•mooreds•19m ago•0 comments

Do Not Confirm – Fiction by OpenClaw

https://thedailymolt.substack.com/p/do-not-confirm
1•jamesjyu•20m ago•0 comments

The Analytical Profile of Peas

https://www.fossanalytics.com/en/news-articles/more-industries/the-analytical-profile-of-peas
1•mooreds•20m ago•0 comments

Hallucinations in GPT5 – Can models say "I don't know" (June 2025)

https://jobswithgpt.com/blog/llm-eval-hallucinations-t20-cricket/
1•sp1982•20m ago•0 comments

What AI is good for, according to developers

https://github.blog/ai-and-ml/generative-ai/what-ai-is-actually-good-for-according-to-developers/
1•mooreds•20m ago•0 comments

OpenAI might pivot to the "most addictive digital friend" or face extinction

https://twitter.com/lebed2045/status/2020184853271167186
1•lebed2045•21m ago•2 comments

Show HN: Know how your SaaS is doing in 30 seconds

https://anypanel.io
1•dasfelix•22m ago•0 comments

ClawdBot Ordered Me Lunch

https://nickalexander.org/drafts/auto-sandwich.html
3•nick007•23m ago•0 comments

What the News media thinks about your Indian stock investments

https://stocktrends.numerical.works/
1•mindaslab•24m ago•0 comments

Running Lua on a tiny console from 2001

https://ivie.codes/page/pokemon-mini-lua
1•Charmunk•24m ago•0 comments

Google and Microsoft Paying Creators $500K+ to Promote AI Tools

https://www.cnbc.com/2026/02/06/google-microsoft-pay-creators-500000-and-more-to-promote-ai.html
3•belter•27m ago•0 comments

New filtration technology could be game-changer in removal of PFAS

https://www.theguardian.com/environment/2026/jan/23/pfas-forever-chemicals-filtration
1•PaulHoule•28m ago•0 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
2•momciloo•28m ago•0 comments

Kinda Surprised by Seadance2's Moderation

https://seedanceai.me/
1•ri-vai•28m ago•2 comments

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
2•valyala•28m ago•1 comments

Django scales. Stop blaming the framework (part 1 of 3)

https://medium.com/@tk512/django-scales-stop-blaming-the-framework-part-1-of-3-a2b5b0ff811f
2•sgt•29m ago•0 comments
Open in hackernews

Ask HN: Who is honestly evaluating AI outputs and how?

2•toddmorey•1mo ago
Especially with multimodal AI conversations, evaluating and benchmarking these models is an increasingly complex topic, but a frustrating interaction with AI can really leave customers feeling sour about your whole product / service.

For an in-product AI assistant (with grounding, doc retrieval, and tool calling) I'm having a hard time wrapping my head around how to evaluate and monitor its success with customer interactions, prompt adherence, correctness and appropriateness, etc.

Any tips or resources that have been helpful to folks investing this challenge? Would love to learn. What does your stack / process look like?

Comments

helain•1mo ago
Before everything i want to tell you that i am working on a RAG project and you can check https://www.ailog.fr and our app https://app.ailog.fr/ . You can check it out if you want a production-ready RAG ( we have an API and we can scale to enterprise level if necessary ).

Next for the feedback part :

Evaluate LLM systems as three separate layers: model, retrieval or grounding, and tools. Measure each with automated tests plus continuous human sampling. A single accuracy metric hides user frustration. Instrument failures, not just averages.

Practical framework you can implement quickly:

Human in the loop: Review 1 to 5 percent of production sessions for correctness, safety, and helpfulness. Train a lightweight risk flagger.

Synthetic tests: 100 to 500 canned conversations covering happy paths, edge cases, adversarial prompts, and multimodal failures. Run on every change.

Retrieval and hallucinations: Track precision at k, MRR, and grounding coverage. Use entailment checks against retrieved documents.

Tools and integrations: Validate schemas, assert idempotency, run end to end failure simulations. Track tool call and rollback rates.

Telemetry and drift: Log embeddings, latency, feedback, and escalations. Alert on drift, hallucination spikes, and tool failures.

Weekly metrics: correctness, hallucination rate, retrieval precision at 5 and MRR, tool success rate, CSAT, latency, escalation rate. Pilot plan: one week to wire logging, two weeks to build a 100 scenario suite, then nightly synthetic tests and daily human review.

You can check out https://app.ailog.fr/en/tools to get some insight on way to evaluate your RAG, we have free tools here for you to check and use :)