RAG Eval Comparing Vertex/Bedrock/Azure/OpenAI

https://github.com/colon-md/retrievalci

2•colon-md•1h ago

Comments

colon-md•1h ago

Last week, I read Karpathy's gist on building a personal wiki LLM (https://gist.github.com/karpathy/442a6bf555914893e9891c11519...) and decided to try it.

The RAG pitch is: take your own corpus of docs, layer an LLM over it, and get a thing that answers questions grounded in your stuff. The wiki+RAG hybrid is the interesting architectural variant.

So I started building the "traditional" retrieval architectures (pure dense, BM25, hybrid RRF, rerank) to pit against the wiki+RAG variant with structure layered over the chunks.
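For anyone unfamiliar with the "hybrid RRF" baseline, here is a minimal sketch of reciprocal rank fusion, which merges a dense ranking and a BM25 ranking by summing 1/(k + rank) per document. The function name `rrf_fuse`, the doc IDs, and k=60 (a commonly used default) are illustrative assumptions and are not taken from the repo.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each ranking is ordered best-first. k=60 is an assumed default,
    not a value confirmed by the repo.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retriever outputs, best match first.
dense = ["doc3", "doc1", "doc7"]
bm25 = ["doc1", "doc9", "doc3"]

fused = rrf_fuse([dense, bm25])
print(fused)
```

Documents that appear in both lists accumulate two terms, so they rise above documents that only one retriever found.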

After a few days of code cleanup I have an eval testbench and a wiki LLM that is only 50% built. I'm releasing the testbench now because I think it is just as valuable as the RAG design itself.

What the repo does: runs four hosted RAG services against identical inputs (same 81-doc enterprise corpus, same 50 questions stratified across single-hop / multi-hop / contradiction / unanswerable, same retrieve-only scoring of 0.7×recall + 0.3×precision):

  - Azure AI Search: 84.0  (recall 90.9%, precision 67.8%)
  - Vertex AI RAG Engine: 82.6  (94.5%, 54.7%)
  - Bedrock Knowledge Bases: 82.5  (87.9%, 70.1%)
  - OpenAI File Search: 78.5  (89.3%, 53.4%)
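The composite score can be recomputed from the listed recall and precision. This is a minimal sketch assuming the score is simply 0.7×recall + 0.3×precision on the percentage values; the name `composite_score` is hypothetical, and because the inputs here are the rounded percentages from the list, the result can differ from the reported score by up to about 0.1.

```python
def composite_score(recall_pct, precision_pct, w_recall=0.7, w_precision=0.3):
    """Weighted retrieve-only score on percentage inputs (assumed formula)."""
    return w_recall * recall_pct + w_precision * precision_pct


# (recall %, precision %, reported score) copied from the list above.
reported = {
    "Azure AI Search": (90.9, 67.8, 84.0),
    "Vertex AI RAG Engine": (94.5, 54.7, 82.6),
    "Bedrock Knowledge Bases": (87.9, 70.1, 82.5),
    "OpenAI File Search": (89.3, 53.4, 78.5),
}

recomputed = {
    name: composite_score(recall, precision)
    for name, (recall, precision, _) in reported.items()
}

for name, (_, _, score) in reported.items():
    print(f"{name}: reported {score}, recomputed {recomputed[name]:.2f}")
```

The 0.7 weight on recall explains why Vertex ranks second despite the lowest-but-one precision: its recall lead outweighs the precision gap.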

Here's a surprising finding (maybe not a surprise to you): all four major RAG services hallucinate on every unanswerable question. That's 0/5 abstention correctness across the board. I was sort of expecting enterprise RAG providers like GCP, AWS, Azure, and OpenAI to respond "I don't know" to unanswerable questions.


