LLMs ace bar exams, but even the best gets 1 in 12 local queries wrong

https://voygr-tech.github.io/llm-local-search-benchmark-report/

4•yamarkov•1h ago

Comments

yamarkov•1h ago

VOYGR team here. We built this because we kept running into the same problem: LLMs confidently recommending places that turned out to be closed, fabricated, or in the wrong neighborhood. We wanted to measure how bad it actually is.

Setup: 345 prompts across 50+ cities, 5 task types (discovery, place details, navigation, booking, sharing), each run across ChatGPT, Gemini, Claude, and Perplexity with search ON and OFF. 2,415 total evaluated responses. Every recommended place was verified against Google Search and Maps.
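For anyone checking the count: 4 providers with search on and off would be 8 configurations, but the total matches 7 (345 × 7 = 2,415). Here is a quick sketch of one breakdown that fits. It assumes Perplexity is counted once because it has no search-off mode; that part is an assumption, not something stated above.

```python
# Sanity check on the response count.
# Assumption: Perplexity always searches, so it contributes a single config.
PROMPTS = 345

configs = [
    (model, search)
    for model in ("ChatGPT", "Gemini", "Claude")
    for search in ("on", "off")
] + [("Perplexity", "on")]

total_responses = PROMPTS * len(configs)
print(len(configs), total_responses)
```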

What surprised us:

1. Search makes booking tasks worse. Enabling web search improved discovery by ~8 points but hurt transactional tasks. Claude and Gemini both lost 5+ points on "help me book a table" prompts. Models switched from giving step-by-step advice to quoting search snippets.

2. Every model confidently books you a table at a closed restaurant. We tested a permanently closed Buenos Aires restaurant. All 7 configs gave booking guidance and seating tips. Even the search-equipped models didn't catch it.

3. The real gap is constraint matching. Models find real places but ignore parts of the prompt: price range, neighborhood, cuisine type. Ask for "affordable rooftop bars in Gangnam" and you get champagne lounges with $30 cocktails. The best and worst providers are 16 points apart on constraint matching.

The full methodology is in the report. We're planning to open-source the benchmark repo (all 345 prompts, evaluation pipeline, and raw results) in the coming weeks.

We also built a *Business Validation API* for AI developers and agents, to catch these failures before they reach production. Pass in a place name and address from any LLM response and get back existence verification and operating status. These are the same checks that would have caught the worst failures we saw in this benchmark, such as the closed-restaurant case. Link is in the report if you want to try it.
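To make the idea concrete, here is a minimal sketch of how a caller might filter LLM recommendations on those two signals. Everything in it is hypothetical: `PlaceCheck`, `keep_verified`, the status strings, and the stub validator are illustrative names, not the actual API, which is documented in the report.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Place = Tuple[str, str]  # (name, address) as extracted from an LLM response


@dataclass(frozen=True)
class PlaceCheck:
    """Hypothetical result shape: the two signals described above."""
    exists: bool
    status: str  # illustrative values: "open", "closed_permanently", "unknown"


def keep_verified(places: List[Place],
                  validate: Callable[[str, str], PlaceCheck]) -> List[Place]:
    """Keep only places that exist and are currently operating."""
    kept = []
    for name, address in places:
        check = validate(name, address)
        if check.exists and check.status == "open":
            kept.append((name, address))
    return kept


def fake_validate(name: str, address: str) -> PlaceCheck:
    """Stub standing in for a real validation call (made-up data)."""
    known = {
        ("Cafe Real", "1 Main St"): PlaceCheck(True, "open"),
        ("Old Bistro", "2 Side St"): PlaceCheck(True, "closed_permanently"),
    }
    # Anything not in the lookup is treated as a fabricated place.
    return known.get((name, address), PlaceCheck(False, "unknown"))


recommended = [
    ("Cafe Real", "1 Main St"),
    ("Old Bistro", "2 Side St"),
    ("Made Up Grill", "3 Nowhere Ave"),
]
print(keep_verified(recommended, fake_validate))
```

Passing the validator in as a function keeps the filter independent of any particular service, so the stub can be swapped for a real client without touching the filtering logic.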

Happy to answer questions about methodology or anything else.
