Most existing benchmarks focus on synthetic or short-form QA data. That didn’t reflect what we were seeing in production, so we built our own to test our hallucination detectors and decided to open-source it.
The dataset includes 6,500 examples across QA, summarization, and NLI tasks. We added distractor documents, shuffled the context, and removed assumptions about format (like requiring a question) to better reflect real-world conditions.
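If you want to poke around, the dataset loads straight from the Hugging Face Hub with the `datasets` library. This is just a quick inspection sketch; the split and column names are whatever the dataset card defines, so the code prints them rather than assuming a schema:

    from datasets import load_dataset

    # Pull HalluMix from the Hugging Face Hub; printing the DatasetDict shows
    # the available splits, example counts, and column names
    ds = load_dataset("quotient-ai/hallumix")
    print(ds)

    # Peek at one example from the first split and list its fields,
    # rather than hard-coding field names that may differ
    first_split = next(iter(ds.values()))
    print(first_split[0].keys())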
We ran 7 detection systems on it, covering both open-source models and commercial APIs. While some performed well on shorter examples, even the best struggled with long-form content and multi-document grounding -- precisely where hallucinations tend to be most harmful.
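For anyone who wants to benchmark their own detector, the scoring itself is just binary classification against the dataset's labels. A minimal sketch, assuming each example carries the context documents, a candidate response, and a binary hallucination label (the field names "documents", "response", and "is_hallucination" below are placeholders, as is `my_detector` -- swap in your own system and the real schema from the dataset card):

    from datasets import load_dataset
    from sklearn.metrics import precision_recall_fscore_support

    def my_detector(documents, response) -> bool:
        # Hypothetical stand-in for an actual detection system:
        # return True if the response is judged hallucinated
        return False

    ds = load_dataset("quotient-ai/hallumix")
    split = next(iter(ds.values()))

    y_true, y_pred = [], []
    for ex in split:
        # Assumed field names, for illustration only
        y_true.append(bool(ex["is_hallucination"]))
        y_pred.append(my_detector(ex["documents"], ex["response"]))

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")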
Would love feedback, especially from anyone working on evals, hallucination detection, or RAG.
Links:
- HF Dataset: https://huggingface.co/datasets/quotient-ai/hallumix
- HF Blog: https://huggingface.co/blog/quotientai/hallumix
- Internal Blog: https://blog.quotientai.co/introducing-hallumix-a-task-agnos...
- Paper: https://arxiv.org/abs/2505.00506