In vision-language systems, hallucination often means inventing objects, attributes, or actions that are not present in the image at all.
Examples:
- Describing people who don’t exist
- Inferring actions that never occurred
- Assigning attributes unsupported by visual evidence
As these models are increasingly used for e-commerce listings, accessibility captions, document extraction, and medical imaging, the consequences escalate quickly.
Most evaluation pipelines are still text-centric. They don’t verify whether the generated description is actually grounded in the image.
Detecting image hallucination requires multimodal evaluation that reasons over both the image and the output jointly.
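As a rough illustration of what "joint" checking can look like, here is a minimal sketch that scores candidate caption phrases against the image with CLIP similarity and flags low-scoring ones. The model choice, the hard-coded phrases (which would normally come from a noun-phrase extractor over the caption), the placeholder image path, and the 0.20 threshold are all illustrative assumptions, not a production recipe.

```python
# Sketch: flag caption phrases with weak visual support, using CLIP similarity
# as a (rough) proxy for grounding. Thresholds and phrase extraction are
# assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def grounding_scores(image: Image.Image, phrases: list[str]) -> dict[str, float]:
    """Return cosine similarity between the image and each candidate phrase."""
    inputs = processor(text=phrases, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
        txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
        sims = (img @ txt.T).squeeze(0)
    return dict(zip(phrases, sims.tolist()))

# "listing_photo.jpg" is a placeholder; phrases are hard-coded for illustration.
image = Image.open("listing_photo.jpg")
phrases = ["a red leather handbag", "a gold chain strap", "a person holding the bag"]
scores = grounding_scores(image, phrases)
flagged = [p for p, s in scores.items() if s < 0.20]  # threshold is an assumption
print(scores, flagged)
```

CLIP similarity is a weak signal on its own (it misses fine-grained attributes and counts), so in practice this kind of check is usually combined with object detection or a multimodal judge model rather than used alone.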
Curious how teams here are approaching hallucination detection for vision-language models today.