This wasn't an OCR error. The model didn't confuse a "7" for a "1." It generated a plausible-looking receipt from scratch — different store, different items, different prices. If I hadn't been holding the original, I might not have caught it.
Same image, different model (same parameter count, same hardware), five seconds later: every item correct, store name right, total accurate to the penny.
The models: minicpm-v 8B (fabricated) vs qwen3-vl 8B (accurate). Both open source, both ~6GB VRAM, both running locally via Ollama on an RTX 5080.
What I learned:
1. Vision model hallucination is qualitatively different from text hallucination. A text model gives you a wrong answer to a real question. A vision model gives you a confident answer to an image it didn't process. The second is harder to detect.
2. Model selection matters more than prompt engineering for vision. Same prompt, same image — one model fabricated, one read accurately. No prompt optimization fixes a model that invents data.
3. Confidence scoring is mandatory. I added a reconciliation check: do the extracted items sum to roughly the stated total? This catches fabrication that looks plausible at the individual line-item level.
4. The fix wasn't more money or a bigger model. Same size (8B), same hardware, same cost ($0). Just a different architecture that actually reads pixels instead of generating plausible text about them.
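The reconciliation check in point 3 can be sketched in a few lines. This is my own illustration, not the author's pipeline code — the function name, data shape, and tolerance are assumptions; the idea is simply: sum the extracted line items and compare against the stated total, flagging anything that drifts beyond a small margin.

```python
def reconcile(items, stated_total, tolerance=0.02):
    """Sanity-check an extracted receipt.

    items: list of (description, price) tuples pulled from the image.
    stated_total: the total the model claims the receipt shows.
    tolerance: allowed relative drift (2% here, to absorb rounding
               and minor read errors without passing fabrications).

    Returns (ok, computed_sum).
    """
    computed = round(sum(price for _, price in items), 2)
    # Compare against a relative margin; floor at 1.0 so tiny
    # receipts don't get a near-zero tolerance window.
    ok = abs(computed - stated_total) <= tolerance * max(stated_total, 1.0)
    return ok, computed

# Plausible-looking fabricated items rarely happen to sum to the
# real total, so this cheap check catches whole-receipt invention.
print(reconcile([("milk", 3.50), ("bread", 2.25)], 5.75))   # sums match
print(reconcile([("milk", 3.50), ("bread", 2.25)], 12.40))  # flagged
```

It won't catch a fabrication whose invented items happen to sum correctly, but as a zero-cost gate it converts "trust the model" into "trust the model when its own numbers agree with each other."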
Full writeup with the pipeline architecture and code patterns: https://dev.to/rayne_robinson_e479bf0f26/my-ai-read-a-receipt-wrong-it-didnt-misread-it-it-made-one-up-4f5n