I recently built an ingredient analysis feature for a health app (Meadow Mentor). During the design phase, I tested two architectural approaches to handle the complexity of identifying unsafe ingredients.
The Hypothesis:
Could a lower-cost "Lite" model with a highly structured system prompt match the accuracy of a "Thinking" (reasoning) model, but with better unit economics?
The Experiment:
I built and tested two configurations against the same validation set:
Configuration A: The "Brain" Approach
- Stack: Gemini 2.5 Flash (Thinking Mode enabled).
- Logic: Relied on the model's internal reasoning loop to process the image and determine safety.
Configuration B: The "Structure" Approach
- Stack: Gemini 2.5 Flash Lite.
- Logic: Disabled reasoning. Used a 4-step System Prompt to force a linear path (Extract -> Normalize -> Verify -> Format).
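For concreteness, here is a rough sketch of what the two configurations look like, using the google-genai Python SDK for illustration. The prompt text, model IDs, and thinking_budget values below are simplified stand-ins, not the exact production code:

    # Rough sketch of the two configurations (google-genai Python SDK).
    # The system prompt is a compressed stand-in for the real 4-step prompt.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

    # Configuration B's linear path, expressed as one system prompt.
    STRUCTURED_PROMPT = """You analyze ingredient label photos. Follow these steps in order:
    1. Extract: list every ingredient printed on the label.
    2. Normalize: map synonyms and E-numbers to canonical ingredient names.
    3. Verify: check each canonical name against the unsafe-ingredient list provided.
    4. Format: return JSON {"unsafe": [...], "safe": [...], "uncertain": [...]}."""

    def analyze_label(image_bytes: bytes, structured: bool):
        """Run one configuration against a label photo and return the raw response."""
        image = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")
        if structured:
            # Configuration B: Flash Lite, reasoning disabled, forced linear steps.
            model = "gemini-2.5-flash-lite"
            config = types.GenerateContentConfig(
                system_instruction=STRUCTURED_PROMPT,
                thinking_config=types.ThinkingConfig(thinking_budget=0),  # no thinking tokens
            )
        else:
            # Configuration A: Flash with its internal reasoning loop left on.
            model = "gemini-2.5-flash"
            config = types.GenerateContentConfig(
                system_instruction="Identify any unsafe ingredients on this label.",
                thinking_config=types.ThinkingConfig(thinking_budget=-1),  # dynamic thinking
            )
        return client.models.generate_content(
            model=model,
            contents=[image, "Analyze this ingredient label."],
            config=config,
        )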
The Results:
Configuration B (Structured) significantly outperformed Configuration A on efficiency while maintaining 100% accuracy on the test set.
- Tokens: Reduced by 61% (3,595 -> 1,396). The "Thinking" model generated massive internal token overhead.
- Latency: Reduced by 43% (21s -> 12s).
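For anyone reproducing this: per-call token counts (including the internal thinking overhead) and latency can be read roughly like the sketch below, which builds on the analyze_label() function above. The usage_metadata field names are from the google-genai SDK; the timing harness is a simplified stand-in, not the actual test script.

    # Measuring the comparison (builds on analyze_label() above).
    import time

    def measure(image_bytes: bytes, structured: bool) -> None:
        start = time.perf_counter()
        response = analyze_label(image_bytes, structured=structured)
        latency_s = time.perf_counter() - start

        usage = response.usage_metadata
        # thoughts_token_count is the internal "thinking" overhead; it should
        # be zero (or None) for Configuration B with thinking_budget=0.
        print(f"structured={structured} latency={latency_s:.1f}s "
              f"total_tokens={usage.total_token_count} "
              f"thinking_tokens={usage.thoughts_token_count}")

    with open("label.jpg", "rb") as f:  # hypothetical test image
        photo = f.read()
    measure(photo, structured=False)  # Configuration A
    measure(photo, structured=True)   # Configuration B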
Conclusion:
For defined business logic, "Thinking" models introduce unnecessary cost and latency. "Dumb" models with smart prompts are still the superior engineering choice for production reliability.
I wrote up the full case study on the design process here: https://reidkimball.com/case-studies/cutting-ai-feature-cost...