Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of pro, super fast by comparison, and some basic A/B testing shows little to no difference in output on the majority of tasks I used it for.
Cut all my subs, spend less money, don't get rate limited
gemini-3-flash stands well above gemini-2.5-pro
I've been using the smaller models ever since. Nano/mini, flash, etc.
I recently found out that Grok-4.1-fast has similar pricing (in cents) but a 10x larger context window (2M tokens instead of the ~128-200k of gpt-4.1-nano), plus a ~4% hallucination rate, the lowest in blind tests on LLM Arena.
I'm unwilling to look past Musk's politics, immorality, and manipulation on a global scale
Anecdotal experience about which model is better is pointless. There are too many variables, the gap in the benchmarks is minimal, and the tool wielder makes more difference.
Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes the eval but comes across worse with a smaller model.
I'd be careful is all.
I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.
(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)
For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. I was pleasantly surprised when the LLM-as-judge scored gpt-5-mini as the clear winner. I don't think I would have considered it for these specific use cases, assuming higher reasoning was necessary.
Still waiting on human evaluation to confirm the LLM Judge was correct.
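For anyone curious how that kind of comparison gets wired up, here's a minimal sketch assuming the OpenAI Python SDK; the candidate list, judge model, question, and 1-5 scale are placeholder choices for illustration, not the setup described above.

    # Sketch: run the same questions through each candidate model and have a
    # judge model score the answers. Model names, the sample question, and
    # the 1-5 scale are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()
    CANDIDATES = ["gpt-5-mini", "gpt-5"]   # hypothetical candidate set
    JUDGE = "gpt-5"                        # hypothetical judge model

    def answer(model: str, question: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content

    def judge_score(question: str, reply: str) -> int:
        # Ask the judge for a single integer; parsing is deliberately naive.
        prompt = (
            "Rate this answer from 1 to 5 for clinical accuracy and clarity. "
            f"Reply with the number only.\nQ: {question}\nA: {reply}"
        )
        resp = client.chat.completions.create(
            model=JUDGE,
            messages=[{"role": "user", "content": prompt}],
        )
        return int(resp.choices[0].message.content.strip())

    questions = ["What are first-line treatments for mild hypertension?"]
    for model in CANDIDATES:
        scores = [judge_score(q, answer(model, q)) for q in questions]
        print(model, sum(scores) / len(scores))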
On the other hand, this would be interesting for measuring agents in coding tasks, but there's quite a lot of context to provide here; both input and output would be massive.
Any resources you can recommend to properly tackle this going forward?
- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N
Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
Why: Reduces volatility of responses while still maintaining creativeness (temperature) needed for good intuition
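In code, that weighted rubric is only a few lines; the Y/N verdicts below are hard-coded to show the arithmetic, but in practice they'd come from a judge (human or LLM).

    # Weighted Y/N rubric from above: each criterion is judged pass/fail,
    # then combined with the 0.5 / 0.3 / 0.2 weights.
    WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}

    def rubric_score(verdicts: dict[str, bool]) -> float:
        """verdicts maps each criterion name to its Y/N result."""
        return sum(w for name, w in WEIGHTS.items() if verdicts.get(name, False))

    # Example: cited the policy and gave next steps, but tone was off -> 0.7
    print(rubric_score({"accuracy": True, "tone": False, "next_steps": True}))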
Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
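A rough sketch of that regenerate-on-failure loop, with the model call and the check suite left as parameters since the actual stack isn't described here:

    from typing import Callable

    def generate_with_feedback(
        prompt: str,
        generate: Callable[[str], str],          # your model call
        run_checks: Callable[[str], list[str]],  # returns failure messages; empty = pass
        max_attempts: int = 3,
    ) -> str:
        # First attempt, then retry with the failure reasons appended.
        reply = generate(prompt)
        for _ in range(max_attempts - 1):
            failures = run_checks(reply)
            if not failures:
                break
            feedback = "Your previous answer failed these checks:\n- " + "\n- ".join(failures)
            reply = generate(f"{prompt}\n\n{feedback}\nRevise your answer to fix these issues.")
        return reply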
By having independent tests, checking whether the output passes each one (yes or no), and then weighting some of them (the more complicated tasks) more heavily than others when scoring.
IME deep thinking has moved from upfront architecture to post-prototype analysis.
Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging
With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate
When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.
The shift: from "design away problems" to "evaluate into solutions."
“You’re absolutely right! Nice catch how I absolutely fooled you”
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
It’s shocking to me how often it happens. Aside from just the necessity to be able to prove something works, there are so many other benefits.
Cost and model commoditization are part of it, like you point out. There's also the potential for degraded performance because off-the-shelf benchmarks aren't generalizing how you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There are like 95 serverless models in Bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.
But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.
Evaluation has been a critical practice in machine learning for years. IMO it's no less imperative when building with LLMs.
petcat•2h ago
It sounds like he's building some kind of AI support chatbot.
I despise these things.