I fire up the IDE, switch the model, and think, "oh great, this is better." Then I switch back to something that worked before and, man, this sucks now.
Context switching between LLMs, model release fatigue.
Personally I only use Claude/Anthropic and ignore the other providers because I understand it better. It's smart enough, and I rarely need the latest and greatest.
One way to avoid this: stick with one LLM and bet on the company behind it (meaning, over time, they’ll always have the best offering). I’ve bet on OpenAI. Others may reach different conclusions.
Hats off to the folks who have decided to deal with the nascent versions though.
I use AI mostly for problems on my fringes: things like manipulating an Excel table somebody sent me with invoice data from one of our suppliers, plus some moderately complex question that they (pure business people) don't know how to handle, where simple formulas would not suffice and I would have to start learning Power Query. I can tell the AI exactly what I want in plain language and don't have to learn a system that I only use because people here use it to fill holes not yet served by "real" software (databases, automated EDI data exchange, and code that automates the business processes). It works great, and it saves me hours on fringe tasks that people outsource to me but that I don't really want to deal with too much either.
For example, I also don't check various vendors and models against one another. I still stick to whatever the default is from the first vendor I signed up with, and so far it has worked well enough. If I were to spend time checking vendors and models, the knowledge would be outdated far too quickly for my taste.
On the other hand, I don't use it for my core tasks yet. Too much movement in this space, I would have to invest many hours in how to integrate this new stuff when the "old" software approach is more than sufficient, still more reliable, and vastly more economical (once implemented).
Same for coding. I ask AI on the fringes where I don't know enough, but in the core that I'm sufficiently proficient with I wait for a more stable AI world.
I don't solve complex sciency problems, I move business data around. Many suppliers, many customers, different countries, various EDI formats; everybody has slightly different data, naming, and procedures. For example, one vendor wants some share of pre-payment early in the year, which I have to apply to thousands of invoices over the year, while tracking when we have to pay hundreds or thousands of invoices, all with different payment conditions and timings. If I were to ask the AI, I would have to be so super specific I may as well write the code.
But I love AI on the not-yet-automated edges. I'm starting to show others how they can ask some AI, and many are surprised how easy it is - when you have the right task and know exactly what you have and what you want. My last colleague-convert was someone already past retirement age (still working on the business side). I think this is a good time to gradually teach regular employees some small use cases to get them interested, rather than some big top-down approach that mostly creates more work and leaves many people rightly questioning what the point is.
As for politically tinged questions, like whether I should use an EU-made AI (like the one this topic is about) rather than one from a US vendor that already dominates much of the software world: I don't care at this point, because I'm not yet creating any significant dependencies. I am glad to see it happening though (as an EU-country citizen).
Another nice thing about waiting a bit: one can see how much of a penalty (if any) the EU models pay for doing things somewhat ethically. I suspect it won't be much.
The server is basically just my Windows gaming PC, and the client is my editor on a macOS laptop.
Most of this effort is so that I can prepare for the arrival of that mythical second half of 2026!
[1] https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...
[2] https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22...
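For anyone wanting the same server/client split: per the Ollama FAQ linked above, the server side is controlled by the `OLLAMA_HOST` environment variable. A minimal sketch; the LAN address and model tag below are placeholders I chose, not details from the comment:

```shell
# On the Windows gaming PC (server): make Ollama listen on all
# interfaces instead of just localhost, then start the server.
set OLLAMA_HOST=0.0.0.0:11434
ollama serve

# On the macOS laptop (client): point the Ollama CLI (or an editor
# plugin) at the server. 192.168.1.50 is a hypothetical LAN address.
export OLLAMA_HOST=http://192.168.1.50:11434
ollama run qwen2.5-coder:7b "write a quick CSV de-duplication script"
```

Editor integrations generally take the same base URL (`http://<server>:11434`), so the laptop never needs a GPU of its own.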
Not useful, though; I just like the idea of having so much compressed knowledge on my machine in just 20 GB. In fact I disabled all Siri features because they're dogshit.
Maybe something like a collective that buys the GPUs together and then uses them without leaking data could work.
In particular it’s important to get past the whole need-to-self-host thing. Like, I was holding out for this stuff to plateau, but that keeps not happening, and the things we’re starting to be able to build in 2025, now that we have fairly capable models like Claude 4, are super exciting.
If you just want locally runnable commodity “boring technology that just works” stuff, sure, cool, keep waiting. If you’re interested in hacking on interesting new technology (glances at the title of the site) now is an excellent time to do so.
Fine tuning will work for niche business use cases better than promises of AGI.
I was listening to a Taiwanese news channel earlier today and although I wasn't paying much attention, I remember hearing about how Chinese AIs are biased towards Chinese political ideas and that some programme to create a more Taiwanese-aligned AI was being put in place.
I wouldn't be surprised if, just for this reason, at least a few different open models kept being released: even if they don't directly bring in money, several actors care more about spreading or defending their ideas, and AIs are perfect for that.
One theory is that they believe the real endpoint value will be embodied AIs (i.e. robots), where they think they'll hold a long-term competitive advantage. The models themselves will become commoditized, under the pressure of the open-source models.
For wage workers, not learning the latest productivity tools will result in job loss. By the time it is expected of your role, if you have not learned already, you won't be given the leniency to catch up on company time. There is no impactful resistance to this through individual protest, only by organizing your peers in industry.
I’m big on AI, but vibe coding is such a fuck around and find out situation.
Sites like simonwillison.net/2025/jul/ and channels like https://www.youtube.com/@aiexplained-official also cover new model releases pretty quickly for some "out of the box thinking/reasoning" evaluations.
For me and my usage I can really only tell if I start using the new model for tasks I actually use them for.
My personal benchmark andrew.ginns.uk/merbench has full code and data on GitHub if you want a starting point!
This used to be a good example of innovation that is hard to copy. But it doesn't apply anymore for two reasons:
1. Apple went from being an agile, pro-developers, creative company to an Oracle-style old-board milking-cow company; not much innovation is happening at Apple anymore.
2. To their surprise, much of what they call "innovative" is actually pretty easy to replicate on other platforms. It took 4 hours for Flutter folks to re-create Liquid Glass...
Steve Jobs did say they "patented the hell out of [the iPhone]" and went about saber-rattling. Then came the patent wars, which proved that Apple also relies on innovation by others and that patent workarounds would still result in competitive products; things calmed down afterwards.
Well, OpenAI copied the Deep Research feature from Google. They even used the same name (as does Mistral).
All of the major labs are innovating and copying one another.
Anthropic has all of the other labs trying to come up with an "agentic" protocol of their own. They also seem to be way ahead on interpretability research.
DeepSeek came up with multi-head latent attention and published an open-source model that's huge and SOTA.
DeepMind is way ahead on world models.
...
Bear in mind that there are a lot of very strong _open_ STT models that Mistral's press release didn't bother to compare against, giving the impression that theirs is the best new open thing since Whisper. Here is an open benchmark: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard . The strongest model Mistral compared against is Scribe, ranked 10th there.
This benchmark is for English, but many of those models are multilingual (eg https://huggingface.co/nvidia/canary-1b-flash )
One element of comparison is OpenAI Whisper v3, which achieves 7.44 WER on the ASR leaderboard and shows up as ~8.3 WER on FLEURS in the Voxtral announcement[0]. If FLEURS adds roughly +1 WER on average compared to the ASR leaderboard, that would imply Voxtral does have a lead on ASR.
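The offset reasoning above amounts to a one-anchor back-of-envelope estimate. A minimal sketch; the only hard numbers are Whisper v3's two scores from the comment, and the helper name is mine:

```python
# Estimate a model's Open ASR Leaderboard WER from its FLEURS WER,
# using Whisper large-v3 as the single anchor point.
whisper_asr_wer = 7.44      # Open ASR Leaderboard (English average)
whisper_fleurs_wer = 8.3    # approximate, read off the Voxtral announcement

# Observed offset between the two benchmarks for the anchor model (~0.86)
fleurs_offset = whisper_fleurs_wer - whisper_asr_wer

def implied_asr_wer(fleurs_wer: float) -> float:
    """Shift a FLEURS score by the anchor offset (crude: one data point,
    and benchmark gaps are rarely constant across models)."""
    return fleurs_wer - fleurs_offset
```

If Voxtral's FLEURS WER sits below `whisper_fleurs_wer`, the implied leaderboard score would land below 7.44, i.e. ahead of Whisper v3; with a single anchor model this is only suggestive, not conclusive.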
Also note that Voxtral's capacity is not necessarily all devoted to speech, since it "retains the text understanding capabilities of its language model backbone".
For a quick test I've uploaded a photo of my home office and asked the following prompt: "Retouch this photo to fix the gray panels at the bottom that are slightly ripped, make them look brand new"
Input image (rescaled): https://i.imgur.com/t0WCKAu.jpeg
Output image: https://i.imgur.com/xb99lmC.png
I think it did a fantastic job. The output image quality is ever so slightly worse than the original but that's something they'll improve with time I'm sure.
OpenAI just yesterday added the ability to do higher fidelity image edits with their model [1], though I'm not sure if the functionality is only in the API or if their chat UI will make use of this feature too. Same prompt and input image: [2]
I couldn't help but notice that you can still see the shadows of the rips in the fixed version. I wonder how hard it would be to get those fixed as well.
There is a lot of value in, say, engineers using these tools for tradeoff studies as a huge head start.
Agreed about Google, accuracy is a little better on the paid version but the reports are still frustrating to read through. They're incredibly verbose, like an undergrad padding a report to get to a certain word count.
"Be terse" is a mandatory part of the prompt now.
Either it's to increase token counts so they can charge more, or to show better usage-growth metrics internally or to shareholders, or it's just some odd effect of fine-tuning or the system prompt... who knows.