However, checking the results, my personal overall winner, if I had to pick only ONE, would probably be
deepseek/deepseek-chat-v3-0324
which is a good compromise between fast, cheap and good :-) Only for specific tasks (write a poem...) would I prefer a thinking model.

I tried signing up for OpenAI; way too much friction. They start asking for payment before you've even used any free credits. Guess what, that's one sure way to lose business.
Same for Claude: I couldn't even get Claude through Vertex, as it's available only in limited regions and I'm in Asia Pacific right now.
While this is true, you can download the OpenAI open source model and run it in Ollama.
The thinking is a little slow, but the results have been exceptional vs other local models.
My current favorite to run on my machine is OpenAI's gpt-oss-20b because it only uses 11GB of RAM and it's designed to run at that quantization size.
I also really like playing with the Qwen 3 family at various sizes and I'm fond of Mistral Small 3.2 as a vision LLM that works well.
If I have an internet connection I'll use GPT-5 or Claude 4 or Gemini 2.5 instead - they're better and they don't need me to dedicate a quarter of my RAM or run down my battery.
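For anyone who wants to try it, a minimal sketch assuming the ollama Python package and that the model has already been pulled with `ollama pull gpt-oss:20b`:

    # Minimal sketch; assumes the Ollama server is up and the
    # gpt-oss:20b model has been pulled.
    import ollama

    resp = ollama.chat(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp["message"]["content"])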
I find this the most surprising. I have yet to cross the 50% threshold from bullshit to possible truth, in any kind of topic I use LLMs for.
Once you've done that, your success rate goes way up.
Although that changed this year with o3 (and now GPT-5) getting really good at using Bing for search: https://simonwillison.net/2025/Apr/21/ai-assisted-search/
I’m making a tool to analyse financial transactions for accountants and identify things like misallocated expenses. Initially I was getting an LLM to try to analyse hundreds of transactions in one go. It was correct roughly 40-50% of the time, inconsistent, and it hallucinated frequently.
I changed the method to a simple yes/no question and to analysing each transaction individually. Now it is correct 85% of the time and very consistent.
Same model, same question essentially but a different way of asking it.
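A rough sketch of that per-transaction framing, assuming the openai Python client; the model, prompt wording, and helper name are illustrative, not the actual tool:

    # One narrow yes/no question per transaction instead of hundreds
    # at once; model and prompt are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()

    def is_misallocated(transaction: str, category: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": (f"Transaction: {transaction}\n"
                                   f"Allocated to: {category}\n"
                                   "Is this expense misallocated? "
                                   "Answer yes or no.")}],
            temperature=0,  # favour consistency over variety
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")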
I’m just not so sure it’s black and white. At least in my experience it hasn’t been.
But have it create a script or program in any language you want to do the same, and I'm 99% sure it'll get it right the first time.
People use LLMs like graphing calculators; they're not. But you can have one MAKE a calculator, and it'll get it right.
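The point, in code: asked "what's 1000 at 5% compounded for 10 years?" a model may flub the arithmetic, but asked to write the calculator it usually nails it. A hypothetical example of the kind of program it produces:

    # Hypothetical output: the model writes the calculator instead of
    # doing the arithmetic itself.
    def compound_interest(principal: float, rate: float, years: int) -> float:
        return principal * (1 + rate) ** years

    print(round(compound_interest(1000, 0.05, 10), 2))  # 1628.89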
Can you put your intuition into words so we can learn from you?
The most recent technical example I can remember (and now would be a good time to have the actual prompt) was when I asked whether MySQL has a way to run UPDATE without waiting for locks, basically ignoring rows that are locked. It (Sonnet 4, IIRC) answered "of course" and gave me an invalid query of the form `UPDATE ... SKIP LOCKED;`
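For the record, MySQL 8.0 does support SKIP LOCKED, but only on locking reads, not on UPDATE itself. The usual valid pattern is a locking SELECT followed by an UPDATE on the rows it returned; a sketch with hypothetical table and column names:

    -- SKIP LOCKED is only valid on SELECT ... FOR UPDATE / FOR SHARE.
    START TRANSACTION;
    SELECT id FROM jobs
        WHERE status = 'pending'
        LIMIT 10
        FOR UPDATE SKIP LOCKED;
    -- then update only the ids that SELECT returned:
    UPDATE jobs SET status = 'claimed' WHERE id IN (/* ids from above */);
    COMMIT;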
I can't imagine what damage this does if people are using it for questions they don't/can't verify. Programming is relatively safe in this regard.
But as I noted in my other reply, there will be a bias on my side, as I probably disregard questions that I know how to easily find answers to. That's not something I'd applaud AI for.
This is surely the greatest weakness of current LLMs for any task needing a spark of creativity.
Surely it is a question of prompting with some context (in UI mode), or with the additional kicker of temperature (if using the API)?
At the very least, some setup prompt such as "Give me 5 scenarios for a text adventure game" would break the sameness?
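For the API route, a minimal sketch of the temperature kicker, assuming the openai Python client (the model name is just a placeholder):

    # Higher temperature -> more varied sampling; run it a few times
    # and compare the outputs.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    for _ in range(3):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "Give me 5 scenarios for a text adventure game"}],
            temperature=1.2,
        )
        print(resp.choices[0].message.content)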
There have always been theories that OpenAI and other LLM providers cache some responses - this could be one hypothesis.
There must be something strange going on (most likely training on each others' wrong outputs, but I dunno)
Oh, and an interesting finding: Kagi's selector indicates that they're offering Deepseek Chat v3.1's non-reasoning version, but when I ran it without web search it appeared to mess up and output some of its chain of thought, so it clearly is thinking.
Add in multimodality and 1M context, and it is such a Swiss army knife.
It is cheap and performant enough to run 100k queries (it took a bit over a day and cost around 30 euros for a major document classification task). Yes, in theory this could have been done with a fine-tuned BERT or maybe even with some older methods, but it saved way too much time.
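A rough sketch of that kind of job, assuming the google-genai Python client; the labels, prompt, and model tag are illustrative, not the actual task:

    # Illustrative per-document classification loop; labels, prompt,
    # and model tag are assumptions, not the commenter's actual job.
    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    LABELS = ["invoice", "contract", "report", "other"]

    def classify(doc_text: str) -> str:
        prompt = (f"Classify this document as one of {LABELS}. "
                  f"Answer with the label only.\n\n{doc_text}")
        resp = client.models.generate_content(
            model="gemini-2.5-flash", contents=prompt)
        return resp.text.strip()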
There is another factor that may explain why Flash is #1 in most categories on OpenRouter - Flash has gotten reasonably decent at less common human languages.
Most cheap models (including Flash Lite) and local models have English-focused training.
> Grok I forgot about until it was too late.
I was surprised by how much I prefer Grok to others. Even its persona is how I prefer it: detailed without volunteering unwanted information or sycophancy. In general I'd use Grok 3 more than 4; it's good enough for common uses.
I suspect that Claude would be best only if I gave it a long, complex task with enough instructions up front, so it could grind away on it while I was doing something else and not waiting on it.
The job was set on Friday and ready on Monday. On average it was about 5k tokens in (documents ranging from 1k to 200k tokens in size) and only about 10 tokens out.
Average response was about 1.5 seconds, so ~40 hours for the full set.
I really did some heavy prompt testing to limit output.
Even then, every few thousand queries you'd get some doubled responses. That is, Gemini would respond in duplicate, e.g. "Daisy Daisy".
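A hypothetical guard for that quirk: detect when the output is just itself repeated, and retry.

    # Hypothetical retry guard for the doubled-output quirk ("Daisy
    # Daisy"); ask_model() is a stand-in for the actual query helper.
    def ask_model(query: str) -> str:
        ...  # one Gemini Flash call, ~10 output tokens

    def looks_duplicated(answer: str) -> bool:
        words = answer.split()
        half = len(words) // 2
        return (len(words) >= 2 and len(words) % 2 == 0
                and words[:half] == words[half:])

    def ask_with_retry(query: str, max_tries: int = 3) -> str:
        answer = ask_model(query)
        for _ in range(max_tries - 1):
            if not looks_duplicated(answer):
                break
            answer = ask_model(query)
        return answer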
I have always juggled a multitude of API keys, and this article has almost convinced me to try OpenRouter. It is easy enough to purchase compute credit from Chinese companies, but I could save myself some time.
The thing I liked most in the analysis was the emphasis on speed and cost.
The speed and cost issue is important. I recently read that many AI startups in the USA are favoring faster and cheaper models from China (China is quietly overtaking America with open AI models - iX Broker: https://ixbroker.com/blog/china-is-quietly-overtaking-americ...). The Economist has a similar article but it is paywalled.
This is not entirely true. It was entirely true for o3, but for GPT-5 it is only true for streaming and for reasoning summaries. I use GPT-5 with reasoning through the API without verifying my identity, I just don’t get the summary of the reasoning step. I don’t miss it, either. I never read them anymore now that the surprise has worn off.
I cannot use it directly in OpenAI's API, even today:
> $ llm -m gpt-5-mini Hello
> Error: Error code: 400 - {'error': {'message': 'Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.', 'type': 'invalid_request_error', 'param': 'stream', 'code': 'unsupported_value'}}
Until two days ago, OpenAI on OpenRouter was Bring-Your-Own-Key (BYOK), so it didn't work there either.
Since they dropped BYOK, I can indeed use it through OpenRouter:
> $ llm -m openrouter/openai/gpt-5-mini Hello
> Hi — hello! How can I help you today?
This is for both `gpt-5` and `gpt-5-mini`. The `-nano` has always worked. I'm going to try some of my evals on gpt-5-mini, but it doesn't feel like I can depend on it.
We use Glowroot (an open source Java APM). I was trying to compile it on my Mac, and some of the protobuf Maven plugins threw an error. I gave Copilot the entire pom.xml, the specific error, and the versions being used. It sent me on a complete wild goose chase and hallucinated like crazy, even suggesting upgrades to versions that do not exist or recommending parameters that have no use in the plugin.
Long story short, I just went to the GitHub issues page of the Maven plugin and searched, and someone had posted a solution. Again, the solution wasn't new; it was suggested around the time Apple started using ARM for their laptops. It was there on GitHub, and yet Copilot hallucinated.
So I don't feel too confident about coding assistants. Yes, they do a decent enough job getting your boilerplate done, but they're hopeless at resolving specific issues.
giancarlostoro•5mo ago
There are other sites similar to Perplexity that host multiple models as well. I have not tried the plethora of others; I feel like Perplexity does the most to make sure whatever model you pick works right for you, and all its output is usefully catalogued.
mark_l_watson•5mo ago
That said, I have been having too much fun running Meilisearch to build a local search index of many web sites that I use for reference, combined with a small Python app that also uses local models running on Ollama. I will probably wrap this into an example to add to one of my books: not that practical, but fun.
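The shape of that combination, as a minimal sketch; the index name, query, and model tag are assumptions, and it presumes Meilisearch and Ollama are already running locally:

    # Minimal sketch: a local Meilisearch index feeding a local Ollama
    # model. Index name, query, and model tag are assumptions.
    import meilisearch
    import ollama

    search = meilisearch.Client("http://localhost:7700")
    hits = search.index("reference").search("vector embeddings")["hits"]
    context = "\n\n".join(h.get("content", "") for h in hits[:3])

    resp = ollama.chat(
        model="qwen3:4b",  # any small local model works here
        messages=[{"role": "user",
                   "content": f"Using these notes:\n{context}\n\n"
                              "Summarize what they say about vector embeddings."}],
    )
    print(resp["message"]["content"])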
giancarlostoro•5mo ago
I've had this idea revolving around how I would make a search engine that's more useful to developers, but not enough time to work on it.