In simplebench gpt-oss (120 bn) flopped hard so it doesn't appear particularly good at logical puzzles either.
So presumably, this comes down to...
- training technique or data
- dimension
- lower number of large experts vs higher number of small experts
Edit: found this analysis, it's on the HN frontpage right now
> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
“What is the word that starts with S, ends with E, and contains A? → SAEA”
Then when I said that’s not a word and you gave me the answer already, no fun, it said
“I do not have access to confirm that word.”
"I need to find a word that starts with S, ends with E, and contains A.
Let me think through this step by step:
Starts with S
Ends with E
Contains A somewhere in the middle
One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle. Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:
Starts with S
Ends with E
Contains A
Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position. Let me think of other possibilities:
"SQUARE" - starts with S, ends with E, contains A (in the middle)
"SALE" - starts with S, ends with E, contains A
"SAME" - starts with S, ends with E, contains A
Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A. The word is SAME."
EDIT: I now have also questioned the smaller gpt-oss-20b (free) 10 times via OpenRouter (default settings, provider was AtlasCloud) and the answers were: sage, sane, sane, space, sane, sane, sane, sane, space, sane.
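For anyone repeating this experiment, here is a minimal sketch for tallying the replies and checking them against the puzzle's constraints. The helper name `fits_puzzle` is made up for illustration, and it only checks the letter pattern, not whether the string is a real English word.

```python
from collections import Counter


def fits_puzzle(word: str) -> bool:
    """True if the word starts with S, ends with E, and contains A."""
    w = word.strip().lower()
    return w.startswith("s") and w.endswith("e") and "a" in w


# The ten gpt-oss-20b answers reported in the comment above.
answers = ["sage", "sane", "sane", "space", "sane",
           "sane", "sane", "sane", "space", "sane"]

tally = Counter(answers)
valid = {word: fits_puzzle(word) for word in tally}

print(tally.most_common())   # how often each answer came back
print(valid)                 # whether each distinct answer fits the pattern
print(fits_puzzle("SAEA"))   # the reply quoted earlier in the thread
```

By this check all ten reported answers fit the pattern, while "SAEA" does not because it ends with A.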
You are either very unlucky, your configuration is suboptimal (weird system prompt perhaps?) or there is some bug in whichever system you are using for inference.
For example, I was using the DeepSeek web UI and getting decent, on-point answers, but it simply does not have the latest data.
So while DeepSeek R1 might be a better model than Grok 3 or even Grok 4, not having access to "twitter data" basically puts it behind.
The same is the case with OpenAI: if OpenAI has access to fast data from GitHub, it can help with bug fixes that Claude or Gemini 2.5 Pro can't.
A model can be smarter, but if it does not have the data to base its inference upon, it's useless.
sqrt(120*5) ~= 24
GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model
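The arithmetic above looks like the geometric-mean rule of thumb for mixture-of-experts models: take the square root of total parameters times active parameters. That is a community heuristic rather than an established result, and the function name below is made up for illustration.

```python
import math


def effective_dense_params(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts, in billions.

    This is only a rough heuristic for the 'dense-equivalent' size of a
    sparse MoE model; it is not a measured quantity.
    """
    return math.sqrt(total_b * active_b)


# Figures used in the comment above: 120B total, ~5B active.
print(round(effective_dense_params(120, 5), 1))

# Same estimate with the 5.1B active figure quoted elsewhere in the thread.
print(round(effective_dense_params(120, 5.1), 1))
```

Both inputs land in the mid-20s, so the "~24B" figure holds to within rounding under this heuristic.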
Which model, inference software and hardware are you running it on?
The 30B-A3B variant flies on any GPU.
In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.
Gemini Pro 2.5 with the diff-fenced edit format rarely fails. So I don't see the Qwen3 hype, unless I am using the wrong edit format. Can anyone tell me which edit format works better with Qwen3?
So I just use qwen3. Fast and great output. If for some reason I don't get what I need, I might use search engines or Perplexity.
I have a 10 GB 3080 and a Ryzen 3600X with 32 GB of RAM.
Qwen3-coder is amazing. Best I used so far.
Maybe ollama has some defaults it applies to models? I start testing models at 0 temp and tweak from there depending how they behave.
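One way to rule out hidden defaults is to set the sampling options explicitly on each request. A minimal sketch against Ollama's local REST endpoint follows; it assumes the default address `http://localhost:11434`, the model tag `qwen3-coder` is a placeholder for whatever you have pulled, and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build a non-streaming generate request with an explicit temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


def generate(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_request(model, prompt, temperature)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Inspect the payload without needing a server; "qwen3-coder" is a placeholder tag.
print(json.dumps(build_request("qwen3-coder", "Write a haiku about diffs."), indent=2))
```

Starting at temperature 0 and raising it from there, as described above, then only requires changing the `temperature` argument.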
Diff is failing me. Or do you guys use whole?
I’d like to know how far the local models are from the frontier models for agentic coding.
This is contrary to what I've seen in a large ML shop, where architectural tuning was king.
I use the gpt-oss and qwen3 models a lot (smaller models locally using Ollama and LM Studio) and commercial APIs for the full size models.
For local model use, I get very good results with gpt-oss when I "over prompt," that is, I specify a larger amount of context information than I usually do. Qwen3 is simply awesome.
Until about three years ago, I had always understood neural network models (starting in the 1980s), GAN, Recurrent, LSTM, etc. well enough to write implementations. I really miss the feeling that I could develop at least simpler LLMs on my own. I am slowly working through Sebastian Raschka's excellent book https://www.manning.com/books/build-a-large-language-model-f... but I will probably never finish it (to be honest).
Tencent's hunyuan-turbos, another hybrid, is currently ranked at 22. https://arxiv.org/abs/2505.15431
Wait, is this true? That seems like a wild statement to make, relatively unsubstantiated?
tl;dr: I'll save you a lot of time trying things out for yourself. If you are on a >=32 GB Mac, download LM Studio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool calling ability. The tool calling ability of gpt-oss doesn't even come close in my observations.
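The ~20 GB figure is roughly what a back-of-the-envelope estimate gives. The sketch below counts only the weights, assuming a nominal 30B parameters at 5 bits each; the KV cache, quantization scales, and runtime overhead come on top, and the function name is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes.

    Counts weights only; KV cache and runtime overhead are not included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal 30B parameters at 5 bits per weight.
print(weight_memory_gb(30, 5))  # 18.75
```

That leaves a little over 1 GB between the weights-only estimate and the reported ~20 GB, which is plausibly context and overhead.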
It was able to create a sample page, try starting a server, recognise that a leftover server was running, kill it (which forced a prompt for my permission), retry, and find out its IP for me to open in the browser.
This isn't a demo anymore. That's actually very useful help for interns/juniors already.
Chrome latest on Ubuntu.