Qwen3-4B-Thinking-2507

https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507
186•IdealeZahlen•19h ago

Comments

gok•18h ago
So this 4B dense model gets very similar performance to the 30B MoE variant with a 7.5x smaller footprint.
smallerize•17h ago
It gets similar performance to the old version of the 30B MoE model, but not the updated version. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
Imustaskforhelp•16h ago
I still think that's very commendable though.

I am running this beast on my dumb PC with no GPU. Now we are talking!

esafak•18h ago
This one should work on personal computers! I'm thankful for Chinese companies raising the floor.
frontsideair•18h ago
According to the benchmarks, this one improves on every single one compared to the previous version, and on some it even beats 30B-A3B. Definitely worth a try; it'll easily fit into memory and token generation will be pleasantly fast.
GaggiX•17h ago
There is a new Qwen3-30B-A3B; you are comparing it to the old one. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
tolerance•18h ago
Is there like a leaderboard or power rankings sort of thing that tracks these small open models and assigns ratings or grades to them based on particular use cases?
esafak•18h ago
https://artificialanalysis.ai/leaderboards/models?open_weigh...
cowpig•18h ago
Compare these rankings to actual usage: https://openrouter.ai/rankings

Claude is not cheap, so why is it far and away the most popular if it's not in the top 10 in performance?

Qwen3 235B ranks highest on these benchmarks among open models, but I have never met someone who prefers its output over DeepSeek R1. It's extremely wordy and often gets caught in thought loops.

My interpretation is that the models at the top of ArtificialAnalysis focus the most on public benchmarks in their training. Note I am not saying xAI is necessarily doing this nefariously; it could just be that they decided it's better bang for the buck to rely on public benchmarks than to focus on building their own evaluation systems.

But Grok is not very good compared to the Anthropic, OpenAI, or Google models despite ranking so highly in benchmarks.

GaggiX•18h ago
Claude Opus is in the top 10. Also, people on OpenRouter mostly use these models for coding, and Claude models are particularly good at that; the benchmark doesn't account only for coding capability, though.
byefruit•17h ago
The OpenRouter rankings can be biased.

For example, Google's inexplicable design decisions around libraries and APIs mean it's often worth the 5% premium to just use OpenRouter to access their models. In other cases it's about which models particular agents default to.

Sonnet 4 is extremely good for tool-using agentic setups though, something I have found other models struggle with over a long context.

ImageXav•17h ago
Thanks for sharing that. Interesting that the leaderboard is dominated by Anthropic, Google, and DeepSeek. OpenAI doesn't even register.
reilly3000•16h ago
OpenAI has a lot of share that simply doesn't show up on OpenRouter. Typical enterprise chatbot apps use its API directly without paying a tax, and may use litellm with another vendor for fallback.
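A minimal sketch of that fallback pattern, assuming litellm's completion API; the model names are placeholders, and a production setup would more likely use litellm's built-in router/fallback configuration:

    from litellm import completion  # pip install litellm

    def chat(messages):
        # Try the primary vendor first; fall back to a second on any API error.
        # Both model identifiers below are placeholders, not recommendations.
        try:
            return completion(model="openai/gpt-4o-mini", messages=messages)
        except Exception:
            return completion(model="anthropic/claude-3-haiku-20240307", messages=messages)

    print(chat([{"role": "user", "content": "Hello!"}]))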
esafak•17h ago
I shared a link to small, open source models; Claude is neither.
whimsicalism•17h ago
Grok is not bad; I think Grok 4 is better than Claude for most things other than tool calling.

Of course, this is a politically charged subject now, so fair assessments might be hard to come by, as evidenced by the downvotes I've already gotten on this comment.

threeducks•16h ago
OpenRouter rankings conflate many factors like output quality, popularity, price, and legal concerns. They cannot tell us whether a model is popular because it is genuinely good, or because many people have heard of it, or because it is free, or because the lawyers trust the provider.
wkat4242•7h ago
> But Grok is not very good compared to the anthropic, openai, or google models despite ranking so highly in benchmarks.

That's political, I think. I know several alt-right types who swear by Grok because "Elon doesn't give it any of that woke crap". They don't care that there are better options; for them it's the only viable one.

decide1000•15h ago
Qwen3-30B-A3B-2507 is much faster on my machine than gpt-oss-20B. This leaderboard does not reflect that.
tolerance•15h ago
This is perfect. Thanks.
jampa•17h ago
Am I reading this right, is this model way better than Gemma 3n[1]? (Only for the benchmarks that are common to both models.)

LiveCodeBench: E4B IT 13.2 vs. Qwen 55.2

AIME25: E4B IT 11.6 vs. Qwen 81.3

[1]: https://huggingface.co/google/gemma-3n-E4B

meatmanek•16h ago
Reasoning models do a lot better at AIME than non-reasoning models, with o3 mini getting 85% and 4o-mini getting 11%. It makes some sense that this would apply to small models as well.
film42•17h ago
Is there a crowd-sourced sentiment score for models? I know all these scores are juiced like crazy. I stopped taking them at face value months ago. What I want to know is if other folks out there actually use them or if they are unreliable.
nurettin•17h ago
This has been around for a while https://lmarena.ai/leaderboard/text/coding
klohto•17h ago
openrouter usage stats
esafak•17h ago
https://openrouter.ai/rankings

The new Qwen3 model is not on there yet.

setsewerd•16h ago
Since the ranking is based on token usage, wouldn't this ranking be skewed by the fact that small models' APIs are often used for consumer products, especially free ones? Meanwhile reasoning models skew it in the opposite direction, but to what extent I don't know.

It's an interesting proxy, but idk how reliable it'd be.

matznerd•15h ago
Also, these small models are meant to be run locally, so they're not going to appear on OpenRouter...
hnfong•17h ago
Besides the LM Arena leaderboard mentioned by a sibling comment, if you go to the r/LocalLlama subreddit, you can very unscientifically get a rough sense of the models' performance by reading the comments (and maybe even checking the upvotes). I think the crowd's knee-jerk reaction is unreliable, but that's what you asked for.
NitpickLawyer•15h ago
Not anymore though. It used to be the place to vibe-check a model ~1 year ago, but lately it's filled with toxic my-team-vs.-your-team posturing, memes about CEOs (wtf), and generally poor takes on a lot of things.

For a while it was China vs. the world, but lately it's even more divided, with heavy camping on specific models. You can still get some signal, but you have to either block a lot of accounts or read /new during different time zones to get some of that "I'm just here for the tech stack" vibe from posters.

littlestymaar•15h ago
Yeah, some people just can't stop acting as if tech companies were sports teams, and it gets annoying fast.
parineum•10h ago
I don't really go there much anymore, but when I did, there seemed to be an inordinate amount of Chinese nationalism from young accounts writing odd English.
svnt•16h ago
It is interesting to think about how they are achieving these scores. The evals are rated by GPT-4.1. Beyond just overfitting to benchmarks, is it possible the models are internalizing how to manipulate the ratings model/agent? Is anyone manually auditing these performance tables?
nisten•16h ago
If you want to have an opinion on it, just install LM Studio and run the q8_0 version of it, e.g. here: https://huggingface.co/bartowski/Qwen_Qwen3-4B-Instruct-2507....

You can even run it on a 4 GB Raspberry Pi: Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf https://lmstudio.ai/

Keep in mind that if you run it at the full 262144 tokens of context you'll need ~65 GB of RAM.

Anyway, if you're on a Mac you can search for "qwen3 4b 2507 mlx 4bit" and run the MLX version, which is often faster on M-series chips. Crazy impressive what you get from a 2 GB file, in my opinion.

It's pretty good for summaries etc., and can even make simple index.html sites if you're teaching students, but it can't really vibecode in my opinion. However, for local automation tasks like summarizing your emails or home automation, it is excellent.

It's crazy that we're at this point now.

Aeroi•16h ago
How about on Apple silicon, for the iPhone?
jasonjmcghee•16h ago
https://joejoe1313.github.io/2025-05-06-chat-qwen3-ios.html
esafak•16h ago
Thank you. To spare Mac readers time:

mlx 4bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 5bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 6bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 8bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

edit: corrected the 4b link

ckcheng•15h ago
Did you mean mlx 4bit:

https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

belter•14h ago
This comment saved 3 tons of CO2
magnat•15h ago
> if you run it at the full 262144 tokens of context you'll need ~65 GB of RAM

What is the relationship between context size and the RAM required? Isn't the amount of RAM related only to the number of parameters and quantization?

DSingularity•15h ago
No. Your KV cache is kept in memory also.
Gracana•14h ago
The context cache (or KV cache) is where intermediate attention results are stored, one entry per token in the context. Its size depends on the model architecture and dimensions.

KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped query attention, which reduces num_key_value_heads compared to a multi head attention (MHA) model.

With batch size 1 (for low-latency single-user inference), 32k context (recommended in the model card), fp16 precision:

2 * 1 * 32768 * 8 * 128 * 36 * 2 = 4,831,838,208 bytes ≈ 4.5 GiB.

I think, anyway. It's hard to keep up with this stuff. :)
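A minimal Python check of that arithmetic, reusing the constants above (num_key_value_heads=8, head_dim=128, num_layers=36, which I'm assuming match Qwen3-4B's published config):

    def kv_cache_bytes(context_len, num_layers=36, num_kv_heads=8,
                       head_dim=128, element_size=2, batch_size=1):
        # 2 = key + value; element_size of 2 bytes = fp16
        return (2 * batch_size * context_len * num_kv_heads
                * head_dim * num_layers * element_size)

    print(kv_cache_bytes(32768) / 2**30)   # ~4.5 GiB at the recommended 32k context
    print(kv_cache_bytes(262144) / 2**30)  # ~36 GiB at the full 262144-token context

Weights and runtime buffers come on top of the cache itself.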

wkat4242•7h ago
Yes, but you can quantise the KV cache too, just like you can the weights.
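For scale, under the same assumptions as the formula above, an 8-bit cache element (1 byte instead of fp16's 2) halves the cache; whether and how you enable that depends on the runtime:

    # Qwen3-4B KV cache at 32k context with 1-byte (8-bit) elements:
    print(2 * 1 * 32768 * 8 * 128 * 36 * 1 / 2**30)  # ~2.25 GiB, half the fp16 figure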
hnuser123456•14h ago
A 24 GB GPU can run a ~30B-parameter model at 4-bit quantization with about 8k-12k of context before every GB of VRAM is occupied.
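A back-of-the-envelope sketch of that budget, with every constant assumed for illustration rather than taken from the thread:

    # ~30B weights at ~4.5 bits/weight (typical of Q4 GGUF quants)
    weights_gb = 30e9 * 4.5 / 8 / 1e9     # ~17 GB of weights
    free_gb = 24 - weights_gb - 1.5       # minus ~1.5 GB assumed runtime/compute buffers
    per_token_mb = 0.5                    # assumed fp16 KV cost per token; varies by model
    print(free_gb * 1024 / per_token_mb)  # ~11.5k tokens, the same ballpark as above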
0x457•13h ago
I mean...where do you think context is stored?
aitchnyu•4h ago
What's the space complexity with respect to context size? And who is trying to drop it to linear complexity?
Demiurge•10h ago
I've been trying this today, and I'm getting a lot of hallucinated suggestions. However, the analysis of problems is really quite good.

"I closed MPEG on 2 Jun '20 when I left because obscure forces had hijacked it."

https://leonardo.chiariglione.org/
70•eggspurt•1h ago•21 comments

New AI Coding Teammate: Gemini CLI GitHub Actions

https://blog.google/technology/developers/introducing-gemini-cli-github-actions/
28•michael-sumner•1h ago•11 comments

We replaced passwords with something worse

https://blog.danielh.cc/blog/passwords
384•max__dev•9h ago•304 comments

About AI

https://priver.dev/blog/ai/about-ai/
23•emil_priver•2h ago•8 comments

Cracking the Vault: How we found zero-day flaws in HashiCorp Vault

https://cyata.ai/blog/cracking-the-vault-how-we-found-zero-day-flaws-in-authentication-identity-and-authorization-in-hashicorp-vault/
113•nihsy•4h ago•45 comments

Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/
177•philipkiely•8h ago•95 comments

Claude Code IDE integration for Emacs

https://github.com/manzaltu/claude-code-ide.el
674•kgwgk•22h ago•229 comments

Gaybreaking

https://twitter.com/AlexReibman/status/1953229500973740058
25•miohtama•26m ago•6 comments

Debounce

https://developer.mozilla.org/en-US/docs/Glossary/Debounce
64•aanthonymax•2d ago•34 comments

Project Hyperion: Interstellar ship design competition

https://www.projecthyperion.org
283•codeulike•14h ago•211 comments

Rules by which a great empire may be reduced to a small one (1773)

https://founders.archives.gov/documents/Franklin/01-20-02-0213
182•freediver•11h ago•116 comments

A candidate giant planet imaged in the habitable zone of α Cen A

https://arxiv.org/abs/2508.03814
86•pinewurst•9h ago•28 comments

Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

https://github.com/KittenML/KittenTTS
867•divamgupta•1d ago•333 comments

Children's movie leads art historian to long-lost Hungarian masterpiece (2014)

https://www.theguardian.com/world/2014/nov/27/stuart-little-art-historian-long-lost-hungarian-masterpiece
10•how-about-this•3d ago•0 comments

Litestar is worth a look

https://www.b-list.org/weblog/2025/aug/06/litestar/
289•todsacerdoti•15h ago•78 comments

Jules, our asynchronous coding agent

https://blog.google/technology/google-labs/jules-now-available/
301•meetpateltech•19h ago•199 comments

Writing a Rust GPU kernel driver: a brief introduction on how GPU drivers work

https://www.collabora.com/news-and-blog/blog/2025/08/06/writing-a-rust-gpu-kernel-driver-a-brief-introduction-on-how-gpu-drivers-work/
272•losgehts•19h ago•33 comments

Did Craigslist decimate newspapers? Legend meets reality

https://www.poynter.org/business-work/2025/did-craigslist-kill-newspapers-poynter-50/
18•zdw•3d ago•6 comments

Herbie detects inaccurate expressions and finds more accurate replacements

https://herbie.uwplse.org/
66•bwidlar•3d ago•6 comments

We'd be better off with 9-bit bytes

https://pavpanchekha.com/blog/9bit.html
156•luu•15h ago•274 comments

A fast, growable array with stable pointers in C

https://danielchasehooper.com/posts/segment_array/
197•ibobev•17h ago•72 comments

The Bluesky Dictionary

https://www.avibagla.com/blueskydictionary/
172•gaws•14h ago•51 comments

40 Years of the Amiga

https://www.goto10retro.com/p/40-years-of-the-amiga-from-commodore
55•rbanffy•3h ago•20 comments

What is the average length of a queue of cars? (2023)

https://e-dorigatti.github.io/math/2023/11/01/queue-length.html
24•alexmolas•3d ago•8 comments

Scientists have recreated the Universe's first molecule

https://www.sciencedaily.com/releases/2025/08/250803011840.htm
15•LAsteNERD•2d ago•8 comments

Automerge 3.0

https://automerge.org/blog/automerge-3/
322•surprisetalk•3d ago•29 comments

Mac history echoes in current Mac operating systems

http://tenfourfox.blogspot.com/2025/08/mac-history-echoes-in-mac-operating.html
120•classichasclass•8h ago•39 comments

Multics

https://www.multicians.org/multics.html
123•unleaded•18h ago•28 comments

Comptime.ts: compile-time expressions for TypeScript

https://comptime.js.org/
137•excalo•3d ago•29 comments

Breaking the sorting barrier for directed single-source shortest paths

https://www.quantamagazine.org/new-method-is-the-fastest-way-to-find-the-best-routes-20250806/
153•baruchel•20h ago•46 comments