Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.
Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765
Some of the reasons could be:
- mitigating US AI supremacy
- commoditizing AI use to push innovation forward and to sell the platforms that run it, e.g. if the iPhone wins at local intelligence, that benefits China, because China manufactures those phones
- the talent war inside China
- softening sentiment against China in the US
- they're just awesome people
- and many more
https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507
I'll use this to identify and caption meal photos and user photos for other workflows. Very cool!
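For reference, a minimal sketch of what that captioning call could look like through OpenRouter's OpenAI-compatible chat API. The "qwen/qwen3-vl-235b-a22b-instruct" slug, the OPENROUTER_API_KEY variable, and the image URL are illustrative assumptions, not taken from the thread:

```python
# Minimal sketch: caption one image through OpenRouter's OpenAI-compatible API.
# Assumptions (not from the thread): the model slug, the env var, and the image URL.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen3-vl-235b-a22b-instruct",  # assumed slug; check OpenRouter's model list
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption this meal photo in one short sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/meal.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

The same request shape works for batching user photos: loop over image URLs and reuse the client.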
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
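If you want to run the open weights yourself, here is a rough local-inference sketch. It assumes the Instruct checkpoint loads through transformers' generic AutoModelForImageTextToText / AutoProcessor classes and multimodal chat templates (the exact class and minimum transformers version are on the model card), that you have hardware for a 235B-parameter MoE, and that the image URL is a placeholder:

```python
# Minimal sketch: caption an image with the open Qwen3-VL Instruct weights via transformers.
# Assumptions: the generic Auto* classes and chat-template image loading work for this
# checkpoint (check the model card); the image URL is a placeholder.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-235B-A22B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # 235B MoE: multi-GPU or offloading required
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/meal.jpg"},  # placeholder image
        {"type": "text", "text": "Caption this photo in one short sentence."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
caption = processor.batch_decode(
    output[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0]
print(caption)
```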