Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.
The relevant comparison is on page 15: https://arxiv.org/abs/2509.17765
https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507
I'll use this to identify and caption meal pictures and user pictures in other workflows. Very cool!
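For anyone wanting to wire up a captioning workflow like this, here is a minimal sketch against an OpenAI-compatible chat endpoint (the style OpenRouter exposes). The model slug and endpoint below are assumptions for illustration; check the provider's catalog for the actual Qwen3-VL identifier.

```python
import json
import urllib.request

# Assumed values for illustration -- verify against the provider's docs.
MODEL = "qwen/qwen3-vl-235b-a22b-instruct"  # hypothetical slug
API_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_caption_request(image_url: str, prompt: str = "Caption this image.") -> dict:
    """Build an OpenAI-style chat payload with one image attachment."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def caption(image_url: str, api_key: str) -> str:
    """Send the request and return the model's caption text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_caption_request(image_url)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Inspect the payload locally without spending an API call.
    payload = build_caption_request("https://example.com/meal.jpg")
    print(json.dumps(payload, indent=2))
```

The same payload shape works for batching: loop over image URLs and call `caption()` per image, or swap the prompt to pull structured fields (dish name, ingredients) instead of free-form captions.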
natrys•2h ago
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct