Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start but they ended up shipping separate custom models based on it for their audio features. As far as I can tell GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.
I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space for all three, but I expect it probably does.
For example, beyond video->text->LLM and video->embeddings->LLM pipelines, you can also have an LLM controlling/guiding a separate video extractor.
See this paper for a pretty thorough overview.
Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (No. arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
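As a rough illustration of that last pattern (an LLM guiding a separate extractor), here's a minimal Python sketch of the control loop; every function and type in it is a hypothetical placeholder, not an API from the survey or any particular system:

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start_s: float
    end_s: float
    caption: str  # text produced by a separate vision/captioning model


def extract_clips(video_path: str, query: str) -> list[Clip]:
    """Hypothetical standalone video extractor (frame sampler + captioner)
    that the LLM calls like a tool; swap in whatever you actually use."""
    raise NotImplementedError


def llm(prompt: str) -> str:
    """Hypothetical call to any text-only LLM."""
    raise NotImplementedError


def answer_about_video(video_path: str, question: str) -> str:
    # 1. The LLM decides *what* to look for before any frames are processed.
    search_plan = llm(
        f"What visual events would answer: {question}? "
        "Reply with a short search query."
    )
    # 2. A separate extractor turns only the relevant parts of the video into text.
    clips = extract_clips(video_path, search_plan)
    context = "\n".join(
        f"[{c.start_s:.0f}-{c.end_s:.0f}s] {c.caption}" for c in clips
    )
    # 3. The LLM answers from the extracted descriptions; a real system would
    #    loop back to step 1 if the context turns out to be insufficient.
    return llm(f"Video notes:\n{context}\n\nQuestion: {question}")
```

The point of the sketch is step 1: the LLM decides what to extract before any heavy video processing happens, which is what distinguishes this from a fixed video->text->LLM pipeline.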
> "Bonjour, pourriez-vous me dire comment se rendreà la place Tian'anmen?"
translation: "Hello, could you tell me how to get to Tiananmen Square?"
a bold choice!
E.g. if something similar had happened in Trafalgar Square, I expect it would still primarily be a major square in London to me, not "oh my god, they must be referring to that awful event". (In fact I think it was targeted in the 7/7 bombings, for example.)
Or a better example to go with your translation - you can refer to the Bastille without 'boldly' invoking the histoire of its storming in the French Revolution.
No doubt the US media has referred to the Capitol without boldness many times since 6 Jan '21.
I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.
I'm pretty happy about that - I was worried it'd be another 200B+.
- Additional modalities
- Faster FPS (inferences per second)
- Reaction time tuning (latency vs quality tradeoff) for visual and audio inputs/outputs
- Built-in planning modules in the architecture (think premotor frontal lobe)
- Time awareness during inference (towards an always-inferring / always-learning architecture)
It has an entertaining selection of different voices, including:
*Dylan* - A teenager who grew up in Beijing's hutongs
*Peter* - Tianjin crosstalk, a professional straight man (the supporting role in a comedy duo)
*Cherry* - A sunny, positive, friendly, and natural young lady
*Ethan* - A sunny, warm, energetic, and vigorous boy
*Eric* - A Sichuan Chengdu man who stands out from the crowd
*Jada* - The fiery older sister from Shanghai
Depending on the architecture, this is something you could feasibly have in your house in a couple of years, or in an expensive "AI toaster".
Ever since ChatGPT added this feature I've been waiting for anyone else to catch up.
There are tons of hands-free situations like cooking where this would be amazing ("read the next step please, my hands are covered in raw pork", "how much flour for the roux", "crap, I don't have any lemons, what can I substitute?")
The Chinese are going to end up owning the AI market if the American labs don't start competing on open weights. Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it, if they care about privacy or owning their data. What a turn of events!
Wouldn't worry about that, I'm pretty sure the government is going to ban running Chinese tech in this space sooner or later. And we won't even be able to download it.
Not saying any of the bans will make any kind of sense, but I'm pretty sure they're gonna say this is a "strategic" space. And everything else will follow from there.
Download Chinese models while you can.
https://www.youtube.com/watch?v=_zdOrPju4_g