Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

https://the-decoder.com/qwen3-vl-can-scan-two-hour-videos-and-pinpoint-nearly-every-detail/

265•thm•2mo ago

Comments

thot_experiment•2mo ago

anyone have a tl;dr for me on what the best way to get the video comprehension stuff going is? i use qwen-30b-vl all the time locally as my goto model because it's just so insanely fast, curious to mess with the video stuff, the vision comprehension works great and i use it for OCR and classification all the time

xrd•2mo ago

How much VRAM do you need for local usage may I ask?

moralestapia•2mo ago

To me, this qualifies as some sort of ASI already.

visioninmyblood•2mo ago

I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. We'll have to see how the multimodal space progresses:

link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52

Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo

colechristensen•2mo ago

I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it was 45 seconds to do it myself.

A lot of my side projects involve UIs and almost all of my problems with getting LLMs to write them for me involve "The UI isn't doing what you say it's doing" and struggling to get A) a reliable way to get it to look at the UI so it can continue its loop and B) getting it to understand what it's looking at well enough to do something about it

visioninmyblood•2mo ago

I agree: Claude, ChatGPT, and even Gemini do a poor job of detecting and cropping into a region, which is one of the simplest tasks. Qwen is also great at summarization but not at solving simple vision tasks like cropping, segmentation, and detection. Here is an example where we compared Claude, Gemini, ChatGPT, and other frontier models on simple (and complicated) visual tasks: https://chat.vlm.run/showdown#:~:text=Crop%20into%20the%20cl...

colechristensen•2mo ago

The part that was funny to me is I would respond "is that right?" and it would tell me exactly how it was wrong and proceed to do it incorrectly again in a very similar but different way. It was like a Monty Python sketch. I might have also been very tired and easily amused.

djmips•2mo ago

Does anyone else worry about this technology used for Big Brother type surveillance?

reactordev•2mo ago

Where have you been the last decade? It’s already in use, or models like it, by companies selling access to The State

https://deflock.me

Not to mention cloud platforms that collect evidence and process it with all the models and store that information for searching…

https://www.revir.ai

mptest•2mo ago

or if you prefer your depression in book format:

- Surveillance Capitalism by Zuboff
- Pegasus: A Spy in Your Pocket by Laurent Richard

eurekin•2mo ago

No mention of palantir?

bilbo0s•2mo ago

>It’s already in use, or models like it, by companies selling access to The State

Doesn't that pretty much cover Palantir as well?

bigyabai•2mo ago

Palantir's just the new guy on the block: https://en.wikipedia.org/wiki/Sentient_(intelligence_analysi...

fragmede•2mo ago

or even https://en.wikipedia.org/wiki/IBM_and_the_Holocaust

g-mork•2mo ago

warmly encourage you avoid reading the header files of the dahua camera SDK

71bw•2mo ago

Mind sharing a bit more insight?

basilgohar•2mo ago

How do you think this tech was developed in the first place? It's probably trained and used in the surveillance bid for a decade before it comes to consumers, and this probably isn't the SoA stuff that governments have access to, we're probably 5-10 years behind what's on the cutting edge.

speedgoose•2mo ago

I wouldn’t bet on it. IT innovation used to be led by the defence industry, but that has changed, and from what I have been told consumer technology is now driving the innovation.

I’m sure they have some cool secret stuff, but they are perhaps not 10 years ahead. Also, I find it unlikely that those secrets wouldn’t make it to the public now, as we are probably close to the top of the AI bubble.

protocolture•2mo ago

We got Facial Rec and LPR first, those are more dangerous for surveillance.

ants_everywhere•2mo ago

Big Brother is a reference to George Orwell's critique of Communism in Nineteen Eighty-Four.

Qwen is a video model trained by a Communist government, or technically by a company with very close ties to the Chinese government. The Chinese government also has laws requiring AI be used to further the political goals of China in particular and authoritarian socialism in general.

In the light of all this, I think it's reasonable to conclude that this technology will be used for Big Brother type surveillance and quite possible that it was created explicitly for that purpose.

Intermernet•2mo ago

Just nitpicking here, but 1984 is a critique of totalitarianism. The only references to systems of government in the book refer to "The German Nazis and the Russian Communists".

Orwell was a democratic socialist. He was opposed to totalitarian politics, not communism per se.

ants_everywhere•2mo ago

It's true that it's about totalitarianism to some extent. But we have Orwell's actual words here that it's chiefly about communism

> [Nineteen Eighty-Four] was based chiefly on communism, because that is the dominant form of totalitarianism, but I was trying chiefly to imagine what communism would be like if it were firmly rooted in the English speaking countries, and was no longer a mere extension of the Russian Foreign Office.

And of course Animal Farm is only about communism (as opposed to communism + fascism). And the lesser known Homage to Catalonia depicts the communist suppression of other socialist groups.

By all this I just mean to say when you're reading Nineteen Eighty-Four what he's describing is barely a fictionalization of what was already going on in the Soviet Union. There's just not a lot in the book that is specifically Nazi or Fascist.

I don't have any opinion on whether he thought there were non-totalitarian forms of communism.

justsomejew•2mo ago

I think that Orwell understood his own people much more than Russians, so it might be useful, while reading him, to take a look at the mirror as well..

PunchyHamster•2mo ago

It was already used before current AI explosion.

This is why keeping our governments from eating that tasty apple of "if you can record AND analyse everything there will be so much less crime" and "just give us keys to all private communication, we swear we will just use it to find bad guys" matters so much. Because someone will, and someone will use it to hit on people they don't like.

bgwalter•2mo ago

In surveillance and police states like The Netherlands it has been used since forever:

https://www.theguardian.com/cities/2018/mar/01/smart-cities-...

Now people will say again that this project has been abandoned, which just isn't true (2024):

https://www.dutchnews.nl/2024/06/smart-street-surveillance-o...

thijson•2mo ago

I was watching a crime solving show from the UK. A huge percentage of the crimes are solved using camera footage. Also, they use geofencing, looking at which phones went in and out of the crime location at the time of the crime.

fy20•2mo ago

I would be surprised if this hasn't existed for a few decades already.

Back in 2009 I was working at a place where O2 was a client, and they gave us an API that could identify the cell tower (inc. lat/lng) any of their customers were connected to. The network needs to track this data internally to function, so the API is basically the equivalent of their DNS.

kelipso•2mo ago

This tech would be a massive waste of computational resources to do that. Technology for what you said is way more efficient and has been working well for years now.

yieldcrv•2mo ago

2009, you rang?

spwa4•2mo ago

It's so weird how that works with transformers.

Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction tuned LLM, usually small because students) with OCR tokens bests just about every OCR network out there.

And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS, all works better that way. Now many research papers are only really about how to encode image/audio/video to feed it into a Llama or Qwen model.

zmmmmm•2mo ago

It is fascinating. Vision language models are unreasonably good compared to dedicated OCR and even the language tasks to some extent.

My take is it fits into the general concept that generalist models have significant advantages because so much more latent structure maps across domains than we expect. People still talk about fine tuning dedicated models being effective but my personal experience is it's still always better to use a larger generalist model than a smaller fine tuned one.

jepj57•2mo ago

Now apply that thinking to human-based neural nets...

kgeist•2mo ago

>People still talk about fine tuning dedicated models being effective

>it's still always better to use a larger generalist model than a smaller fine tuned one

Smaller fine-tuned models are still a good fit if they need to run on-premises cheaply and are already good enough. Isn't it their main use case?

bangaladore•2mo ago

Latency and size. Otherwise pretty much useless.

eurekin•2mo ago

Insane if true... now I wonder if I could use it to go through some old dance routine video catalogue to recognize and write out individual move lists

iib•2mo ago

I have been looking for the same thing, either from Meta's SAM 3 [1] model or from things like the OP.

There has been some research specifically in this area with what appears to be classic ML models [2], but it's unclear to me if it can generalize to dances it has not been trained on.

[1] https://ai.meta.com/blog/segment-anything-model-3/

[2] https://arxiv.org/html/2405.19727v1

mikae1•2mo ago

Hope this will one day be used for auto-tagging all video assets with time codes. The dream is being able to search for "running horse" and find a clip containing a running horse at 4m42s in one of thousands of clips.

laidoffamazon•2mo ago

It’s not difficult to hack this together with CLIP. I did this with about a tenth of my movie collection last week with a GTX 1080 - though it lacks temporal understanding so you have to do the scene analysis yourself

dynode•2mo ago

Would you be willing to share more details of what you did?

laidoffamazon•2mo ago

Sure. I had a lot of help from Claude Opus 4.5, but it was roughly:

- Using pyscenedetect to split each video on a per scene level

- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (specific rate I don't have handy right now, but it was 1-2 per scene)

- Aggregating frames in batches of around 256 frames to be normalized for CLIP embedding on GPU (had to re-write the normalization process for this because the default library does it on CPU)

- Uploading the frames along with metadata (timestamp, etc) into a vector DB, in my case Qdrant running locally along with a screenclip of the frame itself for debugging.

I'm bottlenecked by GPU compute so I also started experimenting with using Modal for the embedding work too, but then vacation ended :) Might pick it up again in a few weeks. I'd like to be able to have a temporal-aware and potentially enriched search so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and be able to get a timestamped clip from the movie.
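The "1-2 frames per scene" sampling step above can be sketched roughly as follows. This is a minimal illustration, not the commenter's code: the scene boundaries would come from a scene detector such as pyscenedetect, and the function name `sample_timestamps`, the `(start, end)` tuples in seconds, and the choice of evenly spaced sample points are all assumptions made for the example.

```python
def sample_timestamps(scenes, per_scene=2):
    """Pick `per_scene` evenly spaced sample times (in seconds) inside each scene.

    `scenes` is a list of (start, end) pairs in seconds. In a real pipeline
    these would come from a scene detector; here they are plain numbers.
    """
    samples = []
    for scene_id, (start, end) in enumerate(scenes):
        length = end - start
        for k in range(per_scene):
            # Sample at the centre of each of `per_scene` equal sub-intervals,
            # which avoids the cut points where frames are often blurred.
            t = start + length * (k + 0.5) / per_scene
            samples.append({"scene": scene_id, "t": round(t, 3)})
    return samples


if __name__ == "__main__":
    # Two hypothetical scenes: 0-10 s and 10-14 s.
    for row in sample_timestamps([(0.0, 10.0), (10.0, 14.0)]):
        print(row)
```

Each returned row carries the scene index and timestamp, which is the kind of metadata that would be stored next to the frame embedding so a search hit can be mapped back to a position in the video.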

vhcr•2mo ago

I'm guessing you're not storing the CLIP embedding for every single frame, but rather one every second or so? Also, are you using the cosine similarity? How are you finding the nearest vector?

laidoffamazon•2mo ago

I split per scene using pyscenedetect and sampled from each. Distance is via cosine similarity- I fed it into qdrant
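For readers unfamiliar with the lookup step, here is a toy in-memory version of a cosine-similarity search over per-scene embeddings. It is only a sketch: the three-dimensional vectors are made up, a real CLIP embedding has hundreds of dimensions, and a vector DB like Qdrant would use an approximate index rather than the brute-force scan shown here. The names `cosine_similarity` and `search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def search(index, query, top_k=3):
    """Brute-force nearest neighbours: score every entry, return the best top_k."""
    scored = [(cosine_similarity(item["vector"], query), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Made-up embeddings with the timestamp metadata stored alongside them.
    index = [
        {"video": "a.mkv", "t": 12.5, "vector": [1.0, 0.0, 0.0]},
        {"video": "a.mkv", "t": 98.0, "vector": [0.0, 1.0, 0.0]},
        {"video": "b.mkv", "t": 282.0, "vector": [0.7, 0.7, 0.0]},
    ]
    for score, item in search(index, [0.9, 0.1, 0.0], top_k=2):
        print(f"{score:.3f} {item['video']} @ {item['t']}s")
```

In the text-to-video case the query vector would be the text embedding of the search phrase, produced by the same model that embedded the frames.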

ArnavAgrawal03•2mo ago

you can do that with Morphik already :)

We use an embedding model that processes videos and allows you to perform RAG on them.

bn-l•2mo ago

Rag as in the content is used to generate an answer or rag as in searching for a video?

eurekin•2mo ago

Would it allow me to query my library for every movie that contains dance routine move1-move2-move3 in that order?

tontonius•2mo ago

this is a solved problem already — check out https://getjumper.io where you can do exactly this (search through 100s of hours) offline and locally.

Disclaimer: co-founder

xnx•2mo ago

Gemini already does this (and has for awhile): https://ai.google.dev/gemini-api/docs/video-understanding

clusterhacks•2mo ago

I was playing around with Qwen3-VL to parse PDFs - meaning, do some OCR data extraction from a reasonably well-formatted PDF report. Failed miserably, although I was using the 30B-A3B model instead of the larger one.

I like the Qwen models and use them for other tasks successfully. It is so interesting how LLMs will do quite well in one situation and quite badly in another.

totetsu•2mo ago

The Opus models seem pretty adept at extracting structured data from OCR: https://www.ocrarena.ai/battle

coppsilgold•2mo ago

> The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.

This seems to be somewhat unwise. Such an insertion would qualify as an anomaly. And if it's also trained that way, would you not train the model to find artificial frames where they don't belong?

Would it not have been better to find a set of videos where something specific (common, rare, surprising, etc) happens at some time and ask the model about that?

bigmadshoe•2mo ago

Yeah the needle in a haystack tests are so stupid. It seems clear with LLMs that performance degrades massively with context size, yet those tests claim the model performs perfectly.

patates•2mo ago

As someone who abuses gemini regularly with a 90% full context, the model performance does degrade for sure but I wouldn't call it massively.

I can't show any evidence as I don't have such tests, but it's like coding normally vs coding after a beer or two.

For the massive effect, fill it 95% and we're talking vodka shots. 99%? A zombie who can code. But perhaps that's not fair when you have 1M token context size.

oceansweep•2mo ago

https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...

IanCal•2mo ago

That rather depends on exactly how this is done, although it's a useful upper bound for many tasks either way. You could say the same for images and yet due to the way some work they straight up cannot see in certain ways.

This could describe adding a frame of nonsense into an existing video.

It also could describe finding a semantically useful thing in an actual video, where the exact location is randomised by looking at different time crops of the video. For example, finding a book on a desk in a video that's only there in a panning shot, and you then see if it can find it in a 10s cut, 20s cut, 10 minute cut, etc, and near the start/middle/end.

Here's the paper: https://arxiv.org/pdf/2511.21631

> To evaluate the model’s capability in processing long-context inputs, we construct a video “Needle-in-a-Haystack” evaluation on Qwen3-VL-235B-A22B-Instruct. In this task, a semantically salient “needle” frame—containing critical visual evidence—is inserted at varying temporal positions within a long video. The model is then tasked with accurately locating the target frame from the long video and answering the corresponding question. During evaluation, videos are uniformly sampled at 1 FPS, and frame resolution is dynamically adjusted to maintain a constant visual token budget.

This potentially sounds more like the former, but I can't find more accurate information on how this works.

Regardless I'd say again that while not the whole story things like this really are useful to know, and can be very important to test - it's really not a given that models can always find anything in their context window, perhaps even more so for video.
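To make the quoted setup a bit more concrete, here is a rough sketch of the two mechanical pieces it mentions: inserting a needle frame at a chosen relative position, and the arithmetic of a fixed visual token budget at 1 FPS. The paper excerpt does not say how the budget is divided between frames, so the even split below is an assumption, and the function names and the budget figure in the example are hypothetical.

```python
def frames_sampled(duration_s, fps=1.0):
    """Number of frames obtained by uniform sampling at `fps`."""
    return max(1, int(duration_s * fps))


def tokens_per_frame(total_budget, duration_s, fps=1.0):
    """Assumed even split of a constant visual token budget across frames.

    Longer videos leave fewer tokens per frame, which in practice would
    mean a lower per-frame resolution.
    """
    return total_budget // frames_sampled(duration_s, fps)


def insert_needle(frames, needle, position):
    """Insert `needle` at a relative `position` in [0, 1]; return (frames, index)."""
    idx = round(position * len(frames))
    return frames[:idx] + [needle] + frames[idx:], idx


if __name__ == "__main__":
    # A two-hour video at 1 FPS with a hypothetical budget of 1,000,000 tokens.
    print(frames_sampled(2 * 60 * 60))
    print(tokens_per_frame(1_000_000, 2 * 60 * 60))
    # Place the needle halfway through a tiny stand-in "video".
    print(insert_needle(["f0", "f1", "f2", "f3"], "NEEDLE", 0.5))
```

Sweeping `position` and the video length is what would produce the usual position-by-length accuracy grid for this kind of test.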

chhxdjsj•2mo ago

Not so relevant to the thread but ive been uploading screenshots from citrix guis and asking qwen3-vl for the appropriate next action eg Mouseclick, and while it knows what to click it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates returned?

8f2ab37a-ed6c•2mo ago

Also curious about this. I tried https://moondream.ai/ as well for this task and it felt still far from being bulletproof.

jazzyjackson•2mo ago

Could you combine it with a classic OCR segmentation process, so that along with the image you also provide box coordinates of each string?

logankeenan•2mo ago

It’s been about a year since I looked into this sort of thing, but molmo will give you x,y coordinates. I hacked together a project about it. I also think Microsoft’s omniparser is good at finding coordinates too.

https://huggingface.co/allenai/Molmo-7B-D-0924

https://github.com/logankeenan/george

https://github.com/microsoft/OmniParser

chhxdjsj•2mo ago

Thanks ill try this!

visioninmyblood•2mo ago

To get the exact coordinates, you can run a keypoint network to pinpoint the next click point. Here I show an example of a simple prompt that returns the keypoint location of the next button to click and visually localizes the point with a keypoint in the image:

https://chat.vlm.run/c/e12f0153-7121-4599-9eb9-cd8c60bbbd69

hamasho•2mo ago

It's not very accurate, but sometimes instructing it to return pyautogui code works.

  prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
  attachment: <screenshot>
  reply:
    import pyautogui
    pyautogui.click(100, 200)

chhxdjsj•2mo ago

Ive been asking for pyautogui output already but it is still very hit and miss

spherelot•2mo ago

How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).

Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:

```json
[
  {"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```

Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:

[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
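The conversion above can be written as a small helper. This is a sketch that assumes the model's reply is valid JSON in the shape shown earlier in this comment; the function names `bbox_to_pixels` and `bbox_center` are made up for illustration, and the 1920x1080 screenshot size is just an example.

```python
import json


def bbox_to_pixels(bbox, image_width, image_height, scale=1000):
    """Convert [x_min, y_min, x_max, y_max] from a 0-1000 range to pixels."""
    x_min, y_min, x_max, y_max = bbox
    return [
        x_min / scale * image_width,
        y_min / scale * image_height,
        x_max / scale * image_width,
        y_max / scale * image_height,
    ]


def bbox_center(pixel_bbox):
    """Integer centre point of a pixel-space box, e.g. as a click target."""
    x_min, y_min, x_max, y_max = pixel_bbox
    return (round((x_min + x_max) / 2), round((y_min + y_max) / 2))


if __name__ == "__main__":
    # Example reply in the format described above.
    reply = '[{"bbox_2d": [217, 112, 920, 956], "label": "cat"}]'
    for det in json.loads(reply):
        box = bbox_to_pixels(det["bbox_2d"], image_width=1920, image_height=1080)
        # The centre could then be passed to something like pyautogui.click(x, y).
        print(det["label"], box, bbox_center(box))
```

Note that the width and height must be those of the image as it was sent to the model; if the screenshot is resized before upload, scale the result back to screen coordinates separately.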

Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595

chhxdjsj•2mo ago

Will give this a go, cheers :)

CSMastermind•2mo ago

Still not great at the use cases I tested it for but Gemini isn't either. I think we're still very early on video comprehension.

re5i5tor•2mo ago

For anyone using Qwen3-VL: where are you running it? I had tons of reliability problems with Qwen3-VL inference providers on OpenRouter — based on uptime graphs I wasn’t alone. But when it worked, Qwen3-VL was pack-leading good at AI Vision stuff.

m00dy•2mo ago

I run it on ollama

nicman23•2mo ago

the big boy model?

adastra22•2mo ago

It's not that big of a model?

mkl•2mo ago

235B-A22B is pretty big.

lreeves•2mo ago

I run the larger version of it on a Threadripper with 512GB RAM and a 32GB GPU for the non-expert layers and context, using llama.cpp. Performs great, however god forbid you try to get that much memory these days.

sosodev•2mo ago

I’ve noticed that the open weight models have a lot of issues on OpenRouter. You get a lot of inconsistency in quality due to varying quants at least. I’ve had some seriously nonsensical responses from models that I can’t replicate at all when I switch providers. Lots that just randomly fail to handle requests too. I would recommend finding a provider that works best for your needs and pinning it.

btian•2mo ago

My company's GPU cluster

m00dy•2mo ago

I've used Qwen3-VL on deepwalker lately. All I can say is that this model is so underrated.

[0]: https://deepwalker.xyz

Alifatisk•2mo ago

Many of the models we have today seem to only perform OCR on the images you send and use the retrieved text as context when answering. However, Qwen-VL (and I guess Gemini now?) is different: they seem to "understand" the image I send with my prompt. They manage to capture spatial relationships, objects, and semantics from the image, which is very impressive. I’ve been telling my friends about the Qwen3-VL model option in Qwen Chat for a while because I feel like it’s underrated.

neves•2mo ago

My favorite AI feature is to put a YouTube link in Gemini and ask it to summarize. Or even better: put the link of a 20min video "5 ways to" and ask "what are the 5 ways?"

I think Gemini analyzes the transcription.

Can I do the same for free with Qwen3?

seidleroni•2mo ago

I don't believe that it just analyzes the transcription. I asked Gemini to look at the youtube video referenced on the site below and "build" something that duplicates that device. It did a pretty good approximation that it could not have done without going through the full video.

https://bitsnpieces.dev/posts/a-synth-for-my-daughter/

IncreasePosts•2mo ago

Same - I see a lot of "vaguely interesting but no way I'm spending 40 minutes on that" kind of videos, and it usually works. However, I have noticed it occasionally will just summarize the wrong video for me. It might be if the video is very new, or something? I'm not sure.

crispyambulance•2mo ago

I didn't understand what is meant by "pinpoint nearly every detail". The article is titled with that but then firehoses a bunch of technical details.

The github spells it out much better: https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#cookbo...

ta12653421•2mo ago

so, a great politician once said: "the internet is for p0rn" - and if we accept this, what will this thing pinpoint in p0rn videos? :-D LOL

spider-mario•2mo ago

There it is! Oh, it disappeared. There it is again! Oh, it disappeared. It’s back! Wait, no.

DrAwdeOccarim•2mo ago

Does anyone know how this actually was done? Like, did they export every frame as a PNG and then run them each one by one through the model? Or did they somehow "load" the video into the model directly (which then internally somehow steps through each frame?)
