*Introducing Molmo 2*: State-of-the-art video understanding, pointing, and tracking
Last year, Molmo helped push image understanding forward with pointing—grounded answers you can verify. Now, *Molmo 2* brings those capabilities to video—so the model doesn’t just answer questions, it can show you where and when something is happening.
On major industry benchmarks, Molmo 2 *surpasses most open multimodal models* and even *rivals closed peers* like Gemini 3 Pro and Claude Sonnet 4.5.
Molmo 2 returns pixel coordinates plus timestamps over videos, and pixel coordinates over images, enabling:
◘ Video + image QA
◘ Counting-by-pointing
◘ Dense captioning
◘ Artifact detection
◘ Subtitle-aware analysis
…and more!
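As a concrete sketch of what counting-by-pointing looks like to a client: the original Molmo emitted points as inline tags such as `<point x="…" y="…" alt="…">…</point>`, with coordinates given as percentages of the image size. Assuming Molmo 2 keeps a similar text schema (an assumption; check the model card for the exact format, especially how video timestamps are attached), the points can be pulled out of a reply with a few lines of Python:

```python
import re

def parse_points(text, width, height):
    """Extract (label, x_px, y_px) tuples from Molmo-style point tags.

    Assumes the Molmo 1 convention of percentage coordinates in tags like
    <point x="61.5" y="40.2" alt="dog">dog</point>; Molmo 2's actual schema
    may differ, so treat this as illustrative.
    """
    pattern = r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>([^<]*)</point>'
    points = []
    for x, y, label in re.findall(pattern, text):
        # Scale percentage coordinates to pixel coordinates for this frame.
        points.append((label, float(x) / 100 * width, float(y) / 100 * height))
    return points

# Hypothetical model reply for "How many dogs are in the image?"
reply = ('There are two dogs: <point x="25.0" y="50.0" alt="dog">dog</point> '
         'and <point x="75.0" y="50.0" alt="dog">dog</point>.')
print(parse_points(reply, width=640, height=480))
# → [('dog', 160.0, 240.0), ('dog', 480.0, 240.0)]
```

The count is simply the number of parsed points, which is what makes pointing-based counts verifiable: each claimed instance comes with a location you can check.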
Three variants depending on your needs:
*Molmo 2 (8B)*: Qwen 3 backbone, best overall performance
*Molmo 2 (4B)*: Qwen 3 backbone, fast + efficient
*Molmo 2-O (7B)*: Olmo backbone, fully open
We’ve also *significantly upgraded the Ai2 Playground*: you can now upload a video or multiple images to try summarization, tracking, and counting—while seeing exactly where the model is looking.
Demos:
*Counting objects & actions* (“How many times does the ball hit the ground?”)—returns the count plus space–time pointers for each event: https://www.youtube.com/watch?v=fvYfPTTTZ_w
*Ask-it-anything long-video QA* (“Why does the player change strategy here?”)—points to the moments supporting the answer: https://www.youtube.com/watch?v=Ej3Hb3kRiac
*Object tracking* (“Follow the red race car.”)—tracks it across frames with coordinates over time: https://www.youtube.com/watch?v=uot140v_h08
Try it and learn more:
▶ Playground: https://playground.allenai.org/
⬇ Models: https://huggingface.co/collections/allenai/molmo2
Blog: https://allenai.org/blog/molmo2
Report: https://allenai.org/papers/molmo2
API coming soon