*Introducing Molmo 2*: State-of-the-art video understanding, pointing, and tracking
Last year, Molmo helped push image understanding forward with pointing—grounded answers you can verify. Now, *Molmo 2* brings those capabilities to video—so the model doesn’t just answer questions, it can show you where and when something is happening.
On major industry benchmarks, Molmo 2 *surpasses most open multimodal models* and even *rivals closed peers* like Gemini 3 Pro and Claude Sonnet 4.5.
Molmo 2 returns pixel coordinates plus timestamps over videos, and pixel coordinates over images, enabling:
◘ Video + image QA
◘ Counting-by-pointing
◘ Dense captioning
◘ Artifact detection
◘ Subtitle-aware analysis
…and more!
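As a concrete sketch of what counting-by-pointing looks like to a client: the original Molmo emitted points as inline tags such as `<point x="…" y="…" alt="…">…</point>`, with coordinates given as percentages of the image size. Assuming Molmo 2 keeps a similar text schema (an assumption; check the model card for the exact format, especially how video timestamps are attached), the points can be pulled out of a reply with a few lines of Python:

```python
import re

def parse_points(text, width, height):
    """Extract (label, x_px, y_px) tuples from Molmo-style point tags.

    Assumes the Molmo 1 convention of percentage coordinates in tags like
    <point x="61.5" y="40.2" alt="dog">dog</point>; Molmo 2's actual schema
    may differ, so treat this as illustrative.
    """
    pattern = r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>([^<]*)</point>'
    points = []
    for x, y, label in re.findall(pattern, text):
        # Scale percentage coordinates to pixel coordinates for this frame.
        points.append((label, float(x) / 100 * width, float(y) / 100 * height))
    return points

# Hypothetical model reply for "How many dogs are in the image?"
reply = ('There are two dogs: <point x="25.0" y="50.0" alt="dog">dog</point> '
         'and <point x="75.0" y="50.0" alt="dog">dog</point>.')
print(parse_points(reply, width=640, height=480))
# → [('dog', 160.0, 240.0), ('dog', 480.0, 240.0)]
```

The count is simply the number of parsed points, which is what makes pointing-based counts verifiable: each claimed instance comes with a location you can check.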
Three variants depending on your needs:
*Molmo 2 (8B)*: Qwen 3 backbone, best overall performance
*Molmo 2 (4B)*: Qwen 3 backbone, fast + efficient
*Molmo 2-O (7B)*: Olmo backbone, fully open
We’ve also *significantly upgraded the Ai2 Playground*: you can now upload a video or multiple images to try summarization, tracking, and counting—while seeing exactly where the model is looking.
Demos:
*Counting objects & actions* (“How many times does the ball hit the ground?”)—returns the count plus space–time pointers for each event: https://www.youtube.com/watch?v=fvYfPTTTZ_w
*Ask-it-anything long-video QA* (“Why does the player change strategy here?”)—points to the moments supporting the answer: https://www.youtube.com/watch?v=Ej3Hb3kRiac
*Object tracking* (“Follow the red race car.”)—tracks it across frames with coordinates over time: https://www.youtube.com/watch?v=uot140v_h08
Try it and learn more:
▶ Playground: https://playground.allenai.org/
⬇ Models: https://huggingface.co/collections/allenai/molmo2
Blog: https://allenai.org/blog/molmo2
Report: https://allenai.org/papers/molmo2
API coming soon