This sentence from the paper makes me feel a little bad that I don't understand why this goal is obvious. I am not tracking why we are tracking pixels.
Is this basically a competing technology with YOLO[1] or SAM[2]?
[1]: https://en.m.wikipedia.org/wiki/You_Only_Look_Once
[2]: https://ai.meta.com/sam2/
Edit: added annotations, should've done that initially
The issue with bounding boxes is missed detections, occlusions, and impoverished geometrical information. But if you have a hundred points being stably tracked on an object, it's now much easier to keep tracking it through partial occlusions, figure out its 3D geometry and kinematics, and even re-identify it coming in and out of occlusion.
YOLO is mostly concerned with detecting objects of certain classes in a single image, and SAM with segmenting an image, essentially classifying pixels as belonging to an object or not. Neither establishes correspondences for the same physical point across frames, which is what point tracking provides.
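To make the occlusion point concrete, here is a toy sketch (not from the paper; all numbers are made up for illustration) of why many stable point tracks are robust where a single box is not: even if most tracked points vanish behind an occluder, the surviving ones still pin down the object's motion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 points tracked on a rigid object in frame t0.
pts_t0 = rng.uniform(0, 50, size=(100, 2))

# The object translates by (3, -2) pixels between frames.
true_shift = np.array([3.0, -2.0])
pts_t1 = pts_t0 + true_shift

# Partial occlusion: most points are lost in frame t1, a minority survive.
visible = rng.random(100) > 0.6

# The surviving tracks still recover the object's motion robustly
# (median displacement of visible point pairs).
est_shift = np.median(pts_t1[visible] - pts_t0[visible], axis=0)
print(est_shift)  # ≈ [ 3. -2.]
```

A lone bounding-box detector that misses the occluded frame has nothing to hand the next frame; the point cloud degrades gracefully instead.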