Imo the latter will be very useful for semantic planning and reasoning, but only after manipulation is solved.
A ballpark cost estimate -
- $10 to $20 hourly wages for the data collectors
- $100,000 to $200,000 per day for 10,000 hours of data
- ~1,500 to 2,500 data collectors doing 4 to 6 hours daily
- $750K to $1.25M on hardware costs at $500 per gripper
Fully loaded cost between $4M and $8M for 270,000 hours of data.
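To make the arithmetic behind the estimate above easy to check, here is a rough Python sketch. The function name and parameters are made up for illustration, and it only covers direct wages plus grippers. It assumes one gripper per collector and that the 270,000 hours are collected at 10,000 hours per day; whatever overhead goes into "fully loaded" is not modeled.

```python
import math

def ballpark(total_hours=270_000, hours_per_day=10_000,
             wage_range=(10, 20), shift_range=(4, 6), gripper_cost=500):
    """Direct cost ranges (low, high) in USD. Hypothetical helper, not from the thread."""
    lo_wage, hi_wage = wage_range
    short_shift, long_shift = shift_range
    # Fewest collectors with the longest shifts, most with the shortest.
    collectors = (math.ceil(hours_per_day / long_shift),
                  math.ceil(hours_per_day / short_shift))
    daily_wages = (hours_per_day * lo_wage, hours_per_day * hi_wage)
    # Assumption: one gripper per collector.
    hardware = tuple(n * gripper_cost for n in collectors)
    wages_total = (total_hours * lo_wage, total_hours * hi_wage)
    return {
        "days": total_hours / hours_per_day,
        "collectors": collectors,
        "daily_wages": daily_wages,
        "hardware": hardware,
        "direct_total": (wages_total[0] + hardware[0],
                         wages_total[1] + hardware[1]),
    }

est = ballpark()
for key, value in est.items():
    print(key, value)
```

With these inputs the daily wage bill matches the $100K to $200K figure, and the collection run takes 27 days. The strict lower bound on collectors comes out a bit above the rounder ~1,500 quoted, and direct costs land around $3.5M to $6.65M, so the gap up to $4M to $8M would presumably be overhead.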
Not bad considering the alternatives.
For example, teleoperation is way less efficient - it's 5x-6x slower than human demos, and 2x-3x more expensive per hour of operator time. But it could become feasible after low-level and mid-level manipulation and task planning are solved.
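If the two teleoperation penalties compound (an assumption: each demo takes 5x-6x as long, and each of those hours costs 2x-3x as much), the cost per demonstration works out as a simple product. A tiny sketch, with a hypothetical function name:

```python
def teleop_cost_multiplier(slowdown=(5, 6), hourly_premium=(2, 3)):
    """Cost per demo relative to human demos, assuming the two factors multiply."""
    low = slowdown[0] * hourly_premium[0]
    high = slowdown[1] * hourly_premium[1]
    return low, high

multiplier = teleop_cost_multiplier()
print(multiplier)
```

Under that reading, teleoperated data would cost roughly 10x to 18x more per demonstration than handheld-gripper data.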
Thinking about it, I'm reminded of various "additive training" tricks. Teach an AI to do A, and then to do B, and it might just generalize that to doing A+B with no extra training. Works often enough on things like LLMs.
In this case, we use non-robot data to teach an AI how to do diverse tasks, and robot-specific data (real or sim) to teach an AI how to operate a robot body. Which might generalize well enough to "doing diverse tasks through a robot body".
Calling UMI an "exoskeleton" might be a stretch, but the principle is the same - humans use a kinematically matched, instrumented end effector to collect data that can be trivially replayed on the robot.
tyushk•2mo ago
krasin•2mo ago