Today, we're open-sourcing the whole stack, along with Stera-10M, a 200+ hour, 10M+ frame dataset captured entirely through it.
FPV Labs began with one bet: scaling laws for embodied AI will need high-fidelity, multimodal, real-world data, and the infrastructure that produces it at scale without compromising downstream quality will determine how fast we can build a general-purpose model.
Over the last 12 months, we've seen high-fidelity data locked behind gated hardware like Aria, out of reach for the researchers, builders, and startups that want to work with high-quality multimodal data, and we've watched every single data lab rebuild the same harness for its own fleet.
The result has been immense data fragmentation and a race to collect data that is either low-fidelity or built on a heterogeneous stack, with multiple trade-offs in data quality.
Stera removes the need for gated hardware: it turns a commodity iPhone into a high-fidelity spatial data capture engine and provides open-source tooling via the Stera SDK to read, process, and export the results for downstream evaluation and training.
It fuses RGB, IMU, depth, LiDAR-guided depth, and 6-DoF poses from ARKit out of the box, and the Stera SDK processes them into high-fidelity 4D data with spatial, semantic, action, and temporal understanding of the world.
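To make that concrete, here is a minimal sketch of what reading a captured session could look like. The package name (`stera`), `load_session`, and every field and method below are assumptions made for illustration, not the published SDK API.

```python
# Hypothetical sketch of reading a Stera session; the module name, load_session,
# and all field/method names are assumptions, not the actual SDK surface.
import numpy as np
import stera  # assumed package name

session = stera.load_session("sessions/kitchen_demo")  # assumed helper

for frame in session.frames:            # assumed: time-ordered fused frames
    rgb   = frame.rgb                   # (H, W, 3) uint8 image, what the wearer sees
    depth = frame.depth                 # (H, W) float32, LiDAR-guided depth in meters
    pose  = frame.camera_pose           # (4, 4) world-from-camera transform (6-DoF)
    imu   = frame.imu                   # accelerometer + gyro samples for this frame
    hands = frame.hand_joints           # (2, 21, 3) hand joints in the global frame

    # e.g. the camera origin expressed in the world frame
    cam_origin_world = pose @ np.array([0.0, 0.0, 0.0, 1.0])

# assumed export helper for downstream eval and training pipelines
session.export("out/kitchen_demo.parquet", format="parquet")
```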
Each session from Stera includes:
- what the wearer sees
- how the camera moves through space (6-DoF pose)
- how the hands move (21-joint hand pose, anchored in a global frame)
- what the depth geometry looks like (per-frame depth and a session-level room mesh)
- what the IMU measures
- what task, sub-goal, episode, atomic action, and objects are involved (hierarchical instruction tree)
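A rough sketch of how a per-session record mirroring the list above might be laid out; every class and field name here is an assumption for illustration, not the real schema.

```python
# Hypothetical per-session layout; all names are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class SteraFrame:
    rgb: np.ndarray          # (H, W, 3) what the wearer sees
    depth: np.ndarray        # (H, W) per-frame metric depth
    camera_pose: np.ndarray  # (4, 4) 6-DoF world-from-camera transform
    hand_joints: np.ndarray  # (2, 21, 3) left/right hands, 21 joints, global frame
    imu: np.ndarray          # (N, 6) accel + gyro samples around this frame

@dataclass
class InstructionNode:
    label: str                        # task, sub-goal, episode, or atomic action
    objects: list[str]                # objects involved at this level
    start_frame: int
    end_frame: int
    children: list["InstructionNode"]

@dataclass
class SteraSession:
    frames: list[SteraFrame]
    room_mesh_vertices: np.ndarray    # session-level room mesh geometry
    room_mesh_faces: np.ndarray
    instructions: InstructionNode     # hierarchical instruction tree
```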
We are also releasing Stera-10M publicly today. It was collected entirely through the Stera stack, so anyone can explore our datasets and reproduce them without rebuilding any of the harness from scratch.
Think of Stera as a unified interface for the long tail of embodied AI research, allowing anyone to become a data lab today.
Downstream applications include, but are not limited to: pre- and mid-training for VLAs, world models, and world action models; action recognition and temporal segmentation; hand-object interaction modeling; human-to-robot motion retargeting; real-to-sim reconstruction; and much more.
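As one concrete example of the VLA use case, here is a minimal sketch of turning a session into (observation, action) pairs, where the action is simply the relative camera motion and next-frame hand pose. It builds on the hypothetical schema sketched above and is one of many possible formulations, not a prescribed pipeline.

```python
# Sketch of deriving (observation, action) pairs for VLA-style pre-training
# from the hypothetical SteraSession above; the action definition here is
# an illustrative choice, not the recommended one.
import numpy as np

def to_pretraining_pairs(session):
    pairs = []
    for prev, nxt in zip(session.frames[:-1], session.frames[1:]):
        # relative transform from the previous camera frame to the next one
        rel_pose = np.linalg.inv(prev.camera_pose) @ nxt.camera_pose
        obs = {"rgb": prev.rgb, "depth": prev.depth, "hands": prev.hand_joints}
        act = {"delta_pose": rel_pose, "next_hands": nxt.hand_joints}
        pairs.append((obs, act))
    return pairs
```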
The data foundation for embodied AI should be open and accessible to every researcher and builder.
Link: https://www.fpvlabs.ai/essays/launching-stera