The problem: robot learning datasets contain bad demos (jerky movements, hesitation, inconsistent timing). Training on these hurts policy performance. Manual review doesn't scale.
pip install democlean
democlean analyze lerobot/pusht
democlean scores each episode by the mutual information (MI) between states and actions, i.e., how predictable the actions are given the states. Smooth, purposeful motion scores high; jerky, inconsistent motion scores low.

Validation: I correlated MI scores with motion metrics on lerobot/pusht (human teleoperation data). High-MI episodes had 12% lower jerk (p=0.02) and 24% higher state-action correlation (p=0.03). I did not train policies to measure downstream improvement.
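For anyone curious what the scoring boils down to, here's a minimal sketch of a raw KSG estimate of I(states; actions) for one episode (the estimator mentioned in the limitations below). This is my illustration of the general technique, not democlean's actual code; the `ksg_mi` name and the `k=3` default are my assumptions.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma

    def ksg_mi(states, actions, k=3):
        """KSG estimator (Kraskov et al. 2004, alg. 1) of I(states; actions).

        states: (N, d_s) array, actions: (N, d_a) array, one row per timestep.
        Higher MI = actions are more predictable from states.
        """
        n = len(states)
        joint = np.hstack([states, actions])
        # Distance to each point's k-th nearest neighbor in the joint space (max-norm).
        eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
        # Shrink the radius one ulp so marginal counts are strictly inside it,
        # as the KSG estimator requires.
        eps = np.nextafter(eps, 0)
        s_tree, a_tree = cKDTree(states), cKDTree(actions)
        n_s = np.array([len(s_tree.query_ball_point(states[i], eps[i], p=np.inf)) - 1
                        for i in range(n)])
        n_a = np.array([len(a_tree.query_ball_point(actions[i], eps[i], p=np.inf)) - 1
                        for i in range(n)])
        return digamma(k) + digamma(n) - np.mean(digamma(n_s + 1) + digamma(n_a + 1))

    # Hypothetical usage: one score per episode, then rank and drop the low tail.
    # scores = [ksg_mi(ep["observation.state"], ep["action"]) for ep in episodes]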
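On the validation side, "jerk" is the third time derivative of the trajectory. A per-episode version can be as simple as this sketch (my assumption: actions are position targets sampled at a fixed rate; the 30 Hz default is a placeholder, adjust to the dataset's actual control rate):

    import numpy as np

    def mean_abs_jerk(actions, dt=1.0 / 30):
        """Mean magnitude of the third finite difference of the trajectory.

        actions: (T, d) array of commanded positions, sampled every dt seconds.
        """
        jerk = np.diff(actions, n=3, axis=0) / dt**3
        return np.abs(jerk).mean()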
Limitations I want to be upfront about:
- MI correlates with episode length (r≈0.8). Longer episodes score higher.
- This measures motion smoothness, not task success.
- Works best with 50+ episodes from a single task.
- Inspired by DemInf (Hejna et al., RSS 2025) but uses raw KSG estimation instead of their VAE pipeline. Simpler, probably less accurate for high-dimensional observations.
Complements score_lerobot_episodes, which catches visual issues (blur, lighting); democlean catches behavioral issues.
GitHub: https://github.com/dipampaul17/democlean
Happy to answer questions about the approach or validation.