I've been working on reproducing the UMI paper (https://umi-gripper.github.io/) and their code. I've been relatively successful so far (see attached videos): most of the time the arm is able to pick up the cup, but it drops it from a higher-than-desired height over the saucer. I'm using their published code and model checkpoint.
I've tried several approaches to address the issue, including:
Adjusting lighting.
Tweaking latency configurations (see the measurement sketch after this list).
Enabling/disabling image processing from the mirrors.
I still haven’t been able to solve it.
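For the latency angle, here's roughly how I've been sanity-checking my settings: cross-correlate the commanded gripper width against the measured width and see how many control steps the measurement lags. This is a minimal sketch over synthetic data, not UMI's actual calibration code, and the 10 Hz control rate is an assumption about my own setup:

    import numpy as np

    # Minimal latency sanity check (not UMI's calibration code).
    # Cross-correlate commanded vs. measured gripper width to find the
    # lag, in control steps, that best aligns the two signals.
    CONTROL_HZ = 10  # assumption about my control rate

    def estimate_latency_steps(commanded, measured):
        c = commanded - commanded.mean()
        m = measured - measured.mean()
        xcorr = np.correlate(m, c, mode="full")
        return int(np.argmax(xcorr) - (len(c) - 1))

    # Synthetic example: measurement trails command by 3 steps (~300 ms).
    t = np.linspace(0, 10, 100)
    cmd = np.sin(t)
    meas = np.roll(cmd, 3)
    lag = estimate_latency_steps(cmd, meas)
    print(f"estimated latency: {lag} steps ({lag / CONTROL_HZ * 1000:.0f} ms)")

If the lag this prints disagrees with the latency values set in the eval config, the policy is effectively acting on stale observations, which could plausibly shift the release point.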
My intuition is that the problem might be one of the following:
Model overfitting to the training cups. The exact list of cups used in training isn't published. After reviewing the dataset, I see a red cup/saucer set, but I suspect its relative size differs from mine, so the model may be misjudging the moment to release the cup.
The model might need fine-tuning with episodes recorded in my own environment using my specific cup/saucer set (rough sketch after this list).
My gripper might lack the precision the original system had.
Residual jitter in the arm or gripper could also be contributing (also sketched below).
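On the fine-tuning hypothesis, the plan would be to warm-start from the published checkpoint and continue training on a handful of episodes recorded with my cup/saucer set. Below is only a shape-of-the-idea sketch in plain PyTorch: ToyPolicy, the checkpoint path, and the MSE behavior-cloning loss are all stand-ins (the real UMI policy is a diffusion policy with its own training loss); none of these names come from the UMI codebase.

    import torch
    import torch.nn as nn

    # Warm-start sketch. ToyPolicy is a stand-in for the real
    # diffusion-policy network; no names here come from the UMI repo.
    class ToyPolicy(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 7))

        def forward(self, obs):
            return self.net(obs)

    policy = ToyPolicy()
    # Real workflow: load the published checkpoint here, e.g.
    # policy.load_state_dict(torch.load("umi_ckpt.ckpt")["state_dict"])

    # Small learning rate so my few episodes nudge, not overwrite, the policy.
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
    criterion = nn.MSELoss()  # plain BC loss for brevity, not UMI's diffusion loss

    obs = torch.randn(16, 32)  # stand-ins for (observation, action) batches
    act = torch.randn(16, 7)
    for step in range(100):
        loss = criterion(policy(obs), act)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()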
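To put a number on the jitter hypothesis: command the arm to hold still above the saucer, log TCP positions for a while, and compare the spread against the tolerance the release needs. The pose array below is synthetic stand-in data; in practice it would come from the robot's state stream.

    import numpy as np

    # Jitter check: hold the arm still, log TCP positions, measure spread.
    # The pose array below is synthetic stand-in data.
    rng = np.random.default_rng(0)
    poses_xyz = 0.0005 * rng.standard_normal((500, 3))  # 500 samples, meters

    std_mm = poses_xyz.std(axis=0) * 1000
    p2p_mm = (poses_xyz.max(axis=0) - poses_xyz.min(axis=0)) * 1000
    print("per-axis std (mm):         ", std_mm.round(3))
    print("per-axis peak-to-peak (mm):", p2p_mm.round(3))
    # If peak-to-peak motion is within a millimeter or two, jitter probably
    # isn't what's causing a release several centimeters too high.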
Other thoughts:
Depth estimation may be a bottleneck. Adding a depth camera or a secondary camera for stereo vision might help, but would likely require retraining the model from scratch (a cheap monocular-depth probe is sketched after this list).
Adding contact information could also improve performance, either via touch sensors or by borrowing ideas from ManiWAV (https://mani-wav.github.io/), which uses a microphone mounted on the finger (audio sketch below).
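On the depth idea: before buying hardware, a cheap probe is to run an off-the-shelf monocular depth model on saved wrist-camera frames and eyeball whether the cup/saucer depth ordering is even recoverable. Here's a sketch using MiDaS via torch.hub; "frame.png" is a placeholder for a saved frame, this is exploratory only (nothing the UMI pipeline consumes), and the wrist camera's fisheye distortion will degrade it.

    import cv2
    import torch

    # Exploratory probe: relative (not metric) depth from a single frame
    # using MiDaS. "frame.png" is a placeholder for a saved wrist-cam image.
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

    img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
    batch = transforms.small_transform(img)
    with torch.no_grad():
        pred = midas(batch)
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().numpy()
    # Inspect depth at cup vs. saucer pixels; MiDaS output is relative,
    # so only the ordering/ratio is meaningful, not absolute distance.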
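And on the ManiWAV direction: even without retraining, a finger-mounted microphone gives a crude contact signal that could gate the release (only open the gripper once contact with the saucer is heard). A short-time-energy sketch with the sounddevice library; the sample rate, block size, and threshold are guesses for my setup, not values from ManiWAV.

    import numpy as np
    import sounddevice as sd

    # Crude contact detector inspired by ManiWAV's finger microphone:
    # flag contact when short-time RMS energy jumps above a threshold.
    # SAMPLE_RATE, BLOCK, and THRESHOLD are guesses for my setup.
    SAMPLE_RATE = 16000
    BLOCK = 512       # ~32 ms per block
    THRESHOLD = 0.02  # tune against ambient noise

    def on_block(indata, frames, time, status):
        rms = float(np.sqrt(np.mean(indata[:, 0] ** 2)))
        if rms > THRESHOLD:
            print(f"possible contact, rms={rms:.4f}")

    with sd.InputStream(samplerate=SAMPLE_RATE, blocksize=BLOCK,
                        channels=1, callback=on_block):
        sd.sleep(5000)  # listen for 5 seconds

The threshold would need tuning against the arm's own motor noise.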
If anyone has been more successful with this setup, I’d love to exchange notes.