IGOR: Image-GOal Representations
We introduce IGOR, a framework that learns latent actions from Internet-scale videos, enabling cross-embodiment and cross-task generalization.
IGOR learns a unified latent action space for humans and robots by compressing the visual changes between an image and its goal state, using data from both robot and human activities. By labeling video data with latent actions, IGOR facilitates the learning of a foundation policy and world model from internet-scale human video data covering a diverse range of embodied AI tasks. With a semantically consistent latent action space, IGOR enables human-to-robot generalization. The foundation policy model acts as a high-level controller at the latent action level, which is then integrated with a low-level policy to achieve effective robot control.
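To make the two-level design concrete, here is a minimal PyTorch sketch of the pipeline described above: a latent action encoder compresses an (observation, goal) image pair into a latent action, a foundation policy predicts latent actions as a high-level controller, and a low-level policy turns a latent action into a robot action. All module names, layer sizes, and the 32-dimensional latent are illustrative assumptions, not the actual IGOR architecture.

```python
import torch
import torch.nn as nn

LATENT_DIM = 32  # assumed size of a latent action

class LatentActionEncoder(nn.Module):
    """Compresses the visual change between an observation and its goal image."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a visual encoder
            nn.Conv2d(6, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(LATENT_DIM)

    def forward(self, obs, goal):
        # Concatenate the image pair along channels and encode their difference.
        return self.head(self.backbone(torch.cat([obs, goal], dim=1)))

class FoundationPolicy(nn.Module):
    """High-level controller: predicts the next latent action from the observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))

    def forward(self, obs):
        return self.net(obs)

class LowLevelPolicy(nn.Module):
    """Maps an observation plus a latent action to a robot action (e.g., 7-DoF)."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        self.head = nn.Linear(256 + LATENT_DIM, action_dim)

    def forward(self, obs, latent_action):
        return self.head(torch.cat([self.obs_encoder(obs), latent_action], dim=-1))

# Usage with dummy images: label an image pair, predict a latent action, control.
obs, goal = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
z = LatentActionEncoder()(obs, goal)   # latent action labeling the image pair
z_hat = FoundationPolicy()(obs)        # high-level prediction of the next latent action
a = LowLevelPolicy()(obs, z_hat)       # low-level robot action conditioned on z_hat
```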
Our pretraining dataset comprises around 2.8M trajectories and video clips, where each trajectory contains a language instruction and a sequence of observations. The data are curated from several public robot and human activity datasets.
IGOR learns similar latent actions for image pairs with semantically similar visual changes. On the out-of-distribution RT-1 dataset, we observe that pairs with similar embeddings exhibit similar visual changes and correspond to semantically similar sub-tasks, for example, “open the gripper”, “move left”, and “close the gripper”. Furthermore, we observe that latent actions are shared across different tasks specified by language instructions, thereby facilitating broader generalization.
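The observation above can be reproduced with a simple nearest-neighbor retrieval in latent action space. The sketch below assumes `encoder` is a trained latent action encoder as in the earlier sketch and `pairs` is a list of (observation, goal) image tensors from a held-out dataset such as RT-1; both names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nearest_pairs(encoder, pairs, query_idx, k=5):
    """Return indices of the k image pairs whose latent actions are most similar to the query."""
    with torch.no_grad():
        z = torch.stack([encoder(o.unsqueeze(0), g.unsqueeze(0)).squeeze(0) for o, g in pairs])
    sims = F.cosine_similarity(z[query_idx].unsqueeze(0), z, dim=-1)
    sims[query_idx] = -1.0                      # exclude the query itself
    return sims.topk(k).indices.tolist()        # pairs expected to show similar sub-tasks
```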
IGOR successfully “migrates” object movements from one video to other videos. By applying latent actions extracted from one video, we use the world model to generate new videos with similar movements applied to different objects. We observe that latent actions are semantically consistent across tasks involving different objects.
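A minimal sketch of this motion migration, assuming `encoder(obs, goal)` returns a latent action and `world_model(frame, z)` predicts the next frame given a latent action; both interfaces are illustrative assumptions rather than the released API.

```python
import torch

def migrate_motion(encoder, world_model, source_frames, new_initial_frame):
    """Re-apply a source video's latent actions to a different scene."""
    frame = new_initial_frame
    generated = [frame]
    with torch.no_grad():
        for obs, goal in zip(source_frames[:-1], source_frames[1:]):
            z = encoder(obs, goal)           # latent action between consecutive source frames
            frame = world_model(frame, z)    # apply the same latent action to the new scene
            generated.append(frame)
    return generated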
Impressively, IGOR learns latent actions that are semantically consistent across humans and robots. With only one demonstration, IGOR can successfully migrate human behaviors to robot arms through latent actions alone, which opens up new possibilities for few-shot human-to-robot transfer and control.
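One-shot transfer can be sketched in the same spirit: label the human demonstration with latent actions and execute each one with the robot's low-level policy. The names `encoder`, `low_level_policy`, and the `robot_env.observe()`/`robot_env.step()` interface are hypothetical placeholders.

```python
import torch

def transfer_human_demo(encoder, low_level_policy, robot_env, human_frames):
    """Replay a human demonstration on a robot arm through latent actions only."""
    with torch.no_grad():
        for obs, goal in zip(human_frames[:-1], human_frames[1:]):
            z = encoder(obs, goal)              # latent action from the human video
            robot_obs = robot_env.observe()     # current robot camera image
            action = low_level_policy(robot_obs, z)
            robot_env.step(action)              # execute on the robot arm
```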
IGOR learns to control the movements of individual objects separately in scenes with multiple objects. The effects of applying different latent actions are presented in the video: (a, b) move the apple, (c, d) move the tennis ball, and (e, f) move the orange.
IGOR can follow language instructions by iteratively rolling out the foundation policy and the world model. Starting from the same initial image, IGOR can generate diverse behaviors in videos that follow different instructions through latent actions.
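The rollout alternates between the two models, as in the sketch below. It assumes a foundation policy `policy(instruction, obs)` that returns a latent action and a world model `world_model(obs, z)` that returns the next imagined observation; both signatures are illustrative.

```python
import torch

def rollout(policy, world_model, instruction, initial_obs, horizon=16):
    """Generate an imagined video that follows the instruction via latent actions."""
    obs, frames = initial_obs, [initial_obs]
    with torch.no_grad():
        for _ in range(horizon):
            z = policy(instruction, obs)     # next latent action (sub-task) to execute
            obs = world_model(obs, z)        # imagined next observation
            frames.append(obs)
    return frames
```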
We find that the IGOR framework improves policy learning in a low-data regime on the Google robot tasks in the SIMPLER simulator, shown in (a), potentially due to its ability to predict the next sub-task by leveraging internet-scale data, thereby enabling sub-task-level generalization.
We also observe that image-goal pairs with similar latent actions are associated with similar low-level robot actions, shown in (b). Our experiments indicate that IGOR's learned action space captures more information about robot arm movements than about arm rotations and gripping.
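One way to quantify the observation in (b) is to check how close the ground-truth robot actions of a pair's latent-action neighbors are. The sketch below assumes pre-computed arrays `latents` (N x d latent actions) and `robot_actions` (N x a low-level actions); a smaller returned distance indicates that similar latent actions map to similar robot actions.

```python
import numpy as np

def action_similarity_of_neighbors(latents, robot_actions, k=5):
    """Mean robot-action distance between each pair and its k nearest latent-action neighbors."""
    latents = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sims = latents @ latents.T                    # cosine similarity between latent actions
    np.fill_diagonal(sims, -np.inf)               # exclude self-matches
    neighbors = np.argsort(-sims, axis=1)[:, :k]  # k nearest pairs in latent-action space
    dists = [np.linalg.norm(robot_actions[i] - robot_actions[neighbors[i]], axis=1).mean()
             for i in range(len(latents))]
    return float(np.mean(dists))
```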