Robots that learn from videos of human activities and simulated interactions

Optimistic science fiction typically imagines a future where humans create art and pursue fulfilling pastimes while AI-enabled robots handle dull or dangerous tasks. In contrast, the AI systems of today display increasingly sophisticated generative abilities on ostensibly creative tasks. But where are the robots? This gap is known as Moravec’s paradox, the thesis that the hardest problems in AI involve sensorimotor skills, not abstract thought or reasoning. To put it another way, “The hard problems are easy, and the easy problems are hard.”

Today, we are announcing two major advancements toward general-purpose embodied AI agents capable of performing challenging sensorimotor skills:

  • An artificial visual cortex (called VC-1): a single perception model that, for the first time, supports a diverse range of sensorimotor skills, environments, and embodiments. VC-1 is trained on videos of people performing everyday tasks from the groundbreaking Ego4D dataset created by Meta AI and academic partners. And VC-1 matches or outperforms best-known results on 17 different sensorimotor tasks in virtual environments.

  • A new approach called adaptive (sensorimotor) skill coordination (ASC), which achieves near-perfect performance (98 percent success) on the challenging task of robotic mobile manipulation (navigating to an object, picking it up, navigating to another location, placing the object, repeating) in physical environments.

Data powers both of these breakthroughs. AI needs data to learn from — and, specifically, embodied AI needs data that captures interactions with the environment. Traditionally, this interaction data comes either from collecting large numbers of demonstrations or from letting the robot learn through interaction from scratch. Both approaches are too resource-intensive to scale toward learning a general embodied AI agent. In both of these works, we develop new ways for robots to learn: from videos of human interactions with the real world, and from simulated interactions within photorealistic simulated worlds.

First, we’ve built a way for robots to learn from real-world human interactions, by training a general-purpose visual representation model (an artificial visual cortex) from a large number of egocentric videos. The videos include our open source Ego4D dataset, which shows first-person views of people doing everyday tasks, like going to the grocery store and cooking lunch. Second, we’ve built a way to pretrain our robot to perform long-horizon rearrangement tasks in simulation. Specifically, we train a policy in Habitat environments and transfer the policy zero-shot to a real Spot robot to perform such tasks in unfamiliar real-world spaces.

Toward an artificial visual cortex for embodied intelligence

A visual cortex is the region of the brain that (together with the motor cortex) enables an organism to convert vision into movement. We are interested in developing an artificial visual cortex — the module in an AI system that enables an artificial agent to convert camera input into actions.

Our FAIR team, together with academic collaborators, has been at the forefront of developing general-purpose visual representations for embodied AI trained from egocentric video datasets. The Ego4D dataset has been especially useful, since it contains thousands of hours of wearable camera video from research participants around the world performing daily life activities, including cooking, cleaning, sports, and crafts.

For instance, one prior work from our team (R3M) uses temporal and text-video alignment within Ego4D video frames to learn compact universal visual representations for robotic manipulation. Another work (VIP) uses Ego4D frames to learn an effective actionable visual representation that can also perform zero-shot visual reward-specification for training embodied agents. These are illustrative of the broader trend in the research community (e.g., PVR, OVRL, and MVP) toward pretraining visual representations from web images and egocentric videos.

Although prior work has focused on a small set of robotic tasks, a visual cortex for embodied AI should work well for a diverse set of sensorimotor tasks in diverse environments across diverse embodiments. While prior works on pretraining visual representations give us a glimpse of what may be feasible, they are fundamentally incommensurable, with different ways of pretraining the visual representations on different datasets, evaluated on different embodied AI tasks. The lack of consistency meant there was no way of knowing which of the existing pretrained visual representations were best.

As a first step, we curated CortexBench, consisting of 17 different sensorimotor tasks in simulation, spanning locomotion, navigation, and dexterous and mobile manipulation, implementing the community standard for learning the policy for each task. The visual environments span from flat infinite planes to tabletop settings to photorealistic 3D scans of real-world indoor spaces. The agent embodiments vary from stationary arms to dexterous hands to idealized cylindrical navigation agents to articulated mobile manipulators. The learning conditions vary from few-shot imitation learning to large-scale reinforcement learning. This allowed us to perform a rigorous and consistent evaluation of existing and new pretrained models. Prior to our work, the best performance for each task in CortexBench was achieved by a model or algorithm specifically designed for that task. In contrast, what we want is one model and/or algorithm that achieves competitive performance on all tasks. Biological organisms have one general-purpose visual cortex, and that is what we seek for embodied agents.
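The point of such a benchmark is that every candidate representation is scored under one shared harness, so results are directly comparable. Here is a minimal sketch of that idea; the task names, protocol labels, and toy encoder below are all illustrative stand-ins, not the actual CortexBench code:

```python
from typing import Callable, Dict, List

# Each benchmark task pairs an environment with the fixed,
# community-standard protocol used to learn a policy on top of
# the candidate visual representation. Names are illustrative.
TASKS: Dict[str, str] = {
    "adroit_pen": "few_shot_imitation",
    "metaworld_assembly": "few_shot_imitation",
    "habitat_imagenav": "large_scale_rl",
    "dmc_walker_walk": "large_scale_rl",
}

def evaluate_encoder(
    encode: Callable[[List[float]], List[float]],
    rollout_success: Callable[[str, str, Callable], float],
) -> Dict[str, float]:
    """Score one encoder on every task under its fixed protocol."""
    return {
        task: rollout_success(task, protocol, encode)
        for task, protocol in TASKS.items()
    }

# Toy stand-ins so the sketch runs end-to-end.
def toy_encoder(obs: List[float]) -> List[float]:
    return [sum(obs) / len(obs)]          # 1-D "embedding"

def toy_rollout(task: str, protocol: str, encode) -> float:
    emb = encode([0.5] * 4)
    return min(1.0, abs(emb[0]) + 0.01 * len(task))  # dummy success rate

scores = evaluate_encoder(toy_encoder, toy_rollout)
mean_success = sum(scores.values()) / len(scores)
```

Because the per-task learning protocol is fixed, swapping in a different `encode` function is the only variable, which is what makes cross-model comparisons meaningful.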

We set out to pretrain a single general-purpose visual cortex that can perform well on all of these tasks. A critical choice for pretraining is the choice of dataset. It was entirely unclear what a good pretraining dataset for embodied AI would look like. There are massive amounts of video data available online, yet it isn’t practical to try out all combinations of those existing datasets.

We start with Ego4D as our core dataset and then explore whether adding additional datasets improves pretrained models. Having egocentric video is important because it enables robots to learn to see from a first-person perspective. Since Ego4D is heavily focused on everyday activities like cooking, gardening, and crafting, we also consider egocentric video datasets that explore houses and apartments. Finally, we also study whether static image datasets help improve our models.

Cumulatively, our work represents the largest and most comprehensive empirical study to date of visual representations for embodied AI, spanning 5+ pretrained visual representations from prior work, and multiple ablations of VC-1 trained on 4,000+ hours of egocentric human video from seven diverse datasets, which required over 10,000 GPU-hours of training and evaluation.

Today, we are open-sourcing VC-1, our best visual cortex model, following FAIR’s values of open research for the benefit of all. Our results show that VC-1 representations match or outperform learning from scratch on all 17 tasks. We also find that adapting VC-1 on task-relevant data makes it competitive with or better than the best-known results on all tasks in CortexBench. To the best of our knowledge, VC-1 is the first pretrained visual model shown to be competitive with state-of-the-art results on such a diverse set of embodied AI tasks. We share our detailed learnings, such as how scaling model size, dataset size, and diversity affects the performance of pretrained models, in a related research paper.
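The intended usage pattern of an artificial visual cortex is that the pretrained encoder stays frozen while only a small task-specific policy head is learned per task. The sketch below illustrates that split with a toy stand-in encoder; it is not the real VC-1 ViT or its loading API, which ship in the open-sourced code:

```python
import random

random.seed(0)
EMBED_DIM, NUM_ACTIONS = 8, 4

def frozen_cortex(image: list) -> list:
    """Stand-in for the pretrained visual cortex: a fixed,
    deterministic pixels -> embedding map that is never updated."""
    return [sum(image[i::EMBED_DIM]) for i in range(EMBED_DIM)]

class PolicyHead:
    """Tiny linear head on top of the frozen embedding.
    This is the only part trained (or adapted) per task."""
    def __init__(self):
        self.w = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
                  for _ in range(NUM_ACTIONS)]

    def act(self, embedding: list) -> int:
        logits = [sum(wi * e for wi, e in zip(row, embedding))
                  for row in self.w]
        return logits.index(max(logits))   # greedy action

image = [random.random() for _ in range(64)]   # fake 64-pixel observation
action = PolicyHead().act(frozen_cortex(image))
```

Keeping the cortex frozen is what lets one perception model serve many embodiments and tasks; "adapting VC-1 on task-relevant data" corresponds to additionally fine-tuning the encoder itself.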

Adaptive skill coordination for robotic mobile manipulation

While VC-1 demonstrates strong performance on sensorimotor skills in CortexBench, these are short-horizon tasks (navigating, picking up an object, in-hand manipulation of an object, etc.). The next generation of embodied AI agents (deployed on robots) will also need to accomplish long-horizon tasks and adapt to new and changing environments, including unexpected real-world disturbances.

Our second announcement focuses on mobile pick-and-place — a robot is initialized in a new environment and tasked with moving objects from initial to desired locations, emulating the task of tidying a house. The robot must navigate to a receptacle with objects, like the kitchen counter (the approximate location is provided to it), search for and pick an object, navigate to its desired place receptacle, place the object, and repeat.

To tackle such long-horizon tasks, we and our collaborators at Georgia Tech developed a new technique called Adaptive Skill Coordination (ASC), which consists of three components:

  • A library of basic sensorimotor skills (navigation, pick, place)

  • A skill coordination policy that chooses which skills are appropriate to use at which time

  • A corrective policy that adapts pretrained skills when out-of-distribution states are perceived
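The three components above can be sketched as a minimal control loop. The state representation, function names, and one-line skills below are our own illustrative stand-ins, not the ASC codebase:

```python
# Skill library: each skill maps an abstract state to a new state.
SKILLS = {
    "navigate": lambda s: {**s, "at_target": True},
    "pick":     lambda s: {**s, "holding": True, "at_target": False},
    "place":    lambda s: {**s, "holding": False, "done": True},
}

def coordinator(state: dict) -> str:
    """Skill coordination policy: choose which skill to run next."""
    if not state["at_target"]:
        return "navigate"
    return "place" if state["holding"] else "pick"

def corrective_policy(state: dict) -> dict:
    """Adapt when an out-of-distribution state is perceived
    (e.g., a blocked path), then hand control back to the skills."""
    return {**state, "ood": False}

def run_episode(state: dict, max_steps: int = 10) -> dict:
    for _ in range(max_steps):
        if state.get("done"):
            break
        if state.get("ood"):            # OOD detected: correct first
            state = corrective_policy(state)
            continue
        state = SKILLS[coordinator(state)](state)
    return state

final = run_episode({"at_target": False, "holding": False, "ood": True})
```

The key design choice is the separation of concerns: skills stay simple and pretrained, while the coordinator handles sequencing and the corrective policy handles recovery, which is what enables long-horizon behavior from short-horizon parts.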

All sensorimotor policies are “model-free.” We use sensor-to-actions neural networks with no task-specific modules, like mapping or planning. The robot is trained entirely in simulation and transferred to the physical world without any real-world training data.

We demonstrate the effectiveness of ASC by deploying it on the Boston Dynamics Spot robot in new, unknown real-world environments. We chose Spot because of its robust sensing, navigation, and manipulation capabilities. However, operating Spot today involves a large amount of human intervention. For example, picking up an object requires a person to click on the object on the robot’s tablet. Our aim is to build AI models that take the robot’s onboard sensing as input and issue motor commands through the Boston Dynamics APIs.

Using the Habitat simulator and the HM3D and ReplicaCAD datasets, which include indoor 3D scans of 1,000 homes, we teach a simulated Spot robot to move around an unseen house, pick up out-of-place objects, and put them in the right location. We then deploy this policy zero-shot in the real world (sim2real), without explicitly building a map; instead, the robot relies on its learned notion of what houses look like.

When we put our work to the test, we used two significantly different real-world environments where Spot was asked to rearrange a variety of objects — a fully furnished 185-square-meter apartment and a 65-square-meter university lab. Overall, ASC achieved near-perfect performance, succeeding on 59 of 60 (98 percent) episodes, overcoming hardware instabilities, picking failures, and adversarial disturbances like moving obstacles or blocked paths. In comparison, traditional baselines like task and motion planning succeed in only 73 percent of cases, because of an inability to recover from real-world disturbances. We also study robustness to adversarial perturbations, such as changing the layout of the environment, walking in front of the robot to repeatedly block its path, or moving target objects mid-episode. Despite being trained entirely in simulation, ASC is robust to such disturbances, making it well suited for many long-horizon problems in robotics and reinforcement learning.

This opens avenues for sim2real research to expand to even more challenging real-world tasks, such as assistance in everyday tasks like cooking and cleaning, and even human-robot collaboration. Our work is a step toward scalable, robust, and diverse robot assistants of the future that can operate in new environments out of the box and do not require expensive real-world data collection.

Rethinking sim2real transfer

One of the most important tasks in sim2real learning is to build simulation models that truthfully reflect the robot’s behavior in the real world. However, this is challenging, since the real world is vast and constantly changing, and the simulator needs to capture this diversity. No simulator is a perfect replica of reality, and the main challenge is overcoming the gap between the robot’s performance in simulation and in the real world. The default operating hypothesis of this field is that reducing the sim2real gap involves creating simulators of high physics fidelity and using them to learn robot policies.

Over the past year, we have taken a counterintuitive approach to sim2real. Instead of building a high-fidelity simulation of the world, we built an abstracted simulator of Spot that does not model low-level physics, and learned a policy that reasons at a higher level (where to go rather than how to move the legs). We call this a kinematic simulation: the robot is teleported to a location, and the target object is attached to the robot arm when it is close to the gripper and in view. In the real world, Boston Dynamics controllers are used to execute the actions commanded by this high-level policy.
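A kinematic simulation step in this spirit can be sketched as follows: no contact physics or leg dynamics, a navigation action that simply teleports the base, and a grasp that succeeds by attaching the object once it is within reach and in view. The state layout, action format, and threshold are illustrative assumptions, not values from the paper:

```python
import math

GRASP_RANGE = 0.8   # metres; an illustrative threshold, not the paper's

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def kinematic_step(state: dict, action: dict) -> dict:
    """One step of an abstracted, physics-free 'kinematic' simulator."""
    state = dict(state)
    if action["kind"] == "teleport":
        # The high-level policy decides where to go; at deployment,
        # low-level locomotion is delegated to the robot's controllers.
        state["base"] = action["target"]
    elif action["kind"] == "grasp":
        close = dist(state["base"], state["object"]) < GRASP_RANGE
        if close and state.get("object_in_view", True):
            state["held"] = True            # attach object to the arm
            state["object"] = state["base"]  # object now moves with base
    return state

s = {"base": (0.0, 0.0), "object": (3.0, 0.0), "held": False}
s = kinematic_step(s, {"kind": "grasp"})                  # too far: no-op
s = kinematic_step(s, {"kind": "teleport", "target": (3.0, 0.5)})
s = kinematic_step(s, {"kind": "grasp"})                  # within range
```

Because each step is cheap and deterministic at this level of abstraction, long-horizon policies can be trained with sparse rewards at a scale that full-physics simulation would make impractical.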

Robots pretrained in sim2real have mostly been limited to short-horizon tasks and visual navigation, without any interaction with the environment. Mobile pick-and-place is a long-horizon task that requires interacting with the environment and switching between different phases: navigation, picking, placing, and so on. This is typically very challenging for reinforcement learning and usually requires demonstrations or sophisticated hand-designed rewards. Our high-level abstraction and kinematic simulation let us learn long-horizon tasks with sparse rewards, without requiring the policy to reason about low-level physics.

Future areas of exploration

While we haven’t yet applied the visual cortex to our object rearrangement robot, we hope to integrate the two into a single system. With so many unpredictable variables in the real world, strong visual representations pretrained on a diverse set of egocentric videos showing many different activities and environments will be an important step toward building even better robots.

Voice is one area we are particularly interested in exploring. For example, instead of providing a task definition, natural language processing could be integrated, so someone could use their voice to tell their assistant to take the dishes from the dining room and move them to the kitchen sink. We also want to explore how our robot can better perform around people, such as by anticipating their needs and helping them with a multistep task, like baking a cake.

These are just some of the many areas that call for more research and exploration. We believe that with a strong visual cortex pretrained on egocentric video and visuomotor skills pretrained in simulation, these advancements could one day serve as building blocks for AI-powered experiences where virtual assistants and physical robots can assist humans and interact seamlessly with the virtual and physical world.

Read the paper: Adaptive Skill Coordination (ASC)

Read the paper: Visual Cortex

Get the Visual Cortex code

We would like to acknowledge the contributions of the following people:

Visual Cortex: Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Yixin Lin, Oleksandr Maksymets, and Aravind Rajeswaran

Adaptive Skill Coordination: Naoki Yokoyama, Alexander William Clegg, Eric Undersander, Sergio Arnaud, Jimmy Yang, and Sehoon Ha
