GenAug: Retargeting behaviors to unseen situations via Generative Augmentation

1University of Washington, 2Meta

We show that GenAug policies can achieve widespread real-world generalization for tabletop manipulation, even when they are only provided with a few demonstrations in a simple training environment.

GenAug: Generative Augmentation for Real-World Data Collection


Given the observation of the demonstration environment, GenAug automatically generates “augmented” RGBD images for entirely different and realistic environments, which display the visual realism and complexity of scenes that a robot might encounter in the real world.


Robot learning methods have the potential for widespread generalization across tasks, environments, and objects. However, these methods require large diverse datasets that are expensive to collect in real-world robotics settings. For robot learning to generalize, we must be able to leverage sources of data or priors beyond the robot’s own experience. In this work, we posit that image-text generative models, which are pre-trained on large corpora of web-scraped data, can serve as such a data source. We show that despite these generative models being trained on largely non-robotics data, they can serve as effective ways to impart priors into the process of robot learning in a way that enables widespread generalization. In particular, we show how pre-trained generative models can serve as effective tools for semantically meaningful data augmentation. By leveraging these pre-trained models for generating appropriate “semantic” data augmentations, we propose a system GenAug that is able to significantly improve policy generalization. We apply GenAug to tabletop manipulation tasks, showing the ability to re-target behavior to novel scenarios, while only requiring marginal amounts of real-world data. We demonstrate the efficacy of this system on a number of object manipulation problems in the real world, showing a 40% improvement in generalization to novel scenes and objects.


Real-World Experiments

By training on a dataset generated from only 10 demonstrations in the simple environment on the left, the robot is able
to solve the task in entirely different environments and with novel objects.

Simulation Experiments

To study the effectiveness of GenAug in more depth, we conduct large-scale simulation experiments against additional baselines.
In particular, we organize the baseline methods into (1) in-domain augmentation methods and (2) methods that learn from out-of-domain priors.

Table-top Manipulation Tasks

Behavior Cloning Tasks

In addition, we show that GenAug applies to behavior cloning tasks such as "close the top drawer", with a different robot, Fetch.
In particular, we collected 100 demonstrations and trained a CNN-MLP behavior cloning policy fine-tuned with R3M embeddings.
The input is the RGB observation and the output is an 8-dim action vector.
Tested on 100 unseen backgrounds from iGibson rooms, the policy trained with GenAug achieves a 60% success rate,
while the policy without GenAug achieves only 1%, an improvement of nearly 60 percentage points.
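The policy architecture described above can be sketched schematically as a frozen visual encoder followed by a small action head. The sketch below is a minimal NumPy stand-in, not the actual implementation: the embedding dimension, hidden size, and weights are all hypothetical placeholders (R3M's ResNet-50 variant produces 2048-dim features, which we assume here), and the RGB-to-embedding step is abstracted away.

```python
import numpy as np

# Schematic sketch of the CNN-MLP behavior-cloning head (hypothetical shapes):
# a frozen visual encoder (e.g. R3M) maps the RGB observation to a 2048-dim
# embedding, and a small MLP regresses the 8-dim action vector.
EMBED_DIM = 2048   # assumed R3M ResNet-50 embedding size
ACTION_DIM = 8     # as stated in the text

rng = np.random.default_rng(0)

def mlp_policy(embedding, w1, b1, w2, b2):
    """Two-layer MLP: embedding -> ReLU hidden layer -> linear action output."""
    h = np.maximum(0.0, embedding @ w1 + b1)
    return h @ w2 + b2

# Randomly initialized weights stand in for the trained policy.
w1 = rng.standard_normal((EMBED_DIM, 256)) * 0.01
b1 = np.zeros(256)
w2 = rng.standard_normal((256, ACTION_DIM)) * 0.01
b2 = np.zeros(ACTION_DIM)

embedding = rng.standard_normal(EMBED_DIM)   # placeholder R3M features
action = mlp_policy(embedding, w1, b1, w2, b2)
print(action.shape)  # (8,)
```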


1. GenAug:

Given a demonstration in one simple environment, GenAug can automatically add distractor objects,
change object textures, change object classes, and change the table and background.

2. Data Generation
We randomly select an augmentation mode and expand the initial dataset into a large and diverse augmented dataset.
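The mode-sampling loop can be sketched as follows. This is an illustrative outline under assumed names: the mode strings and the `augment_dataset` function are hypothetical, and the actual image generation (which GenAug delegates to a pretrained generative model) is abstracted into a comment.

```python
import random

# Hypothetical sketch of GenAug's data-generation loop: for each demonstration,
# sample one of the four augmentation modes and apply it. Mode names and the
# function signature are illustrative, not the paper's actual API.
MODES = ["add_distractors", "change_texture", "change_object_class", "change_background"]

def augment_dataset(demos, num_augmented_per_demo, seed=0):
    rng = random.Random(seed)
    augmented = []
    for demo in demos:
        for _ in range(num_augmented_per_demo):
            mode = rng.choice(MODES)
            # In the real system, a pretrained image-generative model would
            # re-render the RGBD scene here; we only record the sampled mode.
            augmented.append({"source": demo, "mode": mode})
    return augmented

dataset = augment_dataset(demos=["demo_0", "demo_1"], num_augmented_per_demo=5)
print(len(dataset))  # 10
```

Because the modes are sampled independently per generated example, a small set of demonstrations fans out into a much larger and more varied training set.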

3. System
We generate a large and diverse augmented dataset from a small number of human demonstrations collected in a simple environment, use this augmented dataset to train a language-conditioned robot policy, and deploy it in the real world.


4. Data Collection
To collect demonstrations in the real world, a user generates a small dataset by specifying pick and place locations. These 2D locations are projected to 3D points in the robot's coordinate frame using calibrated depth maps.
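The 2D-to-3D projection step can be sketched with a standard pinhole back-projection. The intrinsics `K` and the camera-to-robot transform `T_robot_cam` below are example values, not the paper's calibration; the identity extrinsics simply stand in for a real calibrated transform.

```python
import numpy as np

# Sketch of back-projecting a clicked pixel into robot coordinates, assuming
# a pinhole camera with intrinsics K and a calibrated 4x4 camera-to-robot
# transform T_robot_cam (both hypothetical values here).
def pixel_to_robot(u, v, depth, K, T_robot_cam):
    """Back-project pixel (u, v) with depth (meters) into the robot frame."""
    # Pixel -> 3D point in the camera frame (pinhole model).
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x, y, depth, 1.0])     # homogeneous camera-frame point
    # Camera frame -> robot frame via the calibrated extrinsics.
    return (T_robot_cam @ p_cam)[:3]

K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])        # example intrinsics
T_robot_cam = np.eye(4)                      # identity stands in for calibration

# A pixel at the principal point with 0.5 m depth lands on the optical axis.
p = pixel_to_robot(u=320, v=240, depth=0.5, K=K, T_robot_cam=T_robot_cam)
print(p)  # [0.  0.  0.5]
```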

Concurrent Work

Check out other awesome work that also leverages pretrained generative models for robot learning!
Scaling Robot Learning with Semantically Imagined Experience by Yu et al. also shows that applying text-to-image diffusion models on top of existing robotic manipulation datasets can improve policy performance in unseen scenes, e.g. with new objects and distractors.
CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning by Mandi et al. shows that using generative models such as Stable Diffusion to add distractors to the scene can robustify multi-task policy learning.
DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics by Kapelyukh et al. leverages DALL-E to generate goal images for rearrangement tasks.


@article{chen2023genaug,
  title={GenAug: Retargeting behaviors to unseen situations via Generative Augmentation},
  author={Chen, Zoey and Kiami, Sho and Gupta, Abhishek and Kumar, Vikash},
  journal={arXiv preprint arXiv:2302.06671},
  year={2023}
}