RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

CoRL 2024 (Oral Presentation)

1University of Southern California, 2Peking University, 3Stanford University
* Equal contributions

Overview of our proposed RAM: (a) We extract a unified affordance representation from in-the-wild, multi-source demonstrations, including robotic data, HOI data, and custom data, to construct a large-scale affordance memory. Given a language instruction, RAM hierarchically retrieves and transfers the 2D affordance from the memory and lifts it to 3D for robotic manipulation. (b-d) Our framework shows robust generalizability across diverse objects and embodiments in various settings.

Abstract

This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordances at scale from diverse sources of demonstrations, including robotic data, human-object interaction (HOI) data, and custom data, to construct a comprehensive affordance memory. Then, given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers this out-of-domain 2D affordance to an in-domain 3D executable affordance in a zero-shot, embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that RAM consistently outperforms existing methods on diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation.
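To make the pipeline concrete, here is a minimal Python sketch of the retrieve-and-transfer loop. It is an illustration under assumed interfaces, not the released implementation: the memory is assumed to store, per demonstration, a language embedding, an image embedding, a dense feature map, a 2D contact pixel, and a 2D post-contact direction, and all function and field names (hierarchical_retrieve, transfer_contact, lift_to_3d, task_emb, etc.) are hypothetical placeholders.

import numpy as np

# Illustrative sketch only (hypothetical names, not the released RAM code).
# Each memory entry: {"task_emb": (D,), "image_emb": (D,), "feat_map": (H, W, C),
#                     "contact_uv": (u, v), "direction_uv": (du, dv)}

def hierarchical_retrieve(instruction_emb, obs_image_emb, memory, k=5):
    """Stage 1: keep the k demos most similar to the instruction in language space.
    Stage 2: among those, return the demo most similar to the current observation."""
    lang_scores = np.array([d["task_emb"] @ instruction_emb for d in memory])
    keep = np.argsort(lang_scores)[-k:]
    vis_scores = [memory[i]["image_emb"] @ obs_image_emb for i in keep]
    return memory[keep[int(np.argmax(vis_scores))]]

def transfer_contact(demo, obs_feat_map):
    """Transfer the demo's 2D contact pixel to the observation via dense feature matching."""
    u, v = demo["contact_uv"]
    query = demo["feat_map"][v, u]                 # (C,) feature at the demo contact pixel
    H, W, C = obs_feat_map.shape
    sims = obs_feat_map.reshape(-1, C) @ query     # similarity to every observation pixel
    idx = int(np.argmax(sims))
    return idx % W, idx // W                       # best-matching (u, v) in the observation

def lift_to_3d(contact_uv, direction_uv, depth, intrinsics):
    """Lift the transferred 2D contact point and direction to a 3D executable affordance."""
    fx, fy, cx, cy = intrinsics
    def backproject(u, v):
        z = depth[v, u]
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
    contact_3d = backproject(*contact_uv)
    target_3d = backproject(contact_uv[0] + direction_uv[0], contact_uv[1] + direction_uv[1])
    direction_3d = target_3d - contact_3d
    return contact_3d, direction_3d / (np.linalg.norm(direction_3d) + 1e-8)

The output of lift_to_3d, a 3D contact point plus a post-contact direction, is embodiment-agnostic and could be handed to any downstream grasp or motion planner.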


Video

Real-Robot Rollouts


Zero-Shot Robotic Manipulation (2x)

NOTE: All rollouts are fully autonomous, WITHOUT any heuristic grasping.

Our method performs real-world everyday tasks in a zero-shot manner, generalizing across various objects, environments, and embodiments.
Open the drawer
Open the microwave

Open the drawer
Open the cabinet

Pick up the banana
Pick up the bottle

Pick up the mug
Pick up the bowl

One-Shot Visual Imitation with Human Preference (2x)

Apart from utilizing out-of-domain demonstration retrieval for manipulation, our method naturally adapts to one-shot visual imitation for better controllability, given a specific in-domain or out-of-domain demonstration.
Human picking up tissue paper
Robot picking up tissue paper

Human picking up tissue box
Robot picking up tissue box

The following Tom and Jerry example shows that our method can bridge the large domain gap between real-world and cartoon images, thanks to the generalizability of visual foundation models.
Cat opening the drawer
Robot opening the drawer

Cat closing the drawer
Robot closing the drawer

LLM/VLM Integration (3x)

Our method can also be readily integrated with LLMs/VLMs to handle open-set instructions and long-horizon tasks, decomposing them into subtasks suitable for affordance transfer or other action primitives. Deployment on the Unitree B1Z1 further demonstrates our method's cross-embodiment capability. A minimal sketch of this integration follows the example below.
"Clear the table."

BibTeX

@article{kuang2024ram,
  title={RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation},
  author={Kuang, Yuxuan and Ye, Junjie and Geng, Haoran and Mao, Jiageng and Deng, Congyue and Guibas, Leonidas and Wang, He and Wang, Yue},
  journal={arXiv preprint arXiv:2407.04689},
  year={2024}
}