RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

CoRL 2024 (Oral Presentation)

¹University of Southern California, ²Peking University, ³Stanford University

* Equal contributions

Abstract

This work proposes RAM, a retrieve-and-transfer framework for zero-shot robotic manipulation that generalizes across objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM uses a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordances at scale from diverse sources of demonstrations, including robotic data, human-object interaction (HOI) data, and custom data, to construct a comprehensive affordance memory. Then, given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers its out-of-domain 2D affordance into an in-domain, executable 3D affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation.
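To make the retrieval step concrete, here is a minimal Python sketch of one plausible reading of "hierarchically retrieves": narrow the affordance memory by task, then by object name, then pick the visually closest demonstration. This page does not spell out the actual stages, so everything here is an assumption for illustration: the `AffordanceEntry` fields, the `retrieve` function, the exact-match filters, and the use of cosine similarity over a precomputed feature vector are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class AffordanceEntry:
    """One demonstration in a hypothetical affordance memory."""
    task: str                          # e.g. "open", "pick up"
    object_name: str                   # e.g. "drawer", "mug"
    feature: np.ndarray                # visual embedding of the demo image (assumed precomputed)
    contact_px: Tuple[int, int]        # 2D contact point (u, v) in the demo image
    direction_2d: Tuple[float, float]  # 2D post-contact motion direction


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve(
    memory: List[AffordanceEntry],
    task: str,
    object_name: str,
    query_feature: np.ndarray,
) -> Optional[AffordanceEntry]:
    # Stage 1: keep only demonstrations of the requested task.
    candidates = [e for e in memory if e.task == task]
    if not candidates:
        return None
    # Stage 2: prefer demonstrations of the same object category;
    # fall back to every same-task demonstration if none match.
    same_object = [e for e in candidates if e.object_name == object_name]
    pool = same_object or candidates
    # Stage 3: return the visually most similar demonstration.
    return max(pool, key=lambda e: cosine(e.feature, query_feature))
```

In a real system the exact-match filters would likely be replaced by language or vision-language similarity, and `feature` would come from a visual foundation model; the coarse-to-fine structure is the point of the sketch.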

Zero-Shot Robotic Manipulation (2x)

NOTE: All rollouts are fully autonomous, WITHOUT any heuristic grasping.

Our method is able to perform real-world everyday tasks in a zero-shot manner, featuring generalizability across various objects, environments, and embodiments.

Open the drawer

Open the microwave

Open the drawer

Open the cabinet

Pick up the banana

Pick up the bottle

Pick up the mug

Pick up the bowl

One-Shot Visual Imitation with Human Preference (2x)

Besides retrieving out-of-domain demonstrations, our method naturally supports one-shot visual imitation: given a specific in-domain or out-of-domain demonstration, it imitates that demonstration directly, which gives the user finer control over how the task is performed.
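When the demonstration is supplied by the user, what remains is the transfer step the abstract describes: turning a 2D affordance into an executable 3D one. The sketch below shows a generic way to do that with a depth image and pinhole camera intrinsics. It is not taken from the paper, whose lifting procedure may differ; the function names, the `step_px` parameter, and the assumption that the 2D contact point and direction have already been mapped into the robot's camera view are all hypothetical.

```python
import numpy as np


def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a 3D point in the camera frame (pinhole model)."""
    z = float(depth[v, u])  # depth is indexed [row, column] = [v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


def lift_affordance(contact_px, direction_2d, depth, intrinsics, step_px=10):
    """Lift a 2D contact point and 2D direction to a 3D point and unit 3D direction.

    intrinsics is (fx, fy, cx, cy); step_px is how far along the 2D direction
    the second sample pixel is taken (a hypothetical tuning parameter).
    """
    fx, fy, cx, cy = intrinsics
    u, v = contact_px
    p0 = backproject(u, v, depth, fx, fy, cx, cy)

    # Normalize the 2D direction, then step along it to a second pixel.
    d = np.asarray(direction_2d, dtype=float)
    d = d / (np.linalg.norm(d) + 1e-8)
    h, w = depth.shape
    u1 = int(np.clip(np.rint(u + step_px * d[0]), 0, w - 1))
    v1 = int(np.clip(np.rint(v + step_px * d[1]), 0, h - 1))
    p1 = backproject(u1, v1, depth, fx, fy, cx, cy)

    # The 3D direction is the normalized difference of the two lifted points.
    delta = p1 - p0
    norm = np.linalg.norm(delta)
    if norm < 1e-8:
        raise ValueError("degenerate direction: both samples lift to the same 3D point")
    return p0, delta / norm
```

Sampling only two pixels is fragile on noisy depth; a more robust variant might average several samples or fit a local surface, but that is beyond this sketch.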

Human picking up tissue paper

Robot picking up tissue paper

Human picking up tissue box

Robot picking up tissue box

The following Tom and Jerry example shows that our method can bridge the large domain gap between cartoon images and the real world, thanks to the generalizability of visual foundation models.

Cat opening the drawer

Robot opening the drawer

Cat closing the drawer

Robot closing the drawer

LLM/VLM Integration (3x)

Our method can also be easily integrated with LLMs/VLMs to handle open-set instructions and long-horizon tasks, by decomposing them into smaller subtasks suitable for affordance transfer and other action primitives. Deployment on the Unitree B1Z1 also demonstrates our method's cross-embodiment nature.
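The decomposition idea can be sketched as a plan of (primitive, target) steps dispatched to callable primitives. This is only an illustration of the control flow: `plan_with_llm_stub` stands in for an LLM/VLM call and returns a hard-coded plan, and the primitive names and target objects are hypothetical, not the ones used in the actual system.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (primitive name, target object)


def plan_with_llm_stub(instruction: str) -> List[Step]:
    """Placeholder for an LLM/VLM planner; returns a hard-coded, hypothetical plan."""
    plans = {
        "clear the table": [
            ("pick", "bottle"),
            ("place", "bin"),
            ("pick", "mug"),
            ("place", "bin"),
        ],
    }
    # Normalize the instruction: drop surrounding quotes, periods, and case.
    key = instruction.strip().strip('."').lower()
    return plans.get(key, [])


def execute(plan: List[Step], primitives: Dict[str, Callable[[str], str]]) -> List[str]:
    """Run each step with its matching primitive and collect the results."""
    log = []
    for name, target in plan:
        if name not in primitives:
            raise KeyError(f"no primitive registered for step '{name}'")
        log.append(primitives[name](target))
    return log
```

In this framing, a step such as `pick` would be backed by affordance transfer, while steps such as `place` could be handled by simpler action primitives.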

"Clear the table."

BibTeX

@article{kuang2024ram,
  title={RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation},
  author={Kuang, Yuxuan and Ye, Junjie and Geng, Haoran and Mao, Jiageng and Deng, Congyue and Guibas, Leonidas and Wang, He and Wang, Yue},
  journal={arXiv preprint arXiv:2407.04689},
  year={2024}
}