GenDP: 3D Semantic Fields for Category-Level
Generalizable Diffusion Policy

CoRL 2024

1Columbia University, 2University of Illinois Urbana-Champaign, 3Boston Dynamics AI Institute


GenDP is an imitation learning framework that achieves category-level generalization by using 3D semantic fields.


GenDP shows category-level generalization across instances with diverse geometries, textures, and appearances
in a variety of challenging manipulation tasks that require semantic understanding, such as spreading toothpaste.

Abstract

Diffusion-based policies have shown remarkable capability in executing complex robotic manipulation tasks, but they lack explicit characterization of geometry and semantics, which often limits their ability to generalize to unseen objects and layouts. To enhance the generalization capabilities of Diffusion Policy, we introduce a novel framework that incorporates explicit spatial and semantic information via 3D semantic fields. We generate 3D descriptor fields from multi-view RGBD observations with large vision foundation models, then compare these descriptor fields against reference descriptors to obtain semantic fields. The proposed method explicitly considers geometry and semantics, enabling strong generalization in tasks that require category-level generalization, resolution of geometric ambiguities, and attention to subtle geometric details. We evaluate our method across eight tasks involving articulated objects and instances with varying shapes and textures from multiple object categories. Our method demonstrates its effectiveness by increasing Diffusion Policy's average success rate on unseen instances from 20% to 93%. Additionally, we provide a detailed analysis and visualization to interpret the sources of the performance gain and to explain how our method generalizes to novel instances.



Video




Method Overview. The top row (a) shows a sequence of real policy rollouts in the shoe-aligning task. We first take multi-view RGBD observations (i), then extract the 3D descriptor field, with each point possessing a corresponding high-dimensional descriptor (ii), following D3Fields (Wang et al., 2023). We then select reference features from 2D reference images. By computing the cosine similarity between the descriptor field and the 2D reference semantic features, we obtain several semantic fields (iii). These semantic fields, concatenated with the point cloud, are fed into PointNet++ and the diffusion policy to output predicted actions (iv).
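To make the semantic-field step concrete, below is a minimal PyTorch sketch (not the authors' released code) of the computation described above: normalize the per-point descriptors and the reference descriptors, take their cosine similarities to form one semantic-field channel per reference feature, and concatenate the result with the point cloud before it is encoded by PointNet++. All tensor shapes, names, and dimensions are illustrative assumptions.

import torch
import torch.nn.functional as F

def compute_semantic_fields(descriptors, reference_descriptors):
    # descriptors:           (N, D) per-point 3D descriptor field
    # reference_descriptors: (K, D) features selected from 2D reference images
    # returns:               (N, K) semantic fields, one channel per reference
    desc = F.normalize(descriptors, dim=-1)
    refs = F.normalize(reference_descriptors, dim=-1)
    return desc @ refs.T  # cosine similarity for each (point, reference) pair

# Illustrative shapes: 4096 points, 384-dim descriptors, 8 reference features.
points = torch.randn(4096, 3)
descriptors = torch.randn(4096, 384)
references = torch.randn(8, 384)

semantic_fields = compute_semantic_fields(descriptors, references)  # (4096, 8)
policy_input = torch.cat([points, semantic_fields], dim=-1)         # (4096, 11)
# policy_input is then passed to the PointNet++ encoder and the diffusion policy.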



Interactive Visualization

The project page provides an interactive viewer of GenDP rollouts with two display modes: the raw observation and the 3D semantic field with predicted actions.


BibTeX

@inproceedings{wang2024gendp,
    title={GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy},
    author={Wang, Yixuan and Yin, Guang and Huang, Binghao and Kelestemur, Tarik and Wang, Jiuguang and Li, Yunzhu},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024}
}