Vision Foundation Models for an embodiment and environment agnostic scene representation for robotic manipulation

Kevin Riou; Kevin Subrin; Patrick Le Callet

Conference paper, Year: 2024

Kevin Riou (1), Kevin Subrin (2), Patrick Le Callet (1, 3)

1 Image Perception Interaction
2 Robots and Machines for Manufacturing, Society and Services
3 Institut universitaire de France

Abstract

Traditional Imitation Learning (IL) approaches often rely on teleoperation to collect training data, which ensures that the action and observation spaces are consistent between training and deployment. However, teleoperation slows data acquisition, distorts expert behavior, and yields data whose quality depends on the operator's teleoperation skills. To overcome these limitations, IL can instead be trained on human demonstrations, which requires visual representations that are agnostic to both embodiment and environment. Recent advances in Vision Foundation Models, such as Grounded-Segment-Anything (Grounded-SAM), offer a solution by extracting meaningful scene information while filtering out irrelevant details, without manual annotation. In this work, we collected 50 human video demonstrations of a manipulation task from the RLBench benchmark. We evaluated Grounded-SAM's ability to automatically annotate objects of interest and proposed a 3D visual representation using depth maps. This representation was used to train a diffusion policy, which successfully generalized to simulated robot deployment in RLBench, despite being trained exclusively on real-world human demonstrations. Our results demonstrate that efficient training can be achieved with just 50 demonstrations and half an hour of training time.
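The abstract mentions building a 3D visual representation from segmentation masks and depth maps, without giving details. As an illustration only, and not the authors' implementation, the sketch below shows one common way such a representation could be formed: back-projecting the depth pixels inside an object mask into 3D camera-frame points with a pinhole camera model. The function name `masked_depth_to_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy inputs are all assumptions made for this example; in practice the mask would come from a segmentation model such as Grounded-SAM.

```python
import numpy as np


def masked_depth_to_points(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D camera-frame points.

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy), and depth expressed in metres.
    """
    depth = np.asarray(depth, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if depth.shape != mask.shape:
        raise ValueError("depth and mask must have the same shape")

    # Keep only pixels inside the mask that carry a valid depth reading.
    valid = mask & np.isfinite(depth) & (depth > 0)
    v, u = np.nonzero(valid)  # v: row index, u: column index

    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # shape (N, 3)


# Toy inputs (assumed values): a flat surface 2 m away and a 2x2 object mask.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

points = masked_depth_to_points(depth, mask, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points.shape)         # (4, 3)
print(points.mean(axis=0))  # centroid of the masked object
```

Because only masked pixels are kept, the background and the demonstrator's body drop out of the point set, which is the sense in which a representation of this kind can be agnostic to embodiment and environment.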

Keywords

Imitation Learning; vision-and-language models; scene representation

Domains

Computer Science [cs]

Main file

iros_workshop_2024-2.pdf (1.77 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-04751375

Submitted on: Thursday, October 24, 2024, 10:25:08

Last modified on: Saturday, October 26, 2024, 03:21:07

Dates and versions

hal-04751375, version 1 (24-10-2024)

Identifiers

HAL Id: hal-04751375, version 1

Cite

Kevin Riou, Kevin Subrin, Patrick Le Callet. Vision Foundation Models for an embodiment and environment agnostic scene representation for robotic manipulation. International Conference on Intelligent Robots and Systems (IROS), on Brain over Brawn Workshop (BoB) (https://bob-workshop.github.io/), Oct 2024, Abu Dhabi, United Arab Emirates. ⟨hal-04751375⟩

Collections

CNRS INRIA EC-NANTES UNAM LS2N LS2N-IPI LS2N-ROMAS NANTES-UNIVERSITE
