Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
Conference paper, 2023

Abstract

We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists of answering questions about named entities grounded in a visual context using a Knowledge Base. The interaction between the modalities is therefore paramount for retrieving information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists of treating a sentence as a pseudo-question and its context as a pseudo-relevant passage, and it is extended to the multimodal setting by considering images located near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to relative gains of 9% in MRR for retrieval and 15% in F1 for reading comprehension over a baseline without pre-training.
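
The pseudo-question and pseudo-passage construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sentence` and `Example` classes, the `make_ict_example` function, the `window` parameter, and the rule used to pick the passage image (the first image found in the context) are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sentence:
    text: str
    # Hypothetical field: identifier of an image located near this sentence.
    image: Optional[str] = None


@dataclass
class Example:
    query_text: str
    query_image: Optional[str]
    passage_text: str
    passage_image: Optional[str]


def make_ict_example(sentences: List[Sentence], index: int, window: int = 2) -> Example:
    """Build one (pseudo-question, pseudo-relevant passage) pair.

    The sentence at `index` plays the role of the question; the
    surrounding sentences (up to `window` on each side) form the passage.
    The query sentence is removed from the passage (assumption: always).
    """
    query = sentences[index]
    lo = max(0, index - window)
    hi = min(len(sentences), index + window + 1)
    context = [s for i, s in enumerate(sentences[lo:hi], start=lo) if i != index]
    passage_text = " ".join(s.text for s in context)
    # Assumption: the passage is paired with the first image found in its context.
    passage_image = next((s.image for s in context if s.image is not None), None)
    return Example(query.text, query.image, passage_text, passage_image)


if __name__ == "__main__":
    document = [
        Sentence("The tower was completed in 1889.", image="tower.jpg"),
        Sentence("It was designed for the World's Fair."),
        Sentence("Millions of people visit it every year.", image="crowd.jpg"),
    ]
    print(make_ict_example(document, index=1, window=1))
```

Pairs built this way could then feed a retriever trained to match each pseudo-question with its own passage among other passages, which is the usual setup for Inverse Cloze Task pre-training in textual Question Answering.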

Dates and versions

hal-03933089, version 1 (10-01-2023)
hal-03933089, version 2 (15-12-2023)

Identifiers

  • HAL Id: hal-03933089, version 2

Cite

Paul Lerner, Olivier Ferret, Camille Guinaudeau. Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering. European Conference on Information Retrieval (ECIR 2023), Apr 2023, Dublin, Ireland. ⟨hal-03933089v2⟩