Operation LiLi: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages

Mathilde Hutin; Marc Allassonnière-Tang

doi:10.3390/languages7030234

Journal article, Languages. Year: 2022

Mathilde Hutin

Role: Author
PersonId : 744483
IdHAL : mathilde-hutin
ORCID : 0000-0002-6411-5478

Laboratoire Interdisciplinaire des Sciences du Numérique

Traitement du Langage Parlé - LISN

Marc Allassonnière-Tang

Role: Author
PersonId : 183666
IdHAL : marc-at
ORCID : 0000-0002-9057-642X
IdRef : 269821023

Éco-Anthropologie

Abstract

Less-resourced languages are usually left out of phonetic studies based on large corpora. We contribute to the recent efforts to fill this gap by assessing how to use open-access, crowd-sourced audio data from Lingua Libre for phonetic research. Lingua Libre is a participative linguistic library developed by Wikimedia France in 2015. It contains more than 670k recordings in approximately 150 languages across nearly 740 speakers. As a proof of concept, we consider the Inventory Size Hypothesis, which predicts that, in a given system, variation in the realization of each vowel will be inversely related to the number of vowel categories. We investigate data from 10 languages with various numbers of vowel categories, i.e., German, Afrikaans, French, Catalan, Italian, Romanian, Polish, Russian, Spanish, and Basque. Audio files are extracted from Lingua Libre to be aligned and segmented using the Munich Automatic Segmentation System. Information on the formants of the vowel segments is then extracted to measure how vowels expand in the acoustic space and whether this is correlated with the number of vowel categories in the language. The results provide valuable insight into the question of vowel dispersion and demonstrate the wealth of information that crowd-sourced data has to offer.
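The last step described in the abstract, relating vowel variability to inventory size, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the dispersion measure (mean Euclidean distance of F1/F2 tokens from their category centroid), the function names, and the input layout are all assumptions, since the abstract does not specify them. Under the Inventory Size Hypothesis, one would expect the correlation between inventory size and mean dispersion to be negative.

```python
from math import sqrt
from statistics import mean


def vowel_dispersion(tokens):
    """Mean Euclidean distance of (F1, F2) tokens from their centroid.

    `tokens` is a list of (F1, F2) pairs in Hz for ONE vowel category.
    This particular measure is an assumption made for illustration.
    """
    c1 = mean(f1 for f1, _ in tokens)
    c2 = mean(f2 for _, f2 in tokens)
    return mean(sqrt((f1 - c1) ** 2 + (f2 - c2) ** 2) for f1, f2 in tokens)


def language_dispersion(vowels):
    """Average the per-category dispersion over all vowels of a language.

    `vowels` maps a vowel label to its list of (F1, F2) tokens.
    """
    return mean(vowel_dispersion(toks) for toks in vowels.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def inventory_size_correlation(languages):
    """Correlate the number of vowel categories with mean dispersion.

    `languages` maps a language name to its `vowels` dictionary.
    """
    sizes = [len(vowels) for vowels in languages.values()]
    spreads = [language_dispersion(vowels) for vowels in languages.values()]
    return pearson(sizes, spreads)
```

In practice, the `languages` dictionary would be filled with formant values measured on the vowel segments produced by the alignment step; no measured values are reproduced here.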

Domains

Linguistics

Main file

languages-07-00234.pdf (3.29 MB)

Origin: Publication funded by an institution

https://hal.science/hal-03778651

Submitted on: Thursday, 8 December 2022, 14:35:58

Last modified on: Wednesday, 24 April 2024, 10:26:31

Dates and versions

hal-03778651 , version 1 (08-12-2022)

Identifiers

HAL Id : hal-03778651 , version 1
DOI : 10.3390/languages7030234

Cite

Mathilde Hutin, Marc Allassonnière-Tang. Operation LiLi: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages. Languages, 2022, 7 (3), pp.234. ⟨10.3390/languages7030234⟩. ⟨hal-03778651⟩

Collections

MNHN CNRS INRIA CENTRALESUPELEC UNIV-PARIS-SACLAY ANR LISN GS-COMPUTER-SCIENCE LISN-TLP INEE-CNRS ECO-ANTHROPOLOGIE
