Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

Angelina McMillan-Major; Zaid Alyafeai; Stella Biderman; Kimbo Chen; Francesco de Toni; Gérard Dupont; Hady Elsahar; Chris Emezue; Alham Fikri Aji; Suzana Ilić; Nurulaqilla Khamis; Colin Leong; Maraim Masoud; Aitor Soroa; Pedro Ortiz Suarez; Zeerak Talat; Daniel van Strien; Yacine Jernite

Preprint, Working Paper. Year: 2022

Angelina McMillan-Major

Role: Author

Hugging Face

University of Washington [Seattle]

Zaid Alyafeai

Role: Author

King Fahd University of Petroleum and Minerals

Stella Biderman

Role: Author

Booz Allen Hamilton Inc

EleutherAI

Kimbo Chen

Role: Author

Independent researcher

Francesco de Toni

Role: Author

The University of Western Australia

Gérard Dupont

Role: Author

Independent researcher

Hady Elsahar

Role: Author

Independent researcher

Chris Emezue

Role: Author

Independent researcher

Alham Fikri Aji

Role: Author

Independent researcher

Suzana Ilić

Role: Author

Hugging Face

Nurulaqilla Khamis

Role: Author

Independent researcher

Colin Leong

Role: Author

Independent researcher

Maraim Masoud

Role: Author

Independent researcher

Aitor Soroa

Role: Author

University of the Basque Country (Universidad del País Vasco / Euskal Herriko Unibertsitatea) [Spain]

Pedro Ortiz Suarez

Role: Author
PersonId: 178412
IdHAL: pedro-ortiz-suarez
ORCID: 0000-0003-0343-8852
IdRef: 264210743

Automatic Language Modelling and ANAlysis & Computational Humanities

Sorbonne Université

Zeerak Talat

Role: Author

Independent researcher

Daniel van Strien

Role: Author

British Library

Yacine Jernite

Role: Author

Hugging Face

Abstract

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.

Keywords

Collaborative Resource Construction & Crowdsourcing; LR Infrastructures and Architectures; Tools, Systems, Applications

Domains

Computation and Language [cs.CL]

Contributor: Pedro Ortiz Suarez

https://inria.hal.science/hal-03550289

Submitted on: Tuesday, February 1, 2022, 00:16:45

Last modified on: Thursday, October 31, 2024, 11:56:04

Dates and versions

hal-03550289, version 1 (01-02-2022)

License

Attribution

Identifiers

HAL Id: hal-03550289, version 1
arXiv: 2201.10066

Cite

Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco de Toni, et al. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. 2022. ⟨hal-03550289⟩

Collections

UNIV-RENNES1 INRIA IRISA INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE ANR UR1-MATH-NUM