Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS - Groupe d'Études et de Recherche Interdisciplinaire en Information et COmmunication
Conference Papers Year : 2022

Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS

Abstract

Consolidating research information improves the quality of data integration, reduces duplicates between systems, and provides the flexibility and scalability required when processing varied data sources. We assume that combining a data lake as the data repository with a data wrangling process should allow low-quality or "bad" data to be identified and eliminated, leaving only high-quality data, referred to as "research information" in the Research Information System (RIS) domain, from which the most accurate insights can be gained. This, in turn, would increase the value of both the data themselves and the data-driven actions that contribute to more accurate and better-informed decision-making. The cleansed research information is then entered into the appropriate target Current Research Information System (CRIS) so that it can be used in further data processing steps. To minimize the effort required for the analysis, proliferation and enrichment of large amounts of data and metadata, and to achieve far-reaching added value in information retrieval for CRIS employees, developers and end users, this paper outlines the concept of a curated data lake with a data wrangling process and shows how it can be used in CRIS to clean up data from heterogeneous data sources during their collection and integration.
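As a rough illustration of the idea in the abstract, the following Python sketch treats a list of raw records as a stand-in for the data lake and applies a small wrangling step before anything would be loaded into a CRIS. It is not the authors' implementation: the field names, the required-field rule, the deduplication key, and the sample records (including the DOI) are all assumptions made up for this example.

```python
from typing import Iterable

# Hypothetical quality rule: a record is usable only if these fields are filled.
REQUIRED_FIELDS = ("title", "year")


def normalize(record: dict) -> dict:
    """Trim whitespace and lowercase the DOI so equal records compare equal."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    if cleaned.get("doi"):
        cleaned["doi"] = cleaned["doi"].lower()
    return cleaned


def wrangle(raw_records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split raw data-lake records into (clean, rejected)."""
    clean, rejected, seen = [], [], set()
    for raw in raw_records:
        record = normalize(raw)
        # Reject records that miss a required field ("bad" data).
        if any(not record.get(field) for field in REQUIRED_FIELDS):
            rejected.append(record)
            continue
        # Deduplicate on DOI when present, else on lowercased title plus year.
        key = record.get("doi") or (record["title"].lower(), record["year"])
        if key in seen:
            rejected.append(record)
            continue
        seen.add(key)
        clean.append(record)
    return clean, rejected


# Made-up records from two hypothetical source systems, "A" and "B".
data_lake = [
    {"source": "A", "title": "Data Quality in CRIS", "year": 2022, "doi": "10.1000/XYZ"},
    {"source": "B", "title": " Data Quality in CRIS ", "year": 2022, "doi": "10.1000/xyz"},
    {"source": "B", "title": "", "year": 2021, "doi": None},
]
clean, rejected = wrangle(data_lake)
print(len(clean), len(rejected))
```

Only the records in `clean` would be passed on to the target system; keeping the `rejected` list rather than discarding it leaves room for later inspection or repair of the low-quality data.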
Main file
CRIS2022_Azeroual et al.- final HAL.pdf (977.84 KB)
Origin : Files produced by the author(s)

Dates and versions

hal-03694519 , version 1 (13-06-2022)
hal-03694519 , version 2 (17-06-2022)

Identifiers

  • HAL Id : hal-03694519 , version 1

Cite

Otmane Azeroual, Joachim Schöpfel, Dragan Ivanovic, Anastasija Nikiforova. Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS. CRIS2022: 15th International Conference on Current Research Information Systems, May 2022, Dubrovnik, Croatia. ⟨hal-03694519v1⟩
