Data Preparation as a Service Based on Apache Spark

Nivethika Mahasivam; Nikolay Nikolov; Dina Sukhobok; Dumitru Roman

doi:10.1007/978-3-319-67262-5_10

Conference paper, Year : 2017

Nivethika Mahasivam

Function : Author
PersonId : 1026125

SINTEF Digital - Stiftelsen for INdustriell og TEknisk Forskning Digital [Trondheim]

Nikolay Nikolov

Function : Author
PersonId : 1026126

SINTEF Digital - Stiftelsen for INdustriell og TEknisk Forskning Digital [Trondheim]

Dina Sukhobok

Function : Author
PersonId : 1026127

SINTEF Digital - Stiftelsen for INdustriell og TEknisk Forskning Digital [Trondheim]

Dumitru Roman

Function : Author
PersonId : 1000503

SINTEF Digital - Stiftelsen for INdustriell og TEknisk Forskning Digital [Trondheim]

Abstract

Data preparation is the process of collecting, cleaning and consolidating raw datasets into cleaned data of a certain quality. It is an important aspect of almost every data analysis process, yet it remains tedious and time-consuming. The complexity of the process is further increased by the recent tendency to derive knowledge from very large datasets. Existing data preparation tools provide limited capabilities for processing such large volumes of data effectively. On the other hand, frameworks and software libraries that do address the requirements of big data require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges of data preparation at large scale while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and integration with Grafterizer, an existing framework for data preparation. Our solution is based on Apache Spark and exposes application programming interfaces (APIs) for integration with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer.

Domains

Computer Science [cs]

Main file

449571_1_En_10_Chapter.pdf (945.7 KB)

Origin : Files produced by the author(s)
Licence : CC BY 4.0 - Attribution

https://inria.hal.science/hal-01677626

Submitted on : Monday, January 8, 2018, 3:01:36 PM

Last modification on : Thursday, December 3, 2020, 9:26:02 AM

Long-term archiving on : Thursday, May 3, 2018, 4:33:42 PM

Dates and versions

hal-01677626 , version 1 (08-01-2018)

Licence

CC BY 4.0 - Attribution

Identifiers

HAL Id : hal-01677626 , version 1
DOI : 10.1007/978-3-319-67262-5_10

Cite

Nivethika Mahasivam, Nikolay Nikolov, Dina Sukhobok, Dumitru Roman. Data Preparation as a Service Based on Apache Spark. 6th European Conference on Service-Oriented and Cloud Computing (ESOCC), Sep 2017, Oslo, Norway. pp.125-139, ⟨10.1007/978-3-319-67262-5_10⟩. ⟨hal-01677626⟩
