Big Data Preprocessing as the Bridge between Big Data and Smart Data: BigDaPSpark and BigDaPFlink Libraries
Metadata
Author
García Gil, Diego Jesús; Alcalde Barros, Alejandro; Luengo Martín, Julián; García López, Salvador; Herrera Triguero, Francisco
Publisher
SciTePress
Subject
Big Data; Apache Spark; Data Preprocessing; Smart Data; Imbalanced Classification
Date
2019
Bibliographic reference
García-Gil, D., Alcalde-Barros, A., Luengo, J., García, S., & Herrera, F. (2019). Big Data Preprocessing as the Bridge between Big Data and Smart Data: BigDaPSpark and BigDaPFlink Libraries. In IoTBDS (pp. 324-331). [DOI: 10.5220/0007738503240331]
Sponsor
Spanish National Research Project TIN2017-89517-P
Abstract
With the advent of Big Data, terabytes of data are generated and stored every second. This raw data is far from
perfect: it contains many imperfections (noise, missing values, etc.) and is not suitable for analysis,
as it would lead to wrong conclusions. Data preprocessing is the set of techniques devoted to polishing, cleaning,
fixing, and improving that raw data. With preprocessed data, we can find more patterns
and better explain the underlying distribution of the data. This is what is called Smart Data: raw data
that has been preprocessed and is ready to be analyzed, data that contains valuable information that will
lead to knowledge. In this work, we present two Big Data libraries for achieving Smart Data from Big Data,
BigDaPSpark and BigDaPFlink. They are built on top of two Big Data frameworks, Apache Spark and Apache
Flink. Both libraries contain a series of algorithms for Big Data preprocessing, including noise cleaning,
discretization, and data reduction, among many others. Additionally, we illustrate the usage of the libraries
with two use cases.
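
As a rough illustration of the kind of preprocessing the abstract mentions, the following single-machine Python sketch imputes missing values and then applies equal-width discretization. It is not the API of BigDaPSpark or BigDaPFlink, and the function names `impute_missing` and `equal_width_bins` are hypothetical; the libraries themselves implement distributed algorithms on Apache Spark and Apache Flink.

```python
from statistics import mean
from typing import List, Optional


def impute_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Map each value to a bin index in [0, n_bins - 1] using equal-width intervals."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no ordering information: use a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy column with two missing entries (assumed data, for illustration only).
raw = [1.0, None, 3.0, 10.0, None, 6.0]
clean = impute_missing(raw)
bins = equal_width_bins(clean, n_bins=3)
```

Mean imputation and equal-width binning are chosen here only because they are simple to follow; they stand in for the more elaborate noise cleaning, discretization, and data reduction methods the abstract refers to.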