Big data preprocessing: enabling smart data
Luengo Martín, Julián; García Gil, Diego Jesús; Ramírez-Gallego, Sergio; García López, Salvador; Herrera Triguero, Francisco
Big Data; Machine Learning; Information Systems and Communication Service
Recent years have seen massive growth in the scale of data, a key factor in the Big Data scenario. Big Data can be defined as data of high volume, velocity, and variety that requires new high-performance processing. Addressing Big Data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. Because this scenario is very common in real-life applications, the interest of researchers and practitioners in the topic has grown significantly in recent years. Among Big Data disciplines, data mining is a key topic, enabling the user to extract knowledge from enormous amounts of raw data. However, this raw data is not always in the best condition to be treated, analyzed, and surveyed. Applying preprocessing techniques is a must in real-world applications to ensure quality data, or Smart Data, for proper treatment and analysis. The term Smart Data refers to the challenge of transforming raw data into quality data that can be appropriately exploited to obtain valuable insights. This book aims to offer a general and comprehensible overview of data preprocessing in Big Data, enabling Smart Data. It contains a comprehensive description of the topic and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different Big Data scenarios in which applying data preprocessing techniques can pose a real challenge.
Data preprocessing is a multifaceted discipline that includes data preparation, comprising integration, cleaning, normalization, and transformation of data; data reduction tasks such as feature selection, instance selection, and discretization; and resampling techniques to deal with imbalanced data. This book stresses the gap between standard data preprocessing techniques and their Big Data equivalents, showing the difficulties involved in developing the latter. It also covers the different approaches that have been traditionally applied and the latest proposals in Big Data preprocessing. Specifically, it reviews data reduction methods, imperfect data approaches, discretization techniques, and imbalanced data preprocessing solutions. Finally, this book describes the most popular Big Data libraries for machine learning, focusing on their data preprocessing algorithms and utilities.
2025-01-16T11:22:54Z 2025-01-16T11:22:54Z 2020-03-16
book
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., & Herrera, F. (2020). Big data preprocessing. Cham: Springer.
https://hdl.handle.net/10481/99405
https://doi.org/10.1007/978-3-030-39105-8
eng
http://creativecommons.org/licenses/by-nc-nd/4.0/
open access
Attribution-NonCommercial-NoDerivatives 4.0 International
Springer Cham