FDR2-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems

Basgall, María José; Naiouf, Marcelo; Fernández Hilario, Alberto Luis

doi:10.3390/electronics10151757

dc.contributor.author	Basgall, María José
dc.contributor.author	Naiouf, Marcelo
dc.contributor.author	Fernández Hilario, Alberto Luis
dc.date.accessioned	2021-09-23T08:13:55Z
dc.date.available	2021-09-23T08:13:55Z
dc.date.issued	2021
dc.identifier.citation	Basgall, M.J.; Naiouf, M.; Fernández, A. FDR2-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems. Electronics 2021, 10, 1757. https://doi.org/10.3390/electronics10151757	es_ES
dc.identifier.uri	http://hdl.handle.net/10481/70391
dc.description.abstract	This paper presents FDR2-BD, a methodological data condensation approach for reducing tabular big datasets in classification problems. The key to the proposal is to analyze the data in a dual way (vertical and horizontal), combining feature selection, which generates dense clusters of data, with uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model's predictive quality is kept within a range determined by a user-defined threshold. Its robustness is built on a hyper-parametrization process in which all data are taken into consideration by following a k-fold procedure. It is also fast and scalable, as it relies on fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, outperforming state-of-the-art solutions such as FCNN_MR, which barely reaches 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality within roughly 1% of the baseline.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	MDPI	es_ES
dc.rights	Atribución 3.0 España	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.subject	Big Data	es_ES
dc.subject	Data reduction	es_ES
dc.subject	Classification	es_ES
dc.subject	Preprocessing techniques	es_ES
dc.subject	Apache Spark	es_ES
dc.title	FDR2-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.3390/electronics10151757

File(s) in this item

Name: electronics-10-01757-v2.pdf
Size: 1.213 MB
Format: PDF

This item appears in the following collection(s)

DCCIA - Artículos

Except where otherwise noted, this item's license is described as Attribution 3.0 Spain (Atribución 3.0 España)