Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data

Maillo Hidalgo, Jesús; Triguero, Isaac; Herrera Triguero, Francisco

doi:10.1109/ACCESS.2020.2991800

dc.contributor.author	Maillo Hidalgo, Jesús
dc.contributor.author	Triguero, Isaac
dc.contributor.author	Herrera Triguero, Francisco
dc.date.accessioned	2020-06-29T11:56:43Z
dc.date.available	2020-06-29T11:56:43Z
dc.date.issued	2020-05
dc.identifier.citation	Maillo, J., Triguero, I., & Herrera, F. (2020). Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data. IEEE Access, 8, 87918-87928. [DOI: 10.1109/ACCESS.2020.2991800]	es_ES
dc.identifier.uri	http://hdl.handle.net/10481/62787
dc.description	P. J. Maillo holds an FPU scholarship from the Spanish Ministry of Education.	es_ES
dc.description.abstract	The importance of knowing the descriptive properties of a dataset when tackling a data science problem is recognized. Information about the redundancy, complexity and density of a problem allows us to decide which data preprocessing and machine learning techniques are most suitable. In classification problems, there are multiple metrics to describe the overlap of features between classes, class imbalance or separability, among others. However, these metrics may not scale well to big datasets, or may simply not be sufficiently informative in this context. In this paper, we provide a package of metrics for big data classification problems. In particular, we propose two new big data metrics, Neighborhood Density and Decision Tree Progression, which study density and accuracy progression by discarding half of the samples. In addition, we enable a number of basic metrics to handle big data. The experimental study carried out on standard big data classification problems shows that our metrics can quickly characterize big datasets. We identified a clear redundancy of information in most datasets, such that randomly discarding 75% of the samples does not drastically affect the accuracy of the classifiers used. Thus, the proposed big data metrics, which are available as a Spark-Package, provide a fast assessment of the shape of a classification dataset prior to applying big data preprocessing, toward smart data.	es_ES
dc.description.sponsorship	This work was supported by the Spanish National Research Project under Grant TIN2017-89517.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)	es_ES
dc.rights	Atribución 3.0 España	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.subject	Big data	es_ES
dc.subject	Smart Data	es_ES
dc.subject	Classification	es_ES
dc.subject	Redundancy	es_ES
dc.subject	Complexity	es_ES
dc.subject	Apache spark	es_ES
dc.title	Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.1109/ACCESS.2020.2991800
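
The abstract describes tracking accuracy while repeatedly discarding half of the samples. The following is a minimal sketch of that idea only, not the authors' Spark-Package: it assumes a pure-Python nearest-centroid classifier in place of a decision tree, synthetic two-class data, and hypothetical names (`nearest_centroid_accuracy`, `halving_progression`, `make_blobs`). Two halvings correspond to discarding 75% of the samples.

```python
import random


def nearest_centroid_accuracy(train, test):
    """Fit one centroid per class on train and return accuracy on test.

    Assumption: a nearest-centroid rule stands in for the decision tree
    mentioned in the abstract, to keep the sketch dependency-free.
    """
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(features):
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(features, centroids[lab])),
        )

    hits = sum(1 for features, label in test if predict(features) == label)
    return hits / len(test)


def halving_progression(train, test, steps=3, seed=0):
    """Return [(training size, accuracy)], halving the training set each step."""
    rng = random.Random(seed)
    current = list(train)
    results = [(len(current), nearest_centroid_accuracy(current, test))]
    for _ in range(steps):
        # Randomly keep half of the remaining samples.
        current = rng.sample(current, len(current) // 2)
        results.append((len(current), nearest_centroid_accuracy(current, test)))
    return results


def make_blobs(n_per_class, rng):
    """Hypothetical synthetic data: two well-separated 2-D Gaussian classes."""
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (4.0, 4.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    return data


if __name__ == "__main__":
    rng = random.Random(42)
    train = make_blobs(2000, rng)
    test = make_blobs(500, rng)
    for size, acc in halving_progression(train, test):
        print(f"{size:5d} samples -> accuracy {acc:.3f}")
```

If the accuracy stays nearly flat as the training set shrinks, the dataset is likely redundant in the sense the abstract describes; a sharp drop would suggest that the discarded samples carried information.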

Files in this item

Name: 09083972.pdf
Size: 1.215 MB
Format: PDF

This item appears in the following collection(s)

DCCIA - Artículos


Except where otherwise noted, this item's license is described as Atribución 3.0 España