dc.contributor.author: Maillo Hidalgo, Jesús
dc.contributor.author: Triguero, Isaac
dc.contributor.author: Herrera Triguero, Francisco
dc.date.accessioned: 2020-06-29T11:56:43Z
dc.date.available: 2020-06-29T11:56:43Z
dc.date.issued: 2020-05
dc.identifier.citation: Maillo, J., Triguero, I., & Herrera, F. (2020). Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data. IEEE Access, 8, 87918-87928. [DOI: 10.1109/ACCESS.2020.2991800]
dc.identifier.uri: http://hdl.handle.net/10481/62787
dc.description: P. J. Maillo holds an FPU scholarship from the Spanish Ministry of Education.
dc.description.abstract: The importance of knowing the descriptive properties of a dataset when tackling a data science problem is widely recognized. Having information about the redundancy, complexity and density of a problem allows us to decide which data preprocessing and machine learning techniques are most suitable. In classification problems, there are multiple metrics to describe the overlap of features between classes, class imbalance or separability, among others. However, these metrics may not scale well to big datasets, or may simply not be sufficiently informative in this context. In this paper, we provide a package of metrics for big data classification problems. In particular, we propose two new big data metrics, Neighborhood Density and Decision Tree Progression, which study density and accuracy progression by discarding half of the samples. In addition, we enable a number of basic metrics to handle big data. The experimental study carried out on standard big data classification problems shows that our metrics can quickly characterize big datasets. We identified a clear redundancy of information in most datasets, such that randomly discarding 75% of the samples does not drastically affect the accuracy of the classifiers used. Thus, the proposed big data metrics, which are available as a Spark-Package, provide a fast assessment of the shape of a classification dataset prior to applying big data preprocessing, toward smart data.
dc.description.sponsorship: This work was supported by the Spanish National Research Project under Grant TIN2017-89517.
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.rights: Attribution 3.0 Spain
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/
dc.subject: Big data
dc.subject: Smart Data
dc.subject: Classification
dc.subject: Redundancy
dc.subject: Complexity
dc.subject: Apache Spark
dc.title: Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data
dc.type: info:eu-repo/semantics/article
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.identifier.doi: 10.1109/ACCESS.2020.2991800


Except where otherwise noted, this item's license is described as Attribution 3.0 Spain