Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data
Metadata
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Subject
Big data; Smart data; Classification; Redundancy; Complexity; Apache Spark
Date
2020-05
Bibliographic reference
Maillo, J., Triguero, I., & Herrera, F. (2020). Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data. IEEE Access, 8, 87918-87928. [DOI: 10.1109/ACCESS.2020.2991800]
Sponsorship
This work was supported by the Spanish National Research Project under Grant TIN2017-89517.
Abstract
Knowing the descriptive properties of a dataset is recognized as important when tackling
a data science problem. Having information about the redundancy, complexity and density of a problem
allows us to make decisions as to which data preprocessing and machine learning techniques are most
suitable. In classification problems, there are multiple metrics to describe the overlapping of the features
between classes, class imbalances or separability, among others. However, these metrics may not scale up
well when dealing with big datasets, or may simply not be sufficiently informative in this context. In this
paper, we provide a package of metrics for big data classification problems. In particular, we propose two new
big data metrics: Neighborhood Density and Decision Tree Progression, which study density and accuracy
progression by discarding half of the samples. In addition, we enable a number of basic metrics to handle big
data. The experimental study carried out on standard big data classification problems shows that our metrics
can quickly characterize big datasets. We identified a clear redundancy of information in most datasets,
such that randomly discarding 75% of the samples does not drastically affect the accuracy of the classifiers
used. Thus, the proposed big data metrics, which are available as a Spark-Package, provide a fast assessment
of the shape of a classification dataset prior to applying big data preprocessing, toward smart data.
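
The idea of tracking accuracy while repeatedly discarding half of the samples can be illustrated with a minimal sketch. This is not the authors' Spark package or their Decision Tree Progression metric: it assumes synthetic two-class data and swaps in a nearest-centroid classifier as a stand-in for a decision tree, and every function name below (make_data, fit_centroids, halving_progression) is hypothetical. Two halvings leave 25% of the training set, which matches the 75% discard mentioned in the abstract.

```python
import random


def make_data(n, seed=0):
    """Synthetic two-class 2-D data: two Gaussian blobs (illustrative assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 2.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def fit_centroids(train):
    """Stand-in classifier (NOT a decision tree): one mean point per class."""
    sums, counts = {}, {}
    for (x, y), label in train:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict(centroids, point):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def accuracy(centroids, test):
    """Fraction of test samples classified correctly."""
    hits = sum(1 for point, label in test if predict(centroids, point) == label)
    return hits / len(test)


def halving_progression(train, test, steps=2, seed=0):
    """Return (fraction of training data kept, test accuracy) after each halving."""
    rng = random.Random(seed)
    subset = list(train)
    results = []
    for _ in range(steps + 1):
        fraction = len(subset) / len(train)
        results.append((fraction, accuracy(fit_centroids(subset), test)))
        # Randomly discard half of the remaining training samples.
        subset = rng.sample(subset, len(subset) // 2)
    return results


if __name__ == "__main__":
    train = make_data(4000, seed=1)
    test = make_data(1000, seed=2)
    for fraction, acc in halving_progression(train, test):
        print(f"kept {fraction:.0%} of training data: accuracy {acc:.3f}")
```

If the accuracy stays roughly flat as the kept fraction shrinks, the training set is likely redundant for that classifier; a sharp drop would suggest the discarded samples carried information.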