KSUFS: A novel unsupervised feature selection method based on statistical tests for standard and big data problems

Sáez Muñoz, José Antonio; Corchado, Emilio

doi:10.1109/ACCESS.2019.2930355

2019-IEEEAccess-Saez03.pdf (3.760Mb)

Identifiers

URI: https://hdl.handle.net/10481/99062

DOI: 10.1109/ACCESS.2019.2930355

Publisher

IEEE

Subject

big data

clustering

feature selection

statistical tests

unsupervised learning

Date

2019

Bibliographic reference

José A. Sáez; Emilio Corchado. KSUFS: A novel unsupervised feature selection method based on statistical tests for standard and big data problems. IEEE Access, 7, 99754-99770. 2019. doi: 10.1109/ACCESS.2019.2930355

Abstract

The typical inaccuracy of data gathering and preparation procedures makes erroneous and unnecessary information a common issue in real-world applications. In this context, feature selection methods are used to reduce the harmful impact of such information on data analysis by removing irrelevant features from datasets. This research presents a novel feature selection method for unsupervised learning, where the difficulty arises from the fact that class labels cannot be used to select the most discriminative features, as is traditionally done in supervised learning. The proposed technique, called Kolmogorov-Smirnov test-based Unsupervised Feature Selection (KSUFS), computes estimated feature distributions that are then compared to the original ones using non-parametric statistical tests to identify the most representative input variables. Two versions of KSUFS are presented in this study: one is designed for standard data, in which the accuracy of the method prevails over its other aspects; the other is designed for big data problems, in which the computational complexity is improved in view of the characteristics of this type of dataset. KSUFS is compared favorably with other state-of-the-art unsupervised feature selection techniques in a thorough experimental study that considers both standard and big data problems. The results show that the proposed method outperforms the other reference unsupervised feature selection methods considered in the comparisons, selecting the most influential features first for standard datasets and standing out particularly on big data problems.
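
The comparison step named in the abstract can be illustrated with the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between two empirical distributions. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how KSUFS builds the estimated feature distributions or how the test results are turned into a selection, so the `estimated_cols` values are made-up placeholders and the names `ks_statistic` and `ks_per_feature` are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(original, estimated):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples."""
    if not original or not estimated:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(original), sorted(estimated)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every value of the pooled sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_per_feature(original_cols, estimated_cols):
    """Hypothetical helper: one KS statistic per feature (column)."""
    return [ks_statistic(o, e) for o, e in zip(original_cols, estimated_cols)]


# Toy data (made up): feature 0 is reproduced exactly, feature 1 is shifted.
# How the estimates are obtained is an assumption left outside this sketch.
original_cols = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
estimated_cols = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
stats = ks_per_feature(original_cols, estimated_cols)
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all. How such values would be used to rank or select features is left open here, since the abstract does not specify it.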

Collections

DEIO - Articles

Except where otherwise noted, this item's license is described as Attribution 4.0 International