KSUFS: A novel unsupervised feature selection method based on statistical tests for standard and big data problems
Metadata
Publisher
IEEE
Subject
big data; clustering; feature selection; statistical tests; unsupervised learning
Date
2019
Bibliographic reference
José A. Sáez; Emilio Corchado. KSUFS: A novel unsupervised feature selection method based on statistical tests for standard and big data problems. IEEE Access, 7, 99754-99770. 2019. doi: 10.1109/ACCESS.2019.2930355
Abstract
The typical inaccuracy of data gathering and preparation procedures makes erroneous and unnecessary information a common issue in real-world applications. In this context, feature selection methods are used to reduce the harmful impact of such information on data analysis by removing irrelevant features from datasets. This research presents a novel feature selection method in the field of unsupervised learning, where the difficulty arises from the fact that class labels cannot be used to select the most discriminative features, as is traditionally done in supervised learning. The proposed technique, called Kolmogorov-Smirnov test-based Unsupervised Feature Selection (KSUFS), computes estimated feature distributions and then compares them to the original ones using non-parametric statistical tests in order to identify the most representative input variables. Two versions of KSUFS are presented in this study: one is designed for standard data, in which the accuracy of the method takes precedence over its other aspects; the other is designed for big data problems, in which the computational complexity is reduced to suit the characteristics of this type of dataset. KSUFS is compared with other state-of-the-art unsupervised feature selection techniques in a thorough experimental study that considers both standard and big data problems. The results show that the proposed method outperforms the other reference unsupervised feature selection methods considered in the comparisons, selecting the most influential features first for standard datasets and standing out in particular on big data problems.
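As a rough illustration of the comparison step mentioned in the abstract, the sketch below computes the two-sample Kolmogorov-Smirnov statistic between an original feature and an estimated version of it. The abstract does not say how KSUFS builds the estimated distributions or how it turns test results into a ranking, so the function names (`ks_statistic`, `ks_per_feature`) and the toy data are hypothetical and serve only to show the statistic itself, not the authors' algorithm.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of
    samples a and b (0.0 = identical ECDFs, 1.0 = fully separated).
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDFs are step functions, so the maximum gap is reached
    # at one of the observed data points.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_per_feature(original, estimated):
    """Hypothetical helper: KS statistic for each feature name.

    `original` and `estimated` map feature names to lists of values.
    How the estimates are produced is NOT specified in the abstract.
    """
    return {name: ks_statistic(original[name], estimated[name])
            for name in original}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    original = {"f1": [1, 2, 3, 4], "f2": [1, 2, 3, 4]}
    estimated = {"f1": [1, 2, 3, 4], "f2": [3, 4, 5, 6]}
    print(ks_per_feature(original, estimated))
```

In practice, a library routine such as `scipy.stats.ks_2samp` would also return a p-value; the pure-Python version here keeps the sketch dependency-free.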