The impact of heterogeneous distance functions on missing data imputation and classification performance

Seoane Santos, Miriam; Henriques Abreu, Pedro; Fernández Hilario, Alberto Luis; Luengo Martín, Julián

doi:10.1016/j.engappai.2022.104791

dc.contributor.author	Seoane Santos, Miriam
dc.contributor.author	Henriques Abreu, Pedro
dc.contributor.author	Fernández Hilario, Alberto Luis
dc.contributor.author	Luengo Martín, Julián
dc.date.accessioned	2025-01-29T13:12:35Z
dc.date.available	2025-01-29T13:12:35Z
dc.date.issued	2022-03-24
dc.identifier.citation	Engineering Applications of Artificial Intelligence Volume 111, 104791	es_ES
dc.identifier.uri	https://hdl.handle.net/10481/100996
dc.description.abstract	This work performs an in-depth study of the impact of distance functions on K-Nearest Neighbours imputation of heterogeneous datasets. Missing data is generated at several percentages, on a large benchmark of 150 datasets (50 continuous, 50 categorical and 50 heterogeneous datasets) and data imputation is performed using different distance functions (HEOM, HEOM-R, HVDM, HVDM-R, HVDM-S, MDE and SIMDIST) and k values (1, 3, 5 and 7). The impact of distance functions on kNN imputation is then evaluated in terms of classification performance, through the analysis of a classifier learned from the imputed data, and in terms of imputation quality, where the quality of the reconstruction of the original values is assessed. By analysing the properties of heterogeneous distance functions over continuous and categorical datasets individually, we then study their behaviour over heterogeneous data. We discuss whether datasets with different natures may benefit from different distance functions and to what extent the component of a distance function that deals with missing values influences such choice. Our experiments show that missing data has a significant impact on distance computation and the obtained results provide guidelines on how to choose appropriate distance functions depending on data characteristics (continuous, categorical or heterogeneous datasets) and the objective of the study (classification or imputation tasks).	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Elsevier	es_ES
dc.title	The impact of heterogeneous distance functions on missing data imputation and classification performance	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.1016/j.engappai.2022.104791
dc.type.hasVersion	AM	es_ES

Files in this item

Name: 1-s2.0-S0952197622000707-main.pdf
Size: 860.2Kb
Format: PDF

This item appears in the following collection(s)

DCCIA - Artículos
