Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark

Triguero, Isaac; Galar, Mikel; Merino, D.; Maillo Hidalgo, Jesús; Bustince, Humberto; Herrera Triguero, Francisco

doi:10.1109/CEC.2016.7743853

dc.contributor.author	Triguero, Isaac
dc.contributor.author	Galar, Mikel
dc.contributor.author	Merino, D.
dc.contributor.author	Maillo Hidalgo, Jesús
dc.contributor.author	Bustince, Humberto
dc.contributor.author	Herrera Triguero, Francisco
dc.date.accessioned	2020-12-23T12:50:24Z
dc.date.available	2020-12-23T12:50:24Z
dc.date.issued	2016
dc.identifier.citation	Published version: I. Triguero, M. Galar, D. Merino, J. Maillo, H. Bustince and F. Herrera, "Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark," 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, BC, 2016, pp. 640-647, doi: 10.1109/CEC.2016.7743853.	es_ES
dc.identifier.uri	http://hdl.handle.net/10481/65146
dc.description	This work was supported by the Research Projects TIN2011-28488, TIN2013-40765-P, P10-TIC-6858 and P11-TIC-7765. I. Triguero holds a BOF postdoctoral fellowship from Ghent University.	es_ES
dc.description.abstract	The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to this issue. The task becomes even more difficult when the number of majority class examples is very large: the evolutionary model becomes impractical because of memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing the data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. To address this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, which allows us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model on several datasets with up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.	es_ES
dc.description.sponsorship	TIN2011-28488	es_ES
dc.description.sponsorship	TIN2013-40765-P	es_ES
dc.description.sponsorship	P10-TIC-6858	es_ES
dc.description.sponsorship	P11-TIC-7765	es_ES
dc.language.iso	eng	es_ES
dc.publisher	IEEE	es_ES
dc.rights	Atribución-NoComercial-SinDerivadas 3.0 España	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.title	Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark	es_ES
dc.type	conference output	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.1109/CEC.2016.7743853
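
The abstract's key idea, handling majority and minority class examples independently so that each subset keeps more minority examples, can be illustrated with a minimal sketch. This is not the authors' Spark implementation: it is plain Python, the function name split_majority_keep_minority is hypothetical, and it assumes that every subset receives the whole minority class, which is only one possible way to realize the idea. The evolutionary undersampling step that would then run on each subset is omitted.

```python
import random


def split_majority_keep_minority(labels, n_subsets, minority_label, seed=0):
    """Return n_subsets lists of row indices (illustrative helper).

    Majority rows are shuffled and dealt into disjoint chunks.
    ASSUMPTION: every subset also receives ALL minority rows; the
    abstract only says that more minority examples are kept per subset.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    rng.shuffle(majority)
    # Deal majority indices round-robin so chunk sizes differ by at most one.
    chunks = [majority[k::n_subsets] for k in range(n_subsets)]
    return [chunk + minority for chunk in chunks]


# Toy data: 1000 majority rows (label 0) and 10 minority rows (label 1).
labels = [0] * 1000 + [1] * 10
subsets = split_majority_keep_minority(labels, n_subsets=4, minority_label=1)
for subset in subsets:
    n_minority = sum(labels[i] == 1 for i in subset)
    print(len(subset), n_minority)
```

With a plain random split of all rows into four subsets, each subset would hold only two or three of the ten minority rows on average; here every subset keeps all ten, which is the density problem the abstract describes for extremely imbalanced data.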

Files in this item

Name: EUSspark.pdf
Size: 1.348 MB
Format: PDF

This item appears in the following collection(s)

DCCIA - Comunicaciones Congresos, Conferencias, ...

Except where otherwise noted, this item's license is described as Atribución-NoComercial-SinDerivadas 3.0 España