ROSEFW-RF: The winner algorithm for the ECBDL’14 Big Data Competition: An extremely imbalanced big data bioinformatics problem

Triguero, Isaac; Del Río, Sara; López, Victoria; Bacardit, Jaume; Benítez Sánchez, José Manuel; Herrera Triguero, Francisco

doi:10.1016/j.knosys.2015.05.027

dc.contributor.author	Triguero, Isaac
dc.contributor.author	Del Río, Sara
dc.contributor.author	López, Victoria
dc.contributor.author	Bacardit, Jaume
dc.contributor.author	Benítez Sánchez, José Manuel
dc.contributor.author	Herrera Triguero, Francisco
dc.date.accessioned	2021-01-21T09:04:14Z
dc.date.available	2021-01-21T09:04:14Z
dc.date.issued	2017-09-04
dc.identifier.citation	Publisher version: Triguero, I., del Río, S., López, V., Bacardit, J., Benítez, J. M., & Herrera, F. (2015). ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, 87, 69-79. [https://doi.org/10.1016/j.knosys.2015.05.027]	es_ES
dc.identifier.uri	http://hdl.handle.net/10481/65880
dc.description	Supported by the Research Projects TIN2014-57251-P, P10-TIC-6858, P12-TIC-2958, TIN2013-47210-P and P11-TIC-7765. I. Triguero holds a BOF postdoctoral fellowship from the Ghent University.	es_ES
dc.description.abstract	The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL’14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition, presenting an extensive experimental study that characterizes how our methodology works. From this analysis we conclude that this approach is very suitable for tackling large-scale bioinformatics classification problems.	es_ES
dc.description.sponsorship	Ghent University	es_ES
dc.description.sponsorship	TIN2014-57251-P	es_ES
dc.description.sponsorship	P10-TIC-6858	es_ES
dc.description.sponsorship	P12-TIC-2958	es_ES
dc.description.sponsorship	TIN2013-47210-P	es_ES
dc.description.sponsorship	P11-TIC-7765	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Elsevier	es_ES
dc.rights	Atribución-NoComercial-SinDerivadas 3.0 España	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject	Bioinformatics	es_ES
dc.subject	Big Data	es_ES
dc.subject	Hadoop	es_ES
dc.subject	MapReduce	es_ES
dc.subject	Imbalance classification	es_ES
dc.subject	Evolutionary feature selection	es_ES
dc.title	ROSEFW-RF: The winner algorithm for the ECBDL’14 Big Data Competition: An extremely imbalanced big data bioinformatics problem	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.1016/j.knosys.2015.05.027
dc.type.hasVersion	SMUR	es_ES
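
Steps (1) and (2) of the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' MapReduce implementation: the function names, the `ratio` parameter, and the feature weights are hypothetical, and the weights are supplied by hand here rather than learned by an evolutionary process. The sketch only shows what random oversampling and threshold-based feature selection do to a toy data set.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, ratio=1.0, seed=0):
    """Duplicate randomly chosen minority rows until the minority count
    reaches ratio * majority count (ratio is an assumed parameter)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority:
        raise ValueError("no examples of the minority class")
    majority_count = len(labels) - len(minority)
    target = int(ratio * majority_count)
    # Sample with replacement from the existing minority examples.
    extra = [rng.choice(minority) for _ in range(max(0, target - len(minority)))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


def select_features(weights, threshold):
    """Return the indices of features whose weight is at least threshold."""
    return [j for j, w in enumerate(weights) if w >= threshold]


def project(rows, keep):
    """Keep only the selected feature columns of every row."""
    return [[row[j] for j in keep] for row in rows]


# Toy data: three negative examples and one positive example.
rows = [[0.1, 5.0, 1.0], [0.2, 4.0, 0.0], [0.3, 6.0, 1.0], [0.9, 5.5, 0.0]]
labels = [0, 0, 0, 1]

bal_rows, bal_labels = random_oversample(rows, labels, minority_label=1)
weights = [0.8, 0.1, 0.5]  # hypothetical weights, not taken from the paper
keep = select_features(weights, threshold=0.5)
reduced = project(bal_rows, keep)
print(Counter(bal_labels), keep)
```

The balanced, reduced data would then be passed to a Random Forest learner for steps (3) and (4), which this sketch leaves out.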

Files in this document

Name: Triguero-et-al-cleanVersion.pdf
Size: 2.589 MB
Format: PDF

This document appears in the following collection(s)

DCCIA - Artículos

Except where otherwise noted, this document's license is described as Atribución-NoComercial-SinDerivadas 3.0 España