dc.contributor.author: Triguero, Isaac
dc.contributor.author: Del Río, Sara
dc.contributor.author: López, Victoria
dc.contributor.author: Bacardit, Jaume
dc.contributor.author: Benítez Sánchez, José Manuel
dc.contributor.author: Herrera Triguero, Francisco
dc.date.accessioned: 2021-01-21T09:04:14Z
dc.date.available: 2021-01-21T09:04:14Z
dc.date.issued: 2017-09-04
dc.identifier.citation: Publisher version: Triguero, I., del Río, S., López, V., Bacardit, J., Benítez, J. M., & Herrera, F. (2015). ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, 87, 69-79. [https://doi.org/10.1016/j.knosys.2015.05.027] (es_ES)
dc.identifier.uri: http://hdl.handle.net/10481/65880
dc.description: Supported by the Research Projects TIN2014-57251-P, P10-TIC-6858, P12-TIC-2958, TIN2013-47210-P and P11-TIC-7765. I. Triguero holds a BOF postdoctoral fellowship from Ghent University. (es_ES)
dc.description.abstract: The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that must then be processed. Moreover, in many of these problems, such as contact map prediction (the problem tackled in this paper), it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL’14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition, presenting an extensive experimental study that characterizes how our methodology works. From this analysis we conclude that this approach is very suitable for tackling large-scale bioinformatics classification problems. (es_ES)
dc.description.sponsorship: Ghent University (es_ES)
dc.description.sponsorship: TIN2014-57251-P (es_ES)
dc.description.sponsorship: P10-TIC-6858 (es_ES)
dc.description.sponsorship: P12-TIC-2958 (es_ES)
dc.description.sponsorship: TIN2013-47210-P (es_ES)
dc.description.sponsorship: P11-TIC-7765 (es_ES)
dc.language.iso: eng (es_ES)
dc.publisher: Elsevier (es_ES)
dc.rights: Atribución-NoComercial-SinDerivadas 3.0 España (*)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/ (*)
dc.subject: Bioinformatics (es_ES)
dc.subject: Big Data (es_ES)
dc.subject: Hadoop (es_ES)
dc.subject: MapReduce (es_ES)
dc.subject: Imbalance classification (es_ES)
dc.subject: Evolutionary feature selection (es_ES)
dc.title: ROSEFW-RF: The winner algorithm for the ECBDL’14 Big Data Competition: An extremely imbalanced big data bioinformatics problem (es_ES)
dc.type: info:eu-repo/semantics/article (es_ES)
dc.rights.accessRights: info:eu-repo/semantics/openAccess (es_ES)
dc.identifier.doi: 10.1016/j.knosys.2015.05.027
dc.type.hasVersion: info:eu-repo/semantics/submittedVersion (es_ES)
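
The first two steps named in the abstract (random oversampling, then feature selection by thresholding feature weights) can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation. The function names `random_oversample` and `select_features`, the toy data, the weight values, the 0.5 threshold and the choice to oversample until both classes are the same size are all assumptions made for illustration. The evolutionary search that would produce the weights, and the Random Forest training and classification steps, are only indicated in comments.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Replicate randomly chosen minority rows until both classes are the same size.

    Balancing to a 1:1 ratio is an assumption of this sketch.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx:
        raise ValueError("no minority examples to replicate")
    deficit = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    keep = list(range(len(rows))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def select_features(rows, weights, threshold):
    """Keep only the columns whose weight is at least `threshold`."""
    kept = [j for j, w in enumerate(weights) if w >= threshold]
    return [[row[j] for j in kept] for row in rows], kept


# Toy data: three negative examples and one positive example.
rows = [[0.1, 5.0, 1.0], [0.2, 4.0, 0.0], [0.3, 6.0, 1.0], [0.9, 5.5, 0.0]]
labels = [0, 0, 0, 1]

# Hypothetical feature weights; in the described method they would come
# from an evolutionary feature weighting process, which is not shown here.
weights = [0.8, 0.1, 0.6]

# Step (1): balance the classes by random oversampling.
balanced_rows, balanced_labels = random_oversample(rows, labels)

# Step (2): keep the features whose weight passes the threshold.
reduced_rows, kept = select_features(balanced_rows, weights, threshold=0.5)

# Steps (3) and (4) would train a Random Forest on
# (reduced_rows, balanced_labels) and classify the test data.
print(balanced_labels.count(0), balanced_labels.count(1), kept)  # 3 3 [0, 2]
```

On the toy data the single positive example is replicated twice, giving three examples per class, and the low-weight second feature is dropped.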


Except where otherwise noted, this item's license is described as Atribución-NoComercial-SinDerivadas 3.0 España