Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Torres Martos, Álvaro; Bustos Aibar, Mireia; Ramírez Mena, Alberto; Cámara Sánchez, Sofía; Anguita Ruiz, Augusto; Alcalá Fernández, Rafael; Aguilera García, Concepción María; Alcalá Fernández, Jesús

doi:10.3390/genes14020248

dc.contributor.author	Torres Martos, Álvaro
dc.contributor.author	Bustos Aibar, Mireia
dc.contributor.author	Ramírez Mena, Alberto
dc.contributor.author	Cámara Sánchez, Sofía
dc.contributor.author	Anguita Ruiz, Augusto
dc.contributor.author	Alcalá Fernández, Rafael
dc.contributor.author	Aguilera García, Concepción María
dc.contributor.author	Alcalá Fernández, Jesús
dc.date.accessioned	2023-03-28T06:40:24Z
dc.date.available	2023-03-28T06:40:24Z
dc.date.issued	2023-01-18
dc.identifier.citation	Torres-Martos, Á... [et al.]. Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity. Genes 2023, 14, 248. [https://doi.org/10.3390/genes14020248]	es_ES
dc.identifier.uri	https://hdl.handle.net/10481/80881
dc.description.abstract	The use of machine learning techniques for the construction of predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the last few years in the biomedical field. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms as well as the appropriate preprocessing and management of input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data. As such, a series of best practices and recommendations are also presented for each of the steps defined. In particular, the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for the study of disease development prediction using machine learning are described. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define the proposals for model improvement based on the results found, which serve as the bases for future work.	es_ES
dc.description.sponsorship	ERDF/Regional Government of Andalusia/Ministry of Economic Transformation, Industry, Knowledge, and Universities P18-RT-2248 B-CTS-536-UGR20	es_ES
dc.description.sponsorship	ERDF/Health Institute Carlos III/Spanish Ministry of Science, Innovation PI20/00711	es_ES
dc.language.iso	eng	es_ES
dc.publisher	MDPI	es_ES
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject	Machine learning	es_ES
dc.subject	Omics	es_ES
dc.subject	Data pre-processing	es_ES
dc.title	Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity	es_ES
dc.type	journal article	es_ES
dc.rights.accessRights	open access	es_ES
dc.identifier.doi	10.3390/genes14020248
dc.type.hasVersion	VoR	es_ES

Files in this item

Name: genes-14-00248-v2.pdf
Size: 1.485Mb
Format: PDF

This item appears in the following Collection(s)

INTA - Artículos

Except where otherwise noted, this item's license is described as Attribution 4.0 International