Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity
Metadata
Author
Torres Martos, Álvaro; Bustos Aibar, Mireia; Ramírez Mena, Alberto; Cámara Sánchez, Sofía; Anguita Ruiz, Augusto; Alcalá Fernández, Rafael; Aguilera García, Concepción María; Alcalá Fernández, Jesús
Publisher
MDPI
Subject
Machine learning; Omics; Data pre-processing
Date
2023-01-18
Bibliographic reference
Torres-Martos, Á... [et al.]. Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity. Genes 2023, 14, 248. [https://doi.org/10.3390/genes14020248]
Sponsor
ERDF/Regional Government of Andalusia/Ministry of Economic Transformation, Industry, Knowledge, and Universities (P18-RT-2248, B-CTS-536-UGR20); ERDF/Health Institute Carlos III/Spanish Ministry of Science and Innovation (PI20/00711)
Abstract
The use of machine learning techniques for the construction of predictive models of disease
outcomes (based on omics and other types of molecular data) has gained enormous relevance in
the last few years in the biomedical field. Nonetheless, the value of omics studies and machine
learning tools depends on the proper application of algorithms as well as the appropriate preprocessing
and management of input omics and molecular data. Currently, many of the available
approaches that use machine learning on omics data for predictive purposes make mistakes in
several of the following key steps: experimental design, feature selection, data preprocessing,
and algorithm selection. For this reason, we present this work as a guideline on how to
confront the main challenges inherent to multi-omics human data. As such, a series of best practices
and recommendations are also presented for each of the steps defined. In particular, we describe the main
characteristics of each omics data layer, the most suitable preprocessing approaches for each source,
and a compilation of best practices and tips for predicting disease development using
machine learning. Using examples of real data, we show how to address the key
problems encountered in multi-omics research (e.g., biological heterogeneity, technical noise, high
dimensionality, presence of missing values, and class imbalance). Finally, we propose
model improvements based on the results obtained, which serve as a basis for future work.
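
As an illustration of two of the problems named above (missing values and class imbalance), the following is a minimal sketch and not the authors' pipeline. It assumes median imputation and inverse-frequency class weighting as stand-in techniques. The function names (fit_medians, impute, balanced_class_weights) and the toy values are hypothetical and do not come from the article. Imputation statistics are estimated on the training samples only and then reused on the test samples, so that no information leaks from the test set.

```python
from collections import Counter
from statistics import median


def fit_medians(rows):
    """Per-feature medians computed from the observed (non-None) training values."""
    n_features = len(rows[0])
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a feature has no observed value at all (assumption).
        medians.append(median(observed) if observed else 0.0)
    return medians


def impute(rows, medians):
    """Replace each missing value (None) with the training median of its feature."""
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy data, made up for illustration: 2 features, None marks a missing value.
train_X = [[1.0, None], [2.0, 4.0], [None, 6.0], [4.0, 8.0]]
train_y = ["control", "control", "control", "case"]
test_X = [[None, None]]

medians = fit_medians(train_X)            # statistics come from training data only
train_imputed = impute(train_X, medians)
test_imputed = impute(test_X, medians)    # test set reuses the training medians
weights = balanced_class_weights(train_y)  # minority class gets the larger weight
```

The resulting weights could be passed to any classifier that accepts per-class weights. Which imputation and rebalancing strategies suit a given omics layer is a separate question that this sketch does not settle.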