Development of advanced machine learning models for the fusion of heterogeneous biological sources in clinical decision support systems for cancer

Carrillo Pérez, Francisco

Identifiers

URI: https://hdl.handle.net/10481/79672

ISBN: 9788411176682

Publisher

Universidad de Granada

Supervisors

Herrera Maldonado, Luis Javier; Rojas Ruiz, Ignacio

Department

Universidad de Granada. Programa de Doctorado en Tecnologías de la Información y la Comunicación

Date

2023

Defense date

2023-01-27

Bibliographic reference

Carrillo Pérez, Francisco. Development of advanced machine learning models for the fusion of heterogeneous biological sources in clinical decision support systems for cancer. Granada: Universidad de Granada, 2023. [https://hdl.handle.net/10481/79672]

Sponsor

Tesis Univ. Granada.; ref. RTI2018-101674-B-I00) from the Spanish Ministry of Universities

Abstract

Cancer is one of the leading causes of death worldwide, second only to cardiovascular diseases. Early diagnosis is key for the patient's prognosis, since it allows the most suitable treatment to be applied. To that end, multiple screenings are routinely performed on the patient, involving, for instance, the visual examination of histopathological slides, the analysis of the clinical history, or the search for alterations in the patient's gene expression. These examinations, however, are usually time-consuming, and physicians do not always have the experience needed to analyze them. To help with these tasks, clinical decision support systems have been created in recent years using advances in the machine learning field. Machine learning models are able to learn automatically from these data and to find insights that help them solve a specific task. This is part of the precision medicine field, where a data-driven approach is used to tailor the diagnosis, treatment, and other clinical outcomes to the specific characteristics of the patient. Thanks to the advances in this field, more heterogeneous sources of biological information are being gathered, and they provide diverse features that can help to accurately diagnose a cancer patient. This makes it possible to create systems that use all the available information and thus accurately model the patient's disease. This would be similar to having a separate diagnosis per data modality from a group of expert clinicians, where the final diagnosis is based on each clinician's analysis of their own source of expertise. Unfortunately, not all these sources are always available, which limits the potential of multi-modal machine learning models. In this thesis, we explore the improvements that multi-modal machine learning models resilient to missing modalities can achieve over single-modality ones in the area of cancer diagnosis.
First, we tackled the problem of lung cancer subtype diagnosis using two of the most widely used biomedical modalities in the literature (gene expression and histopathology images), showing the improvements that can be obtained by fusing these two modalities rather than using them independently. Next, to study the limits that can be reached by fusing heterogeneous biological sources, we added three new modalities to the proposed problem (micro-RNA, DNA methylation values, and the copy number variation of the genes). We tested which modalities complemented each other, and what performance can be obtained by fusing all these modalities in a single classification model. Lastly, we approached the problem of data scarcity in biomedical multi-modal problems, presenting advanced methodologies for biological data generation. Inspired by recent advances in multi-modal generative models for natural images, we focused on generating one modality from a paired one (the RNA-to-image synthesis problem) for healthy tissues. We showed that the synthetically generated data were similar to the real samples and that the model was able to impute missing modalities.
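
The idea of combining one prediction per modality while tolerating missing modalities can be illustrated with a minimal late-fusion sketch. This is an assumption for illustration only: the abstract does not state which fusion strategy the thesis uses, and the function name `fuse_available_modalities`, the modality keys, the weights, and the example probabilities are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def fuse_available_modalities(
    probs_by_modality: Dict[str, Optional[Sequence[float]]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-modality class probabilities.

    Illustrative late-fusion rule (an assumption, not the thesis method):
    a modality whose value is None is treated as missing and skipped, and
    the remaining weights are renormalized so the output still sums to 1.
    """
    # Keep only the modalities that were actually observed for this patient.
    available = {m: p for m, p in probs_by_modality.items() if p is not None}
    if not available:
        raise ValueError("at least one modality must be available")

    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    total_weight = 0.0

    for modality, probs in available.items():
        if len(probs) != n_classes:
            raise ValueError(f"class count mismatch for modality {modality!r}")
        # Default to equal weighting when no weight is given for a modality.
        w = 1.0 if weights is None else weights.get(modality, 1.0)
        total_weight += w
        for i, p in enumerate(probs):
            fused[i] += w * p

    if total_weight <= 0:
        raise ValueError("weights of the available modalities must sum to > 0")
    return [value / total_weight for value in fused]


# Hypothetical patient: the histology slide is missing, two modalities remain.
patient = {
    "gene_expression": [0.2, 0.7, 0.1],
    "histology": None,
    "mirna": [0.4, 0.4, 0.2],
}
fused = fuse_available_modalities(patient)  # approximately [0.3, 0.55, 0.15]
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Renormalizing over the observed modalities keeps the fused output a valid probability distribution regardless of how many sources are present, which is one simple way to make a multi-modal classifier degrade gracefully to a single-modality one.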

Cancer is one of the leading causes of death worldwide, second only to cardiovascular diseases. Early diagnosis is crucial for improving the patient's life expectancy, since it allows a more effective treatment, better suited to the patient's condition, to be provided. To reach this diagnosis, multiple medical tests are routinely performed on a patient. These include the visual inspection of histological images, the analysis of the clinical history, and the search for alterations in the patient's gene expression. However, these tests are quite time-consuming, and not all hospitals are equipped with the material needed to perform them. To help physicians with these analysis tasks, and thanks to advances in the field of machine learning, diagnostic support systems have been created in recent years. Machine learning algorithms are able to learn automatically from these data and to find patterns that help them solve a specific task. This is part of the area of precision medicine, in which, following a data-driven methodology, a more robust diagnosis can be offered or a more effective treatment chosen based on, among other things, the patient's genetic characteristics or medical history. Thanks in part to advances in this area, more and more heterogeneous sources of biological information are being collected, and these provide important biological information that can help when diagnosing a patient. This opens up the possibility of creating systems that use all this information and thus better describe the patient's pathology. This is similar to taking into account the opinions of different specialists when making a diagnosis, where each of them relies on a different data source.
Unfortunately, not all sources of information are always available, which limits the creation of multi-modal machine learning algorithms. In this thesis, we explore the improvements that can be obtained by using multi-modal machine learning algorithms in comparison with those that use a single modality. First, we use the two data types most widely used in the literature (gene expression and histological images) to diagnose the different subtypes of lung cancer, showing the improvements that can be obtained by using these two modalities together rather than separately. Next, to study the limits that can be reached by integrating heterogeneous biological sources, we add three further modalities (micro-RNA, DNA methylation data, and information on the copy number variation of the genes) for the same problem. We test which modalities interact best with which, and what limit can be reached by integrating all these modalities in a single classification model. Lastly, we address the problem of data scarcity in multi-modal biomedical problems by contributing advanced methodologies for generating synthetic biological data. Inspired by recent advances in multi-modal generative models for non-biological images, we focus on generating one modality from its paired one (the problem of synthesizing histological images from gene expression) for healthy tissues. We show that the generated synthetic data resemble the real data and can be used to impute missing modalities.

Collections

Theses

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International