Show simple item record

dc.contributor.author: Villegas Morcillo, Amelia Otilia
dc.contributor.author: Gómez García, Ángel Manuel
dc.contributor.author: Sánchez Calle, Victoria Eugenia
dc.date.accessioned: 2023-03-16T09:20:16Z
dc.date.available: 2023-03-16T09:20:16Z
dc.date.issued: 2022-02-10
dc.identifier.citation: Published version: Amelia Villegas-Morcillo, Angel M. Gomez, Victoria Sanchez, An analysis of protein language model embeddings for fold prediction, Briefings in Bioinformatics, Volume 23, Issue 3, May 2022, bbac142, https://doi.org/10.1093/bib/bbac142
dc.identifier.uri: https://hdl.handle.net/10481/80621
dc.description.abstract: The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignment (MSA) as input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at the amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
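The abstract describes a pipeline in which per-residue (amino acid level) protein-LM embeddings are reduced to a fixed-size protein representation and then passed to a fine-tuning network for fold classification. Below is a minimal illustrative sketch of that idea using NumPy and random data; the shapes, the mean-pooling step, and the single linear layer are assumptions for illustration only, not the paper's actual models or dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L residues, D embedding dimensions, K fold classes.
L, D, K = 120, 1024, 1195

# Stand-in for a per-residue embedding matrix produced by a protein LM.
residue_embeddings = rng.standard_normal((L, D))

# Mean pooling over the residue axis yields a fixed-size protein-level
# vector regardless of sequence length.
protein_vector = residue_embeddings.mean(axis=0)  # shape (D,)

# A single linear layer standing in for the fine-tuning classifier
# (the paper uses MLP, ResCNN-BGRU and Light-Attention networks).
W = rng.standard_normal((K, D)) * 0.01
logits = W @ protein_vector                       # shape (K,)
predicted_fold = int(np.argmax(logits))
print(protein_vector.shape, predicted_fold)
```

In practice the pooled vector would come from one of the pre-trained LMs compared in the paper (e.g. ESM-1b or ProtT5), and the classifier would be trained with fold labels rather than drawn at random.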
dc.description.sponsorship: PID2019-104206GB-I00 funded by MCIN/AEI/10.13039/501100011033
dc.description.sponsorship: FPI grant BES-2017-079792
dc.language.iso: eng
dc.publisher: Oxford University Press
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Protein Fold Prediction
dc.subject: Protein Language Models
dc.subject: Fine-Tuning Neural Networks
dc.subject: Embedding Learning
dc.title: An Analysis of Protein Language Model Embeddings for Fold Prediction
dc.type: journal article
dc.rights.accessRights: open access
dc.identifier.doi: 10.1101/2022.02.07.479394
dc.type.hasVersion: SMUR


Files in this item

[PDF]

This item appears in the following collection(s)


Attribution-NonCommercial-NoDerivatives 4.0 International
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International