Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Villegas Morcillo, Amelia Otilia; Gómez García, Ángel Manuel; Sánchez Calle, Victoria Eugenia

doi:10.1093/bioinformatics/btaa701

btaa701.pdf (573.0 KB)

Identifiers

URI: http://hdl.handle.net/10481/69200

DOI: 10.1093/bioinformatics/btaa701

Publisher

Oxford University Press

Date

2020-08-14

Bibliographic reference

Amelia Villegas-Morcillo, Stavros Makrodimitris, Roeland C H J van Ham, Angel M Gomez, Victoria Sanchez, Marcel J T Reinders, Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function, Bioinformatics, Volume 37, Issue 2, 15 January 2021, Pages 162–170, https://doi.org/10.1093/bioinformatics/btaa701

Sponsor

Keygene N.V., a crop innovation company in the Netherlands; Spanish MINECO/FEDER TEC2016-80141-P; FPI grant BES-2017-079792

Abstract

Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from them. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels are available.

Results: We applied an existing deep sequence model, pretrained in an unsupervised setting, to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for the task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. It also partly removes the need for complex prediction models: a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation (CAFA3) benchmark. We also show that combining this sequence representation with protein 3D structure information does not improve performance, hinting that 3D structure may also be learned during the unsupervised pretraining.
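To make the hand-crafted baselines named in the abstract concrete, the following is a minimal sketch of two of them: one-hot encoding of amino acids and k-mer counts. It is an illustration only, not the authors' implementation. The function names, the alphabetical ordering of the 20-letter alphabet, the choice of k = 2, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed (alphabetical) order.
# This ordering is an assumption of the sketch, not taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return an L x 20 matrix (list of lists) with a single 1 per residue.

    Residues outside the 20 standard letters get an all-zero row.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Return a fixed-length vector of overlapping k-mer counts (20**k entries)."""
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


toy_sequence = "MKVLAAGKV"  # made-up sequence, for illustration only
encoded = one_hot(toy_sequence)
dipeptides = kmer_counts(toy_sequence, k=2)
print(len(encoded), len(encoded[0]))     # sequence length, alphabet size
print(len(dipeptides), sum(dipeptides))  # number of possible 2-mers, number of windows
```

The one-hot matrix grows with sequence length, whereas the k-mer vector has a fixed size regardless of length, which is one reason count-based features are convenient inputs for a simple classifier such as a perceptron.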

Collections

DTSTC - Artículos

Except where otherwise noted, this item's license is described as Attribution-NonCommercial 3.0 Spain