A Benchmark Study of Protein Embeddings in Sequence-Based Classification

Humasak Tommy Argo Simanjuntak, Institut Teknologi Del, Indonesia
Lamsihar Siahaan, Institut Teknologi Del, Indonesia
Patricia Dian Margaretha, Institut Teknologi Del, Indonesia
Ruth Christine Manurung, Institut Teknologi Del, Indonesia
Susi Purba, Institut Teknologi Del, Indonesia
Rosni Lumbantoruan, Institut Teknologi Del, Indonesia
Arlinta Barus, Institut Teknologi Del, Indonesia
Helen Grace Gonzales, University of Science and Technology of Southern Philippines, Philippines

Abstract


Proteins are central to cell structure and function and play a vital role in the activity of tissues and organs. Humans can produce thousands of proteins, each consisting of tens or hundreds of interconnected amino acids. The amino acid sequence determines a protein's 3D structure and conformational dynamics, which in turn affect its biological function. Understanding protein function is therefore important, especially for biological processes at the molecular level. However, extracting features from protein sequences that can predict protein function remains challenging: it is slow, it is expensive, and its accuracy has yet to be maximized, which leaves a large gap between protein sequence and function. Protein embedding is an essential step in protein function prediction with a deep learning model. This study therefore benchmarks three protein embedding models, ProtBert, T5, and ESM-2, as part of protein function prediction using an LSTM model. We examine the performance of each embedding and how to use it to find the optimal embedding for a given use case. We experimented with the CAFA-5 dataset to identify the optimal embedding model for protein function prediction. Experimental results show that ESM-2 outperforms ProtBert and T5. In training, the accuracy of ESM-2 is above 0.99, almost the same as T5 and higher than ProtBert. Furthermore, testing on five sample protein sequences shows that ESM-2 has an average hit rate of 93.33% (100% for four samples and 66.67% for one sample).
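The reported average hit rate can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract does not define "hit rate", so the `hit_rate` helper assumes it means the percentage of a protein's true function labels that appear among the predicted labels, and the labels `term_a`, `term_b`, and `term_c` are hypothetical placeholders, not data from the study. Only the five per-sample percentages come from the abstract.

```python
def hit_rate(predicted, actual):
    """Percentage of the actual labels found among the predicted labels.

    Assumed definition for illustration; the abstract does not specify one.
    """
    actual = set(actual)
    if not actual:
        raise ValueError("actual must contain at least one label")
    return 100.0 * len(actual & set(predicted)) / len(actual)


def mean_hit_rate(rates):
    """Arithmetic mean of per-sample hit rates (in percent)."""
    return sum(rates) / len(rates)


# Per-sample hit rates reported in the abstract: four at 100%, one at 66.67%.
per_sample = [100.0, 100.0, 100.0, 100.0, 66.67]
average = mean_hit_rate(per_sample)
print(f"{average:.2f}%")  # 93.33%

# Hypothetical example: two of three true labels recovered gives about 66.67%.
example = hit_rate(["term_a", "term_b"], ["term_a", "term_b", "term_c"])
print(f"{example:.2f}%")  # 66.67%
```

Under this assumed definition, a per-sample value of 66.67% is consistent with two of three true labels being recovered, though the abstract does not say how many labels the fifth sample had.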


Keywords


Protein function prediction; protein embedding; protein sequence; protein function; sequence-based classification



DOI: https://doi.org/10.21831/elinvo.v9i2.77389


Copyright (c) 2024 Elinvo (Electronics, Informatics, and Vocational Education)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 2477-2399 (online) || ISSN 2580-6424 (print)
