Enhanced protein structural class prediction using effective feature modeling and ensemble of classifiers

dc.contributor.authorBankapur, S.
dc.contributor.authorPatil, N.
dc.date.accessioned2026-02-05T09:27:32Z
dc.date.issued2021
dc.description.abstractProtein Secondary Structural Class (PSSC) information is important for addressing further challenges in protein sequence analysis, such as protein fold recognition, protein tertiary structure prediction, and analysis of protein functions for drug discovery. Identifying PSSC using biological methods is time-consuming and costly. Several computational models have been developed to predict the structural class; however, they generalize poorly. Hence, predicting PSSC from protein sequences remains a difficult task. In this article, we propose an effective, novel, and generalized prediction model consisting of feature modeling and an ensemble of classifiers. The proposed feature modeling extracts discriminating information (features) by leveraging three techniques: (i) Embedding: features are extracted on the basis of the spatial residue arrangements of the sequences using word embedding approaches; (ii) SkipXGram bi-gram: various sets of skipped bi-gram features are extracted from the sequences; and (iii) General Statistical (GS): features are extracted that cover the global information of the structural sequences. The combined sets of features are used to train an ensemble of three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM). When assessed on five benchmark datasets (high and low sequence similarity), viz. z277, z498, 25PDB, 1189, and FC699, the proposed model reported overall accuracies of 93.55, 97.58, 81.82, 81.11, and 93.93 percent, respectively. The proposed model is further validated on a large-scale, updated, low-similarity (≤25%) dataset, where it achieved an overall accuracy of 81.11 percent. The proposed generalized model is robust and consistently outperformed several state-of-the-art models on all five benchmark datasets. © 2021 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
dc.identifier.citationIEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021, 18, 6, pp. 2409-2419
dc.identifier.issn15455963
dc.identifier.urihttps://doi.org/10.1109/TCBB.2020.2979430
dc.identifier.urihttps://idr.nitk.ac.in/handle/123456789/23416
dc.publisherInstitute of Electrical and Electronics Engineers Inc.
dc.subjectClassification (of information)
dc.subjectDecision trees
dc.subjectEmbeddings
dc.subjectForecasting
dc.subjectLarge dataset
dc.subjectSupport vector machines
dc.subjectAmino acid sequence
dc.subjectBi-gram
dc.subjectEnsemble of classifiers
dc.subjectEnsemble-classifier
dc.subjectFeature models
dc.subjectProtein sequences
dc.subjectProtein structural class
dc.subjectSkip-gram
dc.subjectStructural class
dc.subjectProteins
dc.subjectprotein
dc.subjectamino acid sequence
dc.subjectbiology
dc.subjectchemistry
dc.subjectclassification
dc.subjectgenetics
dc.subjectmachine learning
dc.subjectprocedures
dc.subjectprotein database
dc.subjectprotein secondary structure
dc.subjectsequence analysis
dc.subjectsupport vector machine
dc.subjectAmino Acid Sequence
dc.subjectComputational Biology
dc.subjectDatabases, Protein
dc.subjectMachine Learning
dc.subjectProtein Structure, Secondary
dc.subjectSequence Analysis, Protein
dc.subjectSupport Vector Machine
dc.titleEnhanced protein structural class prediction using effective feature modeling and ensemble of classifiers
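
The skipped bi-gram features mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' SkipXGram implementation, only one plausible reading of the idea: count pairs of symbols separated by a fixed number of skipped positions and normalize the counts. The three-letter alphabet `HEC` (helix, strand, coil), the function names, and the `max_skip` parameter are all assumptions introduced for illustration.

```python
from collections import Counter
from itertools import product

# Assumed three-state secondary-structure alphabet (helix, strand, coil).
# The record does not specify the alphabet; pass another one if needed.
ALPHABET = "HEC"


def skip_bigram_features(seq, skip, alphabet=ALPHABET):
    """Return normalized frequencies of symbol pairs with `skip` symbols between them.

    skip=0 gives ordinary adjacent bi-grams. The output has one entry per
    ordered pair of alphabet symbols, in itertools.product order.
    """
    pairs = Counter(
        seq[i] + seq[i + skip + 1]
        for i in range(len(seq) - skip - 1)
    )
    # Avoid division by zero for sequences too short to contain any pair.
    total = sum(pairs.values()) or 1
    return [pairs[a + b] / total for a, b in product(alphabet, repeat=2)]


def skip_bigram_vector(seq, max_skip=2, alphabet=ALPHABET):
    """Concatenate skipped bi-gram features for skip = 0 .. max_skip (hypothetical helper)."""
    vector = []
    for skip in range(max_skip + 1):
        vector.extend(skip_bigram_features(seq, skip, alphabet))
    return vector
```

For example, `skip_bigram_vector("HHEC")` yields a 27-dimensional vector (9 ordered pairs for each of 3 skip values). In a pipeline like the one the abstract describes, such a vector would be combined with the embedding and statistical features before being passed to the classifiers.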
