Computational Analysis of Protein Structure and its Subcellular Localization using Amino Acid Sequences
Date
2021
Authors
Bankapur, Sanjay S.
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
A cell is the basic unit of all organisms. During the cellular life cycle, various complex
metabolic activities are carried out in different cell compartments. Proteins play
an important role in many of these activities. Proteins are synthesized in the cell
following transcription and post-transcriptional modification. Initially, a newly generated
protein is a linear chain, called the protein primary structure. Within the cell, proteins
tend to move from one compartment (subcellular location) to another, and depending on
the environment in which they reside, primary structures transform into secondary and
tertiary structures. Tertiary-structured proteins interact with nearby structured proteins
to form a quaternary structure. A protein performs its biological functions once it
attains its tertiary structure.
Identifying a protein's structure and its subcellular locations is a challenging and
important task in medical science. Various health issues are addressed through novel
drug discoveries, and prior, accurate knowledge of a protein's structure and subcellular
location helps in developing the corresponding drug. To identify protein structure and
subcellular locations, various biological methods such as X-ray crystallography, nuclear
magnetic resonance spectroscopy, cell fractionation, fluorescence microscopy, and electron
microscopy are used. The main advantage of biological methods is that they identify
protein structures and subcellular locations accurately. Their disadvantages are that
they are time-consuming and very expensive. In this post-genomic era, large volumes of
protein primary structures are decoded by various research communities and added to
protein data banks. Identifying protein structure and subcellular locations using
biological methods is not feasible for such large volumes of proteins.
Over the decades, various computational methods have been proposed to identify
protein structure and subcellular locations; however, existing computational methods
exhibit limited accuracy and are therefore less effective. The main objective of this
thesis is to propose effective computational models that contribute to the prediction of
protein structure and subcellular locations. To this end, four specific problems of
protein structure and subcellular location are addressed:
(i) multiple sequence alignment, (ii) protein secondary structural class prediction, (iii)
protein fold recognition, and (iv) protein subcellular localization prediction.
Multiple sequence alignment is important because it captures vital and consistent
homologous patterns among proteins, and these patterns further help in determining
protein structure and subcellular locations. The proposed alignment method
includes three main modules: a) an effective scoring system to score the quality of the
aligned sequences, b) a modified progressive alignment approach to align multiple
sequences, and c) refinement of the aligned sequences using the proposed
polynomial-time single iterative optimization framework. The proposed
method was assessed on publicly available benchmark datasets and recorded a
17.7% improvement over the CLUSTAL X model on the BAliBASE dataset.
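As a rough illustration of what "scoring the quality of the aligned sequences" can mean, the following minimal sketch computes a generic sum-of-pairs score over the columns of an alignment. It is not the scoring system proposed in the thesis, which the abstract does not specify; the function names and the match, mismatch, and gap values are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical scoring parameters, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned symbols ('-' marks a gap)."""
    if a == "-" and b == "-":
        return 0  # gap-gap pairs are ignored in this sketch
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Sum the pairwise scores over every column of an alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have equal length")
    total = 0
    for column in zip(*alignment):
        # Every unordered pair of rows contributes once per column.
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


# Three toy sequences that are already aligned (gaps inserted by hand).
aligned = ["MKT-A", "MKTLA", "MR-LA"]
print(sum_of_pairs(aligned))
```

A score of this kind lets two candidate alignments of the same sequences be compared numerically, which is what an iterative refinement step needs in order to decide whether a change to the alignment is an improvement.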
Identifying the protein secondary structural class is an important task that
further helps in predicting protein tertiary structure. Protein secondary structural
class prediction is a supervised multi-class problem. The
proposed protein secondary structural class prediction model contains a novel feature
modelling strategy that extracts global and local features, followed by a novel ensemble
of classifiers to predict the structural class. The proposed model was assessed on both
publicly available benchmark datasets and recently derived high-volume datasets. The
proposed model recorded an improvement of 5.3% on the 25PDB
dataset over one of the best predictors from the literature.
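The idea of combining several classifiers can be illustrated with plain majority voting. This sketch is only a generic stand-in: the abstract does not say how the proposed ensemble combines its members, and the function name and label strings below are hypothetical.

```python
from collections import Counter


def majority_vote(predictions: list[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of three base classifiers for one protein.
votes = ["all-alpha", "alpha/beta", "all-alpha"]
print(majority_vote(votes))
```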
Protein fold recognition is the categorization of the various folds that a protein
exhibits in its tertiary structure. Protein fold recognition is a supervised
multi-class problem. The proposed fold recognition model contains a novel
and effective feature modelling approach that includes Convolutional and SkipXGram
bi-gram techniques to extract global and local features, followed by an effective deep
learning framework for fold recognition. The proposed model was assessed on
both publicly available benchmark datasets and recently derived high-volume datasets.
The proposed model recorded a relative improvement of 5% on the
DD dataset over one of the best predictors from the literature.
An effective protein sub-chloroplast localization prediction model is proposed to
solve the subcellular localization problem at one level finer granularity. Protein
sub-chloroplast localization is a supervised multi-class, multi-label problem.
The proposed protein sub-chloroplast localization prediction model
uses a novel feature extraction technique, SkipXGram bi-gram, followed by
a deep learning framework for multi-label classification. The proposed model was
assessed on publicly available benchmark datasets and recorded an absolute improvement
of 30.39% on the Novel dataset over the best predictor from the literature.
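The general idea behind skip-gram bi-gram features can be sketched as counting ordered residue pairs that are separated by a fixed number of skipped positions. The sketch below works directly on the raw amino acid sequence and is an assumption-laden illustration, not the SkipXGram bi-gram technique of the thesis, whose exact formulation the abstract does not give. The function names and the `max_skip` parameter are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def skip_bigram_counts(sequence: str, max_skip: int) -> Counter:
    """Count ordered residue pairs separated by 0..max_skip positions."""
    counts = Counter()
    for skip in range(max_skip + 1):
        step = skip + 1  # distance between the two residues of a pair
        for i in range(len(sequence) - step):
            counts[(sequence[i], sequence[i + step], skip)] += 1
    return counts


def feature_vector(sequence: str, max_skip: int) -> list[int]:
    """Flatten the counts into a fixed-length vector of 400 * (max_skip + 1)."""
    counts = skip_bigram_counts(sequence, max_skip)
    return [
        counts[(a, b, skip)]
        for skip in range(max_skip + 1)
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    ]


features = feature_vector("MKTLAAC", max_skip=1)
print(len(features))
```

Because the vector length depends only on `max_skip` and not on the sequence length, proteins of different lengths map to inputs of the same size, which is what a downstream classifier requires.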
Keywords
Department of Information Technology, Progressive alignment, Look back ahead scoring strategy, Position-residue specific dynamic gap penalty scoring strategy, Single iterative optimization, Embedding, Skip-gram bi-gram, Evolutionary profiles, Ensemble classifier, Deep learning, Binary Relevance, Genetic Algorithm, Machine Learning