Computational Analysis of Protein Structure and its Subcellular Localization using Amino Acid Sequences

Bankapur, Sanjay S.

Please use this identifier to cite or link to this item: https://idr.nitk.ac.in/jspui/handle/123456789/17032

Title:	Computational Analysis of Protein Structure and its Subcellular Localization using Amino Acid Sequences
Authors:	Bankapur, Sanjay S.
Supervisors:	Patil, Nagamma.
Keywords:	Department of Information Technology;Progressive alignment;Look back ahead scoring strategy;Position-residue specific dynamic gap penalty scoring strategy;Single iterative optimization;Embedding;Skip-gram bi-gram;Evolutionary profiles;Ensemble classifier;Deep learning;Binary Relevance;Genetic Algorithm;Machine Learning
Issue Date:	2021
Publisher:	National Institute of Technology Karnataka, Surathkal
Abstract:	The cell is the basic unit of all organisms. During the cellular life cycle, complex metabolic activities are carried out in different cell compartments, and proteins play an important role in many of them. Proteins are synthesized in the cell after transcription. A newly generated protein has a linear structure, called the protein primary structure. Within the cell, proteins tend to move from one compartment (subcellular location) to another, and depending on the environment in which a protein resides, its primary structure folds into secondary and tertiary structures. Tertiary-structured proteins interact with nearby structured proteins to form a quaternary structure. A protein performs its biological functions once it attains its tertiary structure. Identifying a protein's structure and its subcellular locations is a challenging and important task in medical science. Many health issues are addressed through novel drug discovery, and prior, accurate knowledge of a protein's structure and subcellular location helps in developing the corresponding drug. Protein structure and subcellular location are identified with biological methods such as X-ray crystallography, nuclear magnetic resonance spectroscopy, cell fractionation, fluorescence microscopy, and electron microscopy. The main advantage of these methods is their accuracy; their disadvantages are that they are time-consuming and very expensive. In this post-genomic era, high volumes of protein primary structures are decoded by various research communities and added to protein data banks.
Identifying protein structure and subcellular locations with biological methods is not feasible for such high volumes of proteins. Over the decades, various computational methods have been proposed to identify protein structure and its locations; however, the existing methods exhibit limited accuracy and are therefore less effective. The main objective of this thesis is to propose effective computational models that contribute to the prediction of protein structure and its subcellular locations. To this end, four specific problems are addressed: (i) multiple sequence alignment, (ii) protein secondary structural class prediction, (iii) protein fold recognition, and (iv) protein subcellular localization prediction. Multiple sequence alignment is important because it captures vital and consistent homologous patterns among proteins, and these patterns in turn help in determining protein structure and subcellular locations. The proposed alignment method comprises three main modules: (a) an effective scoring system that scores the quality of the aligned sequences, (b) a progressive alignment approach, adopted and modified to align multiple sequences, and (c) a refinement step in which the aligned sequences are improved using the proposed polynomial-time single iterative optimization framework. The proposed method has been assessed on publicly available benchmark datasets and recorded a 17.7% improvement over the CLUSTAL X model on the BAliBASE dataset. Identifying the protein secondary structural class is an important task that further helps in predicting protein tertiary structure. Protein secondary structural class prediction is a supervised, multi-class problem.
The proposed protein secondary structural class prediction model combines a novel feature modelling strategy, which extracts global and local features, with a novel ensemble of classifiers that predicts the structural class. The model has been assessed on both publicly available benchmark datasets and newly derived high-volume datasets, and it recorded an improvement of 5.3% on the 25PDB dataset over one of the best predictors from the literature. Protein fold recognition is the categorization of a protein into one of the folds that it can exhibit in its tertiary structure; it is a supervised, multi-class problem. The proposed fold recognition model uses a novel and effective feature modelling approach, based on Convolutional and SkipXGram bi-gram techniques, to extract global and local features, followed by an effective deep learning framework for fold recognition. The model has been assessed on both publicly available benchmark datasets and newly derived high-volume datasets, and it recorded a relative improvement of 5% on the DD dataset over one of the best predictors from the literature. An effective protein sub-chloroplast localization prediction model is proposed to address subcellular localization at one level of finer granularity. Protein sub-chloroplast localization is a supervised problem that is both multi-class and multi-label. The proposed model uses a novel SkipXGram bi-gram feature extraction technique followed by a deep learning framework for multi-label classification. It has been assessed on publicly available benchmark datasets and recorded an absolute improvement of 30.39% on the Novel dataset over the best predictor from the literature.
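Illustrative note (not part of the thesis abstract): the abstract names SkipXGram bi-gram features and the keywords mention evolutionary profiles, but neither gives a formula. The sketch below shows one common way a skip-gram bi-gram feature can be computed from a per-position profile matrix: for a chosen skip distance, sum the products of profile values at position i and at position i + skip + 1 over the whole sequence. This is an assumption about the general technique and may differ from the author's exact definition. The names skip_bigram_features, flatten, and toy_profile are hypothetical, and the toy profile uses 3 columns where a real evolutionary profile would typically have 20.

```python
from typing import List


def skip_bigram_features(profile: List[List[float]], skip: int) -> List[List[float]]:
    """Sum, over all positions i, the outer product of profile row i
    and profile row i + skip + 1.

    skip = 0 pairs adjacent positions (an ordinary bi-gram);
    skip = 1 leaves one position between the paired rows, and so on.
    The result is an n_cols x n_cols matrix whose size does not depend
    on the sequence length.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    n_cols = len(profile[0]) if profile else 0
    features = [[0.0] * n_cols for _ in range(n_cols)]
    step = skip + 1
    # range() is empty when the profile is shorter than the step.
    for i in range(len(profile) - step):
        left, right = profile[i], profile[i + step]
        for a in range(n_cols):
            for b in range(n_cols):
                features[a][b] += left[a] * right[b]
    return features


def flatten(matrix: List[List[float]]) -> List[float]:
    """Turn a feature matrix into a flat vector, row by row."""
    return [value for row in matrix for value in row]


# Toy profile (hypothetical data): 4 sequence positions, 3 columns.
toy_profile = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

adjacent = skip_bigram_features(toy_profile, skip=0)
one_gap = skip_bigram_features(toy_profile, skip=1)

# Concatenating several skip distances gives one fixed-length vector.
feature_vector = flatten(adjacent) + flatten(one_gap)
```

Because the output size depends only on the number of profile columns, sequences of different lengths map to vectors of the same length, which is what a downstream classifier needs as input.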
URI:	http://idr.nitk.ac.in/jspui/handle/123456789/17032
Appears in Collections:	1. Ph.D Theses

Files in This Item:

File	Description	Size	Format
Mr. Sanjay S. Bankapur.pdf		2.87 MB	Adobe PDF