Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
16 results
Search Results
Item: A novel technique of feature selection with ReliefF and CFS for protein sequence classification (Springer Verlag, 2019). Kaur, K.; Patil, N.
Bioinformatics has gained wide importance as a research area over the last few decades. Its main aim is to store biological data and analyze it for better understanding. Classifying existing protein sequences is of great use in predicting the functions of newly added ones. The rate at which protein sequence data accumulates is increasing exponentially, so dealing with the large number of features produced by the various encoding techniques emerges as a very challenging task for researchers. Here, a two-stage feature selection algorithm is proposed that combines the ReliefF and CFS techniques: it takes extracted features as input and returns a discriminative set of features. The n-gram sequence encoding technique is used to extract the feature vector from the protein sequences. In the first stage, the ReliefF approach ranks the features to obtain a candidate feature set. In the second stage, CFS is applied to this candidate set to obtain features that have high correlation with the class but low correlation with each other. Classification methods such as Naive Bayes, decision tree, and k-nearest neighbor can be used to analyze the performance of the proposed approach. It is observed that this approach increases the accuracy of the classification methods in comparison to existing methods. © Springer Nature Singapore Pte Ltd. 2019

Item: Authentication based on bioinformatics (2004). Mohandas, M.K.; Shet, K.C.
Authentication has assumed great importance over the years due to hackers and unauthorised access. Authentication based on bioinformatics will do away with all kinds of smart cards, identity cards, or any other device being carried by the users.
A lot of research is being done to improve the reliability of comparing such data against a central database. This paper focuses on the research carried out at NITK, Surathkal in this direction.

Item: Genome Data Analysis Using MapReduce Paradigm (Institute of Electrical and Electronics Engineers Inc., 2015). Pahadia, M.; Srivastava, A.; Srivastava, D.; Patil, N.
Counting the number of occurrences of a substring in a string is a problem that arises in many applications. This paper suggests a fast and efficient solution for the field of bioinformatics. A k-mer is a k-length substring of a biological sequence, and k-mer counting is defined as counting the number of occurrences of all possible k-mers in a biological sequence. K-mer counting has uses in applications ranging from error correction of sequencing reads and genome assembly to disease prediction and feature extraction. The current k-mer counting tools are costly in both time and space. We provide a solution that uses MapReduce and Hadoop to reduce the time complexity. After applying the algorithms to real genome datasets, we concluded that the algorithm using Hadoop and the MapReduce paradigm runs more efficiently and reduces the time complexity significantly. © 2015 IEEE.

Item: Distributed mining of significant frequent colossal closed itemsets from long biological dataset (Springer Verlag, 2020). Vanahalli, M.K.; Patil, N.
Mining colossal itemsets has gained more attention in recent times. An extensive set of short and average-sized itemsets does not contain complete and valuable information for decision making, yet the traditional itemset mining algorithms expend a huge amount of time in mining these small and average-sized itemsets. Colossal itemsets are very significant for numerous applications, including the field of bioinformatics, and are influential during decision making. Bioinformatics has contributed a new kind of dataset known as the long biological dataset.
These are high dimensional datasets, characterized by a large number of features (attributes) and a small number of rows (samples). Extracting a huge amount of information and knowledge from a high dimensional long biological dataset is a nontrivial task, and the existing algorithms for mining significant Frequent Colossal Closed Itemsets (FCCI) from long biological datasets are computationally expensive and sequential. Distributed computing is a good strategy to overcome the inefficiency of the existing sequential algorithms, and this paper proposes a distributed computing approach for mining FCCI. The row enumerated mining search space is efficiently cut down by the pruning strategy incorporated in the Distributed Row Enumerated Frequent Colossal Closed Itemset Mining (DREFCCIM) algorithm, the first distributed algorithm to mine FCCI from long biological datasets. The experimental results demonstrate the efficient performance of the DREFCCIM algorithm in comparison to the current algorithms. © Springer Nature Switzerland AG 2020.

Item: Ageist Spider Monkey Optimization algorithm (Elsevier B.V., 2016). Sharma, A.; Sharma, A.; Panigrahi, B.K.; Kiran, D.; Kumar, R.
Swarm Intelligence (SI) is quite popular in the field of numerical optimization and has enormous scope for research. A number of algorithms based on the decentralized and self-organized swarm behavior of natural as well as artificial systems have been proposed and developed in the last few years. The Spider Monkey Optimization (SMO) algorithm, inspired by the intelligent behavior of spider monkeys, is one such recently proposed algorithm; along with some of its variants, it has proved to be very successful and efficient. A spider monkey group consists of members from every age group, and the agility and swiftness of spider monkeys differ on the basis of their age groups.
This paper proposes a new variant of the SMO algorithm, termed the Ageist Spider Monkey Optimization (ASMO) algorithm, which is more plausible in biological terms and works on the basis of the age differences present in a spider monkey population. Experiments on different benchmark functions with different parameters and settings have been carried out, and the variant with the best-suited settings is proposed. This variant enhances the performance of the original SMO, and ASMO also performs better than some recent advanced algorithms. © 2016

Item: A Noise Reduction Technique Based on Nonlinear Kernel Function for Heart Sound Analysis (Institute of Electrical and Electronics Engineers Inc., 2018). Mondal, A.; Saxena, I.; Tang, H.; Banerjee, P.
The main difficulty encountered in the interpretation of cardiac sounds is the interference of noise. The contaminating noise, produced mainly by the lungs and the surrounding environment, obscures relevant information that is useful for recognizing heart diseases. In this paper, a novel heart sound denoising technique is introduced based on a combined framework of the wavelet packet transform and singular value decomposition (SVD). The most informative node of the wavelet tree is selected on the criterion of a mutual information measurement. Next, the coefficients corresponding to the selected node are processed by the SVD technique to suppress the noisy component of the heart sound signal. To demonstrate the efficacy of the proposed technique, several experiments have been conducted on a heart sound dataset including normal and pathological cases at different signal-to-noise ratios, and the significance of the method is validated by statistical analysis of the results. The biological information preserved in the denoised heart sound signal is evaluated with the k-means clustering algorithm. The overall results show that the proposed method is superior to the baseline methods.
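The SVD stage of such a denoising pipeline can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the wavelet packet selection step is omitted, and the SVD is applied to a trajectory (Hankel) matrix of the raw signal instead of wavelet coefficients. The window length and rank below are arbitrary illustrative choices.

```python
import numpy as np

def svd_denoise(signal, window=64, rank=4):
    """Suppress noise by keeping only the `rank` largest singular
    components of the signal's trajectory (Hankel) matrix."""
    n = len(signal)
    rows = n - window + 1
    # Build the trajectory matrix: each row is a sliding window.
    X = np.array([signal[i:i + window] for i in range(rows)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[rank:] = 0.0                       # discard small singular values (noise)
    Xr = (U * s) @ Vt
    # Average the anti-diagonals to map the matrix back to a 1-D signal.
    out = np.zeros(n)
    counts = np.zeros(n)
    for i in range(rows):
        out[i:i + window] += Xr[i]
        counts[i:i + window] += 1
    return out / counts

# Example: a noisy two-tone signal standing in for a heart sound recording.
t = np.linspace(0, 1, 512)
clean = np.sin(2 * np.pi * 25 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.4 * rng.standard_normal(t.size)
denoised = svd_denoise(noisy, window=64, rank=6)
mse_noisy = float(np.mean((noisy - clean) ** 2))
mse_denoised = float(np.mean((denoised - clean) ** 2))
```

Because the two sinusoids occupy only a few dominant singular directions while the noise spreads its energy across all of them, truncating the spectrum removes most of the noise power.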
© 2013 IEEE.

Item: A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets (Elsevier Ltd, 2019). Gangavarapu, T.; Patil, N.
The predictive accuracy of models on high-dimensional biomedical datasets is often diminished by many irrelevant and redundant molecular disease diagnosis features. Dimensionality reduction aims at finding a feature subspace that preserves the predictive accuracy while eliminating noise and curtailing the high computational cost of training. The applicability of a particular feature selection technique is heavily reliant on the ability of that technique to match the problem structure and to capture the inherent patterns in the data. In this paper, we propose a novel filter–wrapper hybrid ensemble feature selection approach based on the weighted occurrence frequency and a penalty scheme, to obtain the most discriminative and instructive feature subspace. The proposed approach engenders an optimal feature subspace by greedily combining the feature subspaces obtained from various predetermined base feature selection techniques, and the base feature subspaces are penalized based on specific performance-dependent penalty parameters. We leverage effective heuristic search strategies, including greedy parameter-wise optimization and the Genetic Algorithm (GA), to optimize the subspace ensembling process. The effectiveness, robustness, and flexibility of the proposed hybrid greedy ensemble approach in comparison with the base feature selection techniques, prolific filter methods, and state-of-the-art wrapper methods are justified by empirical analysis on three distinct high-dimensional biomedical datasets. Experimental validation revealed that the proposed greedy approach, when optimized using the GA, outperformed the selected base feature selection techniques by 4.17%–15.14% in terms of prediction accuracy.
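The weighted-occurrence-frequency-with-penalty idea behind such an ensemble can be sketched in a few lines. Everything below is illustrative, not the paper's implementation: the subsets, weights, and penalty values are invented, and the GA that tunes these parameters in the paper is omitted.

```python
def ensemble_select(base_subsets, weights, penalties, k):
    """Score each feature by the penalized, weighted frequency of its
    occurrence across base feature subsets, then keep the top `k`.

    base_subsets : list of iterables of feature indices
    weights      : per-selector weight (e.g. its validation accuracy)
    penalties    : per-selector penalty factor in [0, 1], 1 = no penalty
    """
    scores = {}
    for subset, w, p in zip(base_subsets, weights, penalties):
        for f in subset:
            scores[f] = scores.get(f, 0.0) + w * p
    # Rank by descending score; break ties by feature index.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:k]

# Three hypothetical base selectors voting on 6 features.
subsets = [[0, 1, 2], [1, 2, 3], [2, 4, 5]]
weights = [0.9, 0.8, 0.6]    # e.g. validation accuracy of each selector
penalties = [1.0, 1.0, 0.5]  # down-weight the weakest selector
top = ensemble_select(subsets, weights, penalties, k=3)
print(top)  # feature 2 appears in all three subsets, so it ranks first
```

A feature endorsed by several strong selectors accumulates a high score even if no single selector is decisive, which is the core appeal of the ensemble scheme.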
© 2019 Elsevier B.V.

Item: An efficient parallel row enumerated algorithm for mining frequent colossal closed itemsets from high dimensional datasets (Elsevier Inc., 2019). Vanahalli, M.K.; Patil, N.
Mining colossal itemsets from high dimensional datasets has gained focus in recent times. The conventional algorithms expend most of their time in mining small and mid-sized itemsets, which do not contain valuable and complete information for decision making. Mining Frequent Colossal Closed Itemsets (FCCI) from a high dimensional dataset plays a highly significant role in decision making for many applications, especially in the field of bioinformatics. For mining FCCI from a high dimensional dataset, the existing preprocessing techniques fail to prune the complete set of irrelevant features and irrelevant rows; besides, the state-of-the-art algorithms for the task are sequential and computationally expensive. The proposed work presents an Effective Improved Parallel Preprocessing (EIPP) technique to prune the complete set of irrelevant features and irrelevant rows from a high dimensional dataset, and a novel, efficient Parallel Frequent Colossal Closed Itemset Mining (PFCCIM) algorithm. The PFCCIM algorithm is integrated with a novel Rowset Cardinality Table (RCT), an efficient method to check the closedness of a rowset, and an efficient pruning strategy to cut down the mining search space. The proposed PFCCIM algorithm is the first parallel algorithm to mine FCCI from a high dimensional dataset. The performance study shows the improved effectiveness of the proposed EIPP technique over the existing preprocessing techniques and the improved efficiency of the proposed PFCCIM algorithm over the existing algorithms.
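The abstract does not spell out the Rowset Cardinality Table, but the row-enumeration and closedness-checking ideas it relies on can be shown in a minimal, naive sketch (assuming a toy dataset of a few rows with many items, as in long biological data; the function names are invented for illustration and nothing here is parallelized):

```python
def closure(rowset, dataset):
    """Items common to every row in `rowset` (the closed itemset it induces)."""
    rows = [set(dataset[r]) for r in rowset]
    return set.intersection(*rows) if rows else set()

def supporting_rows(itemset, dataset):
    """Rowset of all rows that contain every item of `itemset`."""
    return frozenset(r for r, row in enumerate(dataset)
                     if itemset <= set(row))

def is_closed(itemset, dataset):
    """An itemset is closed iff it equals the closure of its supporting rows,
    i.e. no proper superset has the same support."""
    return closure(supporting_rows(itemset, dataset), dataset) == itemset

# Toy "long" dataset: few rows (samples), several items (features) per row.
data = [
    ['a', 'b', 'c', 'd'],
    ['a', 'b', 'c', 'e'],
    ['a', 'b', 'f', 'g'],
]
print(is_closed({'a', 'b'}, data))  # no superset of {a,b} keeps support 3
print(is_closed({'a'}, data))       # {a,b} has the same support, so not closed
```

Row enumeration pays off on such datasets because the search space is bounded by the small number of rows rather than the enormous number of feature combinations.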
© 2018 Elsevier Inc.

Item: An efficient dynamic switching algorithm for mining colossal closed itemsets from high dimensional datasets (Elsevier B.V., 2019). Vanahalli, M.K.; Patil, N.
The abundance of data across a variety of domains, including bioinformatics, has led to the formation of datasets with high dimensionality. The conventional algorithms expend most of their time in mining a large number of small and mid-sized itemsets, which do not contain complete and valuable information for decision making. Recent research has focused on Frequent Colossal Closed Itemsets (FCCI), which play a significant role in decision making for many applications, especially in the field of bioinformatics. The state-of-the-art algorithms for mining FCCI from datasets consisting of a large number of rows and a large number of features are computationally expensive, as they are either pure row enumeration or pure feature enumeration algorithms. Moreover, the existing preprocessing techniques fail to prune the complete set of irrelevant features and irrelevant rows. The proposed work presents an Effective Improvised Preprocessing (EIP) technique to prune the complete set of irrelevant features and irrelevant rows, and a novel, efficient Dynamic Switching Frequent Colossal Closed Itemset Mining (DSFCCIM) algorithm, which efficiently switches between row and feature enumeration methods based on data characteristics during the mining process. Further, the DSFCCIM algorithm is integrated with a novel Rowset Cardinality Table and Itemset Support Table, two efficient methods to check the closedness of a rowset and an itemset, and two efficient pruning strategies to cut down the search space. The proposed DSFCCIM algorithm is the first dynamic switching algorithm to mine FCCI from datasets consisting of a large number of rows and a large number of features.
The performance study shows the improved effectiveness of the proposed EIP technique over the existing preprocessing techniques and the improved efficiency of the proposed DSFCCIM algorithm over the existing algorithms. © 2019 Elsevier B.V.

Item: A fast and novel approach based on grouping and weighted mRMR for feature selection and classification of protein sequence data (Inderscience Publishers, 2020). Kaur, K.; Patil, N.
The analysis of protein sequences in bioinformatics has gained wide importance as a research area. Newly added protein sequences can be analysed using existing proteins by converting them into feature vector form. However, dealing with the huge number of features obtained from sequence encoding techniques is a challenging task. Since not all of the obtained features are actually required, a three-stage feature selection approach is proposed. In the first stage, features are ranked and the most irrelevant ones are removed; in the second stage, conflicting features are grouped together; and in the third stage, a fast approach based on weighted Minimum Redundancy Maximum Relevance (wMRMR) is proposed and applied to the grouped features. Different classification methods are used to analyse the performance of the proposed approach. It is observed that the proposed approach improves classification accuracy and reduces time consumption in comparison to the state-of-the-art methods. © 2020 Inderscience Enterprises Ltd.
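The mRMR principle underlying the third stage (pick features with high relevance to the class and low redundancy with those already picked) can be sketched as follows. This is a generic, illustrative greedy mRMR with a redundancy weight `w`, not the paper's wMRMR algorithm; the grouping stage and the paper's specific weighting are omitted, and the toy data is invented.

```python
import numpy as np

def mutual_info(x, y):
    """Mutual information between two discrete sequences (in nats)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            pxy = np.mean((x == vx) & (y == vy))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == vx) * np.mean(y == vy)))
    return mi

def weighted_mrmr(X, y, k, w=1.0):
    """Greedily pick `k` features with high relevance to `y` and low
    (weight-`w`) average redundancy with already-selected features."""
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]      # start from the most relevant
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s])
                                  for s in selected])
            score = relevance[j] - w * redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Toy data: feature 0 is a noisy copy of the class, feature 1 duplicates it,
# feature 2 carries weaker but independent information.
rng = np.random.default_rng(7)
n = 300
y = rng.integers(0, 2, n)
f0 = np.where(rng.random(n) < 0.10, 1 - y, y)
f1 = f0.copy()
f2 = np.where(rng.random(n) < 0.30, 1 - y, y)
X = np.column_stack([f0, f1, f2])
picked = weighted_mrmr(X, y, k=2, w=1.0)
print(picked)  # the exact duplicate is penalized; the weaker independent feature wins
```

The duplicate feature has the highest relevance but also maximal redundancy with the first pick, so the weighted criterion prefers the weaker yet complementary feature.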
