Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
92 results
Search Results
Item A novel technique of feature selection with ReliefF and CFS for protein sequence classification (Springer Verlag service@springer.de, 2019) Kaur, K.; Patil, N.
Bioinformatics has gained wide importance as a research area over the last few decades. Its main aim is to store biological data and analyze it for better understanding. Classifying existing protein sequences is of great use in predicting the functions of newly added protein sequences. Protein sequence data is accumulating at an exponentially increasing rate, so it is very challenging for researchers to deal with the large number of features obtained through various encoding techniques. Here, a two-stage feature selection algorithm is proposed that combines the ReliefF and CFS techniques; it takes the extracted features as input and provides a discriminative set of features. The n-gram sequence encoding technique is used to extract the feature vector from the protein sequences. In the first stage, the ReliefF approach is used to rank the features and obtain a candidate feature set. In the second stage, CFS is applied to this candidate feature set to obtain features that have high correlation with the class but low correlation with other features. Classification methods such as Naive Bayes, decision tree, and k-nearest neighbor can be used to analyze the performance of the proposed approach. It is observed that this approach increases the accuracy of the classification methods in comparison to existing methods. © Springer Nature Singapore Pte Ltd. 2019

Item Genetic algorithm based wrapper feature selection on hybrid prediction model for analysis of high dimensional data (Institute of Electrical and Electronics Engineers Inc., 2015) Rayasam, R.C.; Kannan, R.; Patil, N.
Data mining concepts have been extensively used for disease prediction in the medical field.
Many Hybrid Prediction Models (HPM) have been proposed and implemented in this area; however, there is always a need to increase accuracy and efficiency. Existing methods take all the features into account when building the classifier model, which reduces accuracy and increases overall processing time. This paper proposes a Genetic Algorithm based Wrapper feature selection Hybrid Prediction Model (GWHPM). The model first uses the k-means clustering technique to remove outliers from the dataset. Next, an optimal set of features is obtained using Genetic Algorithm based wrapper feature selection. Finally, this feature set is used to build classifier models such as Decision Tree, Naive Bayes, k-nearest neighbor and Support Vector Machine. A comparative study of GWHPM is carried out, and it is observed that the proposed model performs better than the existing methods. © 2014 IEEE.

Item An improved sentiment analysis of online movie reviews based on clustering for box-office prediction (Institute of Electrical and Electronics Engineers Inc., 2015) Patil, N.; Pruthvi, H.R.; Nisha, K.K.; Hadimani, N.H.
With the rapid development of e-commerce, more online reviews for products and services are created, which form an important source of information for both sellers and customers. Research on sentiment and opinion mining for online review analysis has attracted increasing attention because such study helps leverage information from online reviews for potential economic impact. The paper discusses applying sentiment analysis and machine learning methods to study the relationship between the online reviews for a movie and the movie's box office revenue performance. The paper shows that a simplified version of the sentiment-aware autoregressive model can produce very good accuracy for predicting box office sales using online review data.
Document-level sentiment analysis is used, with Term Frequency (TF) and Inverse Document Frequency (IDF) values as features along with fuzzy clustering, which yields positive and negative sentiments. This led to the creation of a simpler model that could be more efficient to train and use. In addition, a classification model is created using a Support Vector Machine (SVM) classifier to predict the trend of the box office revenue from the review sentiment. © 2015 IEEE.

Item Classification of multi-genomic data using MapReduce paradigm (Institute of Electrical and Electronics Engineers Inc., 2015) Pahadia, M.; Srivastava, A.; Srivastava, D.; Patil, N.
Counting the number of occurrences of a substring in a string is a problem in many applications. This paper suggests a fast and efficient solution for the field of bioinformatics. A k-mer is a k-length substring of a biological sequence. k-mer counting is defined as counting the number of occurrences of all the possible k-mers in a biological sequence. k-mer counting has uses in applications ranging from error correction of sequencing reads and genome assembly to disease prediction and feature extraction. We provide a Hadoop-based solution to the k-mer counting problem and then use it for classification of multi-genomic data. The classification is done using classifiers such as Naive Bayes, Decision Tree and Support Vector Machine (SVM). Accuracy of more than 99% is observed. © 2015 IEEE.

Item Recommender system based on Hierarchical Clustering algorithm Chameleon (Institute of Electrical and Electronics Engineers Inc., 2015) Gupta, U.; Patil, N.
Recommender systems are becoming an inherent part of today's e-commerce applications. Because a recommender system has a direct impact on the sales of many products, it plays an important role in e-commerce. Collaborative filtering is one of the oldest techniques used in recommender systems.
A lot of work has been done to improve collaborative filtering, which comprises two components: user-based and item-based. The basic requirements of today's recommender systems are accuracy and speed. In this work, an efficient technique for recommender systems based on hierarchical clustering is proposed. The user- or item-specific information is grouped into a set of clusters using the Chameleon hierarchical clustering algorithm. A voting system is then used to predict the rating of a particular item. To evaluate the performance of the Chameleon-based recommender system, it is compared with an existing technique based on the K-means clustering algorithm. The results demonstrate that the Chameleon-based recommender system produces less error than the K-means-based recommender system. © 2015 IEEE.

Item A novel semi-supervised approach for protein sequence classification (Institute of Electrical and Electronics Engineers Inc., 2015) Chaturvedi, B.; Patil, N.
Bioinformatics is an emerging research area. Classification of protein sequence datasets is the biggest challenge for researchers. This paper deals with supervised and semi-supervised classification of human protein sequences. Amino acid composition (AAC) is used for feature extraction from the protein sequences. Classification techniques such as Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbour (KNN), Random Forest and Decision Tree are used to classify the protein sequence dataset. Among these classifiers, SVM reported the best result, with higher accuracy. The limitation of SVM is that it works only with supervised (labeled) datasets. It does not work with unsupervised or semi-supervised datasets (unlabeled datasets, or a large amount of unlabeled data among a small amount of labeled data). A novel semi-supervised support vector machine (SSVM) classifier is proposed that works with a combination of labeled and unlabeled data.
The results show that the proposed approach gives higher accuracy with the semi-supervised dataset. Principal component analysis (PCA) is used for feature reduction of the protein sequences. The proposed semi-supervised support vector machine (SSVM) using PCA gives an accuracy increase of about 5 to 10%. © 2015 IEEE.

Item Recommendation of Optimal Locations for Government Funded Educational Institutes in Urban India Using a Hybrid Data Mining Technique (Institute of Electrical and Electronics Engineers Inc., 2015) Pulakhandam, S.; Patil, N.
The Government of India has introduced schemes to build educational facilities in areas where the literacy rate is less than the national average. It was found that literacy rate is a sufficient criterion with respect to rural areas, but a different approach must be taken for urban planning because of space constraints, heterogeneous communities and the varied backgrounds of children living in urban areas. A hybrid data mining method to discover optimum locations for educational facilities in urban areas is proposed. The method is a combination of rule-based classification and spatial clustering. Rule-based classification is used to identify relevant data points from the spatial data set. New parameters such as dropout rate and the ratio of children out of school to children in school are introduced to measure relevance, since literacy rate alone was found to be an insufficient criterion. Spatial clustering is used to group the points according to their location. The center of each cluster signifies the optimum location for an educational facility. A modified COD-CLARANS method is proposed. The algorithm is modified in two aspects. First, it is proposed that the absolute error, E, is calculated using the shortest path of commute on city roads rather than the obstructed distance calculated in the pre-processing step of the original COD-CLARANS algorithm.
Secondly, only areas with space available for the establishment of a facility are considered to represent clusters. The modified method seeks to improve efficiency and to make the spatial clustering technique more relevant to the urban setting. A comparison between different clustering algorithms and the modified COD-CLARANS algorithm is presented. © 2015 IEEE.

Item Evaluation of Machine Learning Frameworks on Bank Marketing and Higgs Datasets (Institute of Electrical and Electronics Engineers Inc., 2015) Bhuvan, B.M.; Jain, S.; Rao, V.D.; Patil, N.; Raghavendra, G.S.
Big data is an emerging field in which datasets of various sizes are being analyzed for potential applications. In parallel, many frameworks are being introduced in which these datasets can be fed into machine learning algorithms. Though some experiments have been done to compare different machine learning algorithms on different data, these experiments have not been tested on different platforms. Our research aims to compare two selected machine learning algorithms on data sets of different sizes deployed on different platforms: Weka, Scikit-Learn and Apache Spark. They are evaluated on training time, accuracy and root mean squared error. This comparison helps us decide which platform is best suited for applying the selected computationally expensive machine learning algorithms to data of a particular size. Experiments suggested that Scikit-Learn would be optimal on data that can fit into memory. When working with huge data, Apache Spark would be optimal, as it performs parallel computations by distributing the data over a cluster. Hence this study concludes that the Spark platform, which has growing support for parallel implementations of machine learning algorithms, could be optimal for analyzing big data.
© 2015 IEEE.

Item An Improved Method for Disease Prediction Using Fuzzy Approach (Institute of Electrical and Electronics Engineers Inc., 2015) Chetty, N.; Vaisla, K.S.; Patil, N.
Data mining is the process of extracting useful information from huge amounts of data. Data mining has great scope in the field of medicine. This article works with the PIMA and Liver-disorder datasets. Many researchers have proposed the use of the K-nearest neighbor (KNN) algorithm for diabetes prediction. Some researchers have proposed a different approach, using K-means clustering for preprocessing and then KNN for classification. These approaches resulted in poor classification or prediction accuracy. In our work we proposed and developed two different methods to improve classification accuracy: the first is a fuzzy c-means clustering algorithm followed by a KNN classifier, and the second is a fuzzy c-means clustering algorithm followed by a fuzzy KNN classifier. We succeeded in obtaining better results than the existing methods for the given datasets. Our second approach produced better results than the first. Classification is carried out using the ten-fold cross-validation technique. © 2015 IEEE.

Item Genome Data Analysis Using MapReduce Paradigm (Institute of Electrical and Electronics Engineers Inc., 2015) Pahadia, M.; Srivastava, A.; Srivastava, D.; Patil, N.
Counting the number of occurrences of a substring in a string is a problem in many applications. This paper suggests a fast and efficient solution for the field of bioinformatics. A k-mer is a k-length substring of a biological sequence. K-mer counting is defined as counting the number of occurrences of all the possible k-mers in a biological sequence. K-mer counting has uses in applications ranging from error correction of sequencing reads and genome assembly to disease prediction and feature extraction. The current k-mer counting tools are costly in both time and space.
We provide a solution that uses MapReduce and Hadoop to reduce the time complexity. After applying the algorithms to real genome datasets, we concluded that the algorithm using Hadoop and the MapReduce paradigm runs more efficiently and reduces the time complexity significantly. © 2015 IEEE.
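
The k-mer counting described in the abstract above can be illustrated with a small single-machine sketch. This is not the authors' Hadoop implementation: the function names (map_kmers, reduce_counts, count_kmers) are hypothetical, and the map and reduce steps run in one Python process only to show how the counting decomposes.

```python
from collections import Counter


def map_kmers(sequence, k):
    """Map step: emit a (k-mer, 1) pair for every k-length window."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty range, so no pairs are emitted.
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals


def count_kmers(sequences, k):
    """Run the map step over every sequence, then reduce all emitted pairs."""
    mapped = (pair for seq in sequences for pair in map_kmers(seq, k))
    return reduce_counts(mapped)


if __name__ == "__main__":
    counts = count_kmers(["ACGTAC", "GTAC"], 2)
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
```

In a distributed setting the map step would run independently on separate chunks of the input and the framework would group the pairs by k-mer before the reduce step. That grouping is what lets the work spread across a cluster.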
