Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
88 results
Search Results
Gaining Actionable Insights in COVID-19 Dataset Using Word Embeddings (Springer Science and Business Media Deutschland GmbH, 2022)
Jha, R.A.; Ananthanarayana, V.S.
The field of unsupervised natural language processing (NLP) is gradually growing in prominence and popularity due to the overwhelming amount of scientific and medical data available as text, such as published journals and papers. To make use of this data, several techniques are used to extract information from these texts. In this paper, we use the COVID-19 corpus (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) related to the deadly coronavirus, SARS-CoV-2, to extract useful information which can be invaluable in finding a cure for the disease. We make use of two word-embedding models, Word2Vec and global vectors for word representation (GloVe), to efficiently encode all the information available in the corpus. We then follow some simple steps to find the possible cures of the disease. We obtained useful results using these word-embedding models and observed that the Word2Vec model performed better than the GloVe model on this dataset. Another point highlighted by this work is that latent information about potential future discoveries is significantly contained in past papers and publications. © 2022, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Dynamic primary copy with piggy-backing mechanism for replicated UDDI registry (Springer Science and Business Media Deutschland GmbH, 2006)
Ananthanarayana, V.S.; Vidyasankar, K.
As the community using web services grows, the UDDI registry is a crucial entry point that needs to provide high throughput, high availability and access to accurate data. Replication is often used to satisfy such requirements. In this paper, we propose the dynamic primary copy method, a variant of the primary copy method for handling the replicated UDDI registry, and two algorithms implementing this method.
In this method, the update is done at the site where the request is submitted. The algorithms use a simple mechanism to handle conflicting requests on UDDI entities efficiently. Due to the large volume of update and inquiry requests to UDDI, the number and size of the messages are critical in any replication solution for the UDDI registry. Our algorithms reduce both the number and the size of messages significantly. The main difference between the two algorithms is that one of them handles a high degree of conflicting update requests efficiently without transmitting unnecessary intermediate results. © Springer-Verlag Berlin Heidelberg 2006.

A novel data structure for efficient representation of large data sets in data mining (2006)
Pai, R.M.; Ananthanarayana, V.S.
An important goal in data mining is to generate an abstraction of the data. Such an abstraction helps in reducing the time and space requirements of the overall decision making process. It is also important that the abstraction be generated from the data in a small number of scans. In this paper, we propose a novel data structure called the Prefix-Postfix structure (PP-structure), which is an abstraction of the data that can be built by scanning the database only once. We prove that this structure is compact, complete and incremental and is therefore suitable to represent dynamic databases. Further, we propose a clustering algorithm using this structure. The proposed algorithm is tested on different real world datasets and is shown to be both space efficient and time efficient for large datasets without sacrificing accuracy. We compare our algorithm with other algorithms and show the effectiveness of our algorithm.
© 2006 IEEE.

EfficientTreeMiner: Mining frequent induced substructures from XML documents without candidate generation (2006)
Santhi Thilagam, P.S.; Ananthanarayana, V.S.
Tree structures are used extensively in domains such as XML databases, computational biology, pattern recognition, computer networks, web mining, multi-relational data mining and so on. In this paper, we present EfficientTreeMiner, a computationally efficient algorithm that discovers all frequently occurring induced subtrees in a database of labeled rooted unordered trees. The proposed algorithm mines frequent subtrees without generating any candidate subtrees. Efficiency is achieved by compressing the large database into a condensed data structure, namely prefix string representation, which reduces space complexity, and by adopting a Frequent Immediate Descendents method that avoids the costly generation of candidate sets. Experimental results show that our algorithm has lower time complexity than existing approaches and is also scalable for mining both long and short frequent subtrees. © 2006 IEEE.

Efficient mining of frequent rooted continuous directed subgraphs (2006)
Sreenivasa, G.J.; Ananthanarayana, V.S.
Mining frequent rooted continuous directed (RCD) subgraphs is very useful in the Web usage mining domain. We formulate the problem of mining RCD subgraphs in a database of rooted labeled continuous directed graphs. We propose a novel approach of merging like RCD subgraphs. This approach builds a Pattern Super Graph (PSG) structure. The PSG is a compact structure and ideal for extracting frequent patterns in the form of RCD subgraphs. PSG-based mining avoids costly, repeated database scans and generates no candidates. The results obtained support the proposed approach.
© 2006 IEEE.

An efficient classification algorithm based on pattern range tree prototypes (2007)
Shreeranga, P.R.; Vig, A.; Ananthanarayana, V.S.
Abstraction based pattern classifiers have drawn a lot of attention. This type of classifier has two phases: a design phase, where the abstractions are created, and a classification phase, where the classification is done using these abstractions. Techniques like neural networks and genetic algorithms require very high design time. In other techniques, like the nearest neighbor classifier, the design time is close to zero but the classification time is high. The Pattern Count Tree (PC-tree) based classifier was proposed as an abstraction based classifier that strikes a balance between the design time and the classification time. In this paper, we propose a novel data structure called the Pattern Range Tree (PR-tree) and a pattern classifier based on the PR-tree. Experimental results presented in this paper show that the PR-tree based classifier (PRC) is more efficient than the PC-tree based classifier (PCC) in terms of storage space, processing time and classification accuracy. © 2007 IEEE.

An experimental study of the effect of frequency of co-occurrence of features in clustering (2007)
Pai, R.M.; Ananthanarayana, V.S.
In this paper, an attempt has been made to explore the effect of frequency of co-occurrence of features on the accuracy of the clustering results. This has been achieved by incorporating the frequency component in the clustering algorithm. By frequency, we mean the number of times a sequence of features appears in the data set. We try to utilize this component in the algorithm and study its effect on the resultant accuracy. The algorithm we have used is the pattern count tree (PC-tree) based clustering algorithm. The PC-tree is a compact and complete representation of the data set. It is data order independent and incremental. It can be applied to changing data and changing knowledge, i.e.
dynamic databases. This algorithm is based on a compact data structure called the PC-tree. The node of the PC-tree has, in addition to other fields, a count field, which keeps track of the count of the number of features shared by the pattern. In the literature, the PC-tree was used for clustering and the count field was used only to retrieve the transactions. In this paper, we try to make use of this field in clustering. We have also used the partitioned PC-tree based algorithm and studied the effect of frequency on the accuracy. We have conducted extensive experiments with the OCR handwritten digit dataset, a real dataset, and observed the effect of frequency on the clustering results. The results of all our experiments are tabulated. © 2007 IEEE.

Prefix-Suffix trees: A novel scheme for compact representation of large datasets (Springer Verlag, 2007)
Pai, R.M.; Ananthanarayana, V.S.
An important goal in data mining is to generate an abstraction of the data. Such an abstraction helps in reducing the time and space requirements of the overall decision making process. It is also important that the abstraction be generated from the data in a small number of scans. In this paper, we propose a novel scheme called Prefix-Suffix trees for compact storage of patterns in data mining, which forms an abstraction of the patterns and which is generated from the data in a single scan. This abstraction takes less space and hence forms a compact storage of patterns. Further, we propose a clustering algorithm based on this storage and show experimentally that this type of storage reduces space and time. This has been established by considering large data sets of handwritten numerals, namely the OCR data, the MNIST data and the USPS data. The proposed algorithm is compared with other similar algorithms and the efficacy of our scheme is thus established.
© Springer-Verlag Berlin Heidelberg 2007.

Semantic partition based association rule mining across multiple databases using abstraction (2007)
Santhi Thilagam, P.S.; Ananthanarayana, V.S.
Association rule mining (ARM) is both computationally and I/O intensive. The majority of ARM algorithms reported in the literature are efficient in handling high dimensional data but are single-database based. Many enterprises maintain several databases independently to serve different purposes. There could be an implicit association among various parts of such data. In this paper, we investigate a mechanism to generate Association Rules (ARs) between the sets of values which are subsets of domains of attributes occurring in relations present in different databases. In our approach, the relevant databases, relations and attributes are identified using knowledge, multiple navigation paths are generated using the data dictionary, and a structure is constructed which semantically partitions the resultant relation using these navigation paths. We propose an efficient algorithm which uses this structure to generate ARs. © 2007 IEEE.

An abstraction based communication efficient distributed association rule mining (2008)
Santhi Thilagam, P.S.; Ananthanarayana, V.S.
Association rule mining is one of the most researched areas because of its applicability in various fields. We propose a novel data structure called the Sequence Pattern Count (SPC) tree, which stores the database compactly and completely and requires only one scan of the database for its construction. The completeness property of the SPC tree with respect to the database makes it more suitable for mining association rules in the context of changing data and changing supports without rebuilding the tree. A performance study shows that the SPC tree is efficient and scalable.
We also propose a Doubly Logarithmic-depth Tree (DLT) algorithm, which uses the SPC tree to efficiently mine huge amounts of geographically distributed data while minimizing the communication and computation costs. DLT requires only O(n) messages for support count exchange and takes only O(log log n) time for the exchange of messages, which increases its efficiency. © Springer-Verlag Berlin Heidelberg 2008.
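
Several of the abstracts above rely on a tree that stores a count at every node, is built in a single scan, and can be updated as data changes. The Python sketch below illustrates that general idea with a plain counted prefix tree. It is not the authors' PC-tree or SPC tree: the class names `Node` and `CountedPrefixTree`, the choice to sort items within each transaction, and the `support` query are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node: how many transactions pass through it, plus children."""
    count: int = 0
    children: dict = field(default_factory=dict)


class CountedPrefixTree:
    """Hypothetical counted prefix tree built in one pass over the data."""

    def __init__(self):
        self.root = Node()

    def insert(self, transaction):
        """Add one transaction; shared prefixes reuse existing nodes."""
        node = self.root
        node.count += 1  # root count equals the number of transactions
        for item in sorted(set(transaction)):
            node = node.children.setdefault(item, Node())
            node.count += 1

    def support(self, itemset):
        """Count the transactions that contain every item in itemset."""
        return self._support(self.root, sorted(set(itemset)), 0)

    def _support(self, node, target, i):
        if i == len(target):
            # Every transaction through this node contains the whole target.
            return node.count
        total = 0
        for item, child in node.children.items():
            if item == target[i]:
                total += self._support(child, target, i + 1)
            elif item < target[i]:
                # Items on a path are sorted, so target[i] may appear deeper.
                total += self._support(child, target, i)
            # item > target[i]: target[i] cannot occur below this child.
        return total


tree = CountedPrefixTree()
for t in (["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]):
    tree.insert(t)  # single scan: each transaction is read exactly once

print(tree.support(["a", "c"]))
```

Because the counts are kept in the tree, a support query for any threshold can be answered without rescanning the transactions, and a new transaction is absorbed with one more `insert` call.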
