2. Thesis and Dissertations

Permanent URI for this community: https://idr.nitk.ac.in/handle/1/10

Search Results

Now showing 1 - 3 of 3
  • An Efficient Framework for Information Retrieval from Linked Data
    (National Institute of Technology Karnataka, Surathkal, 2021) R, Sakthi Murugan; V. S, Ananthanarayana
    Linked data is a method of publishing machine-processable data over the web. The Resource Description Framework (RDF) is a standard model for publishing linked data. Currently, distributed compendiums of linked data are available over multiple SPARQL endpoints that can be queried using SPARQL queries. The size of the linked data is steadily increasing as many companies and government organizations have begun adopting linked data technologies. Many frameworks have been proposed for information retrieval from linked data, and the volume of data is a common challenge. Due to the huge volume, many existing linked data search engines have not indexed the latest data. The major components of information retrieval from linked data include storing, partitioning, indexing and ranking. This thesis presents a novel framework for information retrieval from a distributed compendium of linked data called the ‘Linked Data Search Framework’, abbreviated as LDSF. The significant contributions of LDSF include the method of storage, partitioning, indexing and ranking of linked data. The storage cost of RDF data is one of the primary concerns for searching linked data. The main objective of linked data is to represent the data as URIs in a format that is both human understandable and machine processable. This intermediate URI form of representation is difficult for humans to understand and consumes massive storage in the case of machines. Humans read, think and speak text data as words and not as characters, but computers use character-based encoding such as Unicode to handle text data (including linked data). This thesis presents an approach named ‘WordCode’, a word-based encoding of text data (including linked data), that enables computers to store and process text data as words. A trie-based code page named ‘WordTrie’ is proposed to store words for rapid encoding and decoding.
Experimental results from encoding text files from the standard corpus using WordCode show a reduction in file size of up to 19.9% compared with character-based encoding. The proposed WordCode method of encoding words in a machine-processable format uses less storage space, resulting in faster processing and communication of text data (including linked data). Query processing in massive linked data is performed by distributing the storage across multiple partitions. Considerable research has been conducted to partition linked data based on clusters. Additionally, substantial research on hash-based partitioning, cloud-based partitioning, and graph-based partitioning has been reported. However, these sophisticated partitioning algorithms are not based on the semantic relatedness of the data and suffer from high preprocessing cost. In this thesis, a semantic-based partitioning method using a novel nexus clustering algorithm is discussed. For every concept, the core properties of the linked data are identified, and bilevel, nexus-based hierarchical agglomerative clustering is used to partition the linked data. The proposed method is evaluated using the gold standard test data sampled from DBpedia across eight closely related categories. The proposed clustering technique partitions the linked data with a precision of 98.7% on the gold standard dataset. Multiple indexing strategies have been proposed to search and access linked data easily at any given time. All these extensive indexing schemes involve substantial redundant data, which greatly increases the storage and computational resources needed to update the index of the dynamically growing linked data. This thesis introduces ‘trist’, a hybrid data structure combining a tree and doubly linked list to index linked data. The linked data contain URIs and values. The URIs and values are separately indexed using ‘URI trist’ and ‘Value trist’, respectively.
Compared to the existing indexing strategies, this indexing approach reduces the storage consumption. The experimental results using the sampled DBpedia dataset demonstrate that trist-based indexing achieves a space-saving of 60% compared to regular graph-based storage of linked data. Also, the proposed trist-based indexing is 6000% faster in accessing the linked data from the graph than the regular graph without indexing. The ultimate goal of an information retrieval system is to rank the linked data in a way that will be appealing to the end user. The existing approaches for ranking linked data are all atomistic. Often, the problem with ranking linked data is that the data are of various kinds from multiple sources. This thesis presents a holistic approach to rank linked data from multiple SPARQL endpoints and present the integrated results. The holistic rank is computed based on four subranks: endpoint rank, concept rank, predicate rank and value rank. LDSF also provides an approach to represent the URI form of linked data to the user in an easily understandable manner. The ordering of Wikipedia is agreed to be readable by its users over the web, and the ranking in this thesis is evaluated based on the ordering of Wikipedia: the proposed ranking correlates up to 99% with the ordering of Wikipedia with the DBpedia sampled dataset. Overall, the proposed LDSF is an efficient framework for storing, partitioning, indexing and ranking linked data that produces a more satisfactory query result than that of existing systems.
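The word-based encoding idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the actual WordCode format or the WordTrie layout, so everything here is an assumption made for illustration: the class names `TrieNode` and `WordTrieSketch`, the sequential integer codes, and the whitespace tokenization are all hypothetical.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One trie node; `code` is set only where a complete word ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.code: Optional[int] = None


class WordTrieSketch:
    """Hypothetical code page mapping whole words to integer codes and back.

    This is an illustrative stand-in, not the thesis's WordTrie.
    """

    def __init__(self) -> None:
        self.root = TrieNode()
        self.words: List[str] = []  # index = code, value = word (for decoding)

    def code_for(self, word: str) -> int:
        """Return the code of `word`, assigning the next free code if it is new."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.code is None:
            node.code = len(self.words)
            self.words.append(word)
        return node.code

    def encode(self, text: str) -> List[int]:
        """Encode whitespace-separated words as a list of integer codes."""
        return [self.code_for(w) for w in text.split()]

    def decode(self, codes: List[int]) -> str:
        """Rebuild the text, joining the decoded words with single spaces."""
        return " ".join(self.words[c] for c in codes)


page = WordTrieSketch()
codes = page.encode("linked data is linked open data")
print(codes)               # [0, 1, 2, 0, 3, 1]
print(page.decode(codes))  # linked data is linked open data
```

Repeated words reuse the same code, which is where a word-level scheme could save space over storing every character; the size reduction reported in the abstract refers to the thesis's own encoding, not to this sketch.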
  • Web UR: Effective Techniques For Web Usage Mining And Recommender System
    (National Institute of Technology Karnataka, Surathkal, 2013) G., Poornalatha; V. S, Ananthanarayana; Raghavendra, Prakash S.
    The proliferation of the internet, along with the attractiveness of the web in recent years, has made web mining a research area of great importance. Web mining has many advantages that make this technology attractive to researchers. The analysis of web users’ navigational patterns within a web site can provide useful information for server performance enhancements, restructuring a web site, direct marketing in e-commerce, etc. This thesis discusses an effective clustering technique that groups user sessions by modifying the k-means algorithm. The proposed distance measures, namely the variable length vector distance, the sequence alignment based distance measure, and the hybrid sequence alignment measure, are explained. The results obtained are validated. The present work attempts to solve the problem of predicting the next page to be accessed by the user based on the mining of web server logs, which maintain the information of users who access the web site. The proposed model yields good prediction accuracy compared to existing methods such as the Markov model, association rules, ANN, etc. A recommender system based on session collaborative filtering is proposed. The proposed recommender system is compared with a few other recommender systems using precision and recall as metrics, and a better performance is observed. The outcome of the prediction and recommender system could be used to suggest structural modifications to the web site.
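To make the idea of an alignment-style distance between user sessions more concrete, here is a minimal sketch. The abstract does not define the thesis's sequence alignment based measure, so this example swaps in plain Levenshtein edit distance over page sequences, normalized by the longer session length. The function names and the sample page names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two page sequences (unit costs).

    Stand-in for an alignment-based measure; not the thesis's definition.
    """
    prev = list(range(len(b) + 1))
    for i, page_a in enumerate(a, start=1):
        curr = [i]
        for j, page_b in enumerate(b, start=1):
            cost = 0 if page_a == page_b else 1
            curr.append(min(prev[j] + 1,          # drop a page from a
                            curr[j - 1] + 1,      # insert a page from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def session_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer session length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical sessions: ordered lists of visited pages.
s1 = ["home", "products", "cart"]
s2 = ["home", "products", "checkout"]
s3 = ["about", "contact", "faq"]

print(session_distance(s1, s1))  # 0.0
print(session_distance(s1, s3))  # 1.0
print(edit_distance(s1, s2))     # 1
```

A distance like this handles sessions of different lengths, which a fixed-length vector distance cannot do directly. Using it inside k-means would still require a rule for choosing cluster representatives, which the abstract does not specify.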
  • Effective Multimedia Document Representations for Knowledge Discovery
    (National Institute of Technology Karnataka, Surathkal, 2017) K, Pushpalatha; V. S, Ananthanarayana
    In recent years, rapid advances in multimedia technology have led to explosive growth in multimedia documents. In order to utilize the multimodal information of multimedia documents, sophisticated knowledge discovery systems are required. Knowledge discovery systems require efficient multimedia mining methods to extract meaningful and useful information from the huge volume of multimedia documents. The success of multimedia mining relies on the representation of multimedia documents and their multimodal contents. An appropriate representation of multimedia documents reveals useful patterns that can assist the multimedia mining methods in discovering useful knowledge. The multimodal nature of multimedia objects is a challenging problem for multimedia document representation, as the features of multimodal objects are in different spaces with different characteristics and dimensionalities. Representation of multimodal multimedia objects in a unified feature space helps the multimedia document representation and multimedia mining methods. The research work in this thesis proposes multimedia data representation methods, multimedia document representations, and multimedia mining methods for effective knowledge discovery in multimedia documents. In the first methodology, this thesis aims at the representation of multimodal multimedia objects in a unified feature space. We propose two multimedia data representation methods, Multimedia to Signal Conversion (MSC) and Multimedia to Image Conversion (MIC), to represent the multimedia objects in a unified domain. The MSC represents the multimedia objects in the frequency domain by converting the multimedia objects into signal objects. The MIC converts the multimedia objects into image objects to represent them in the spatial domain. The multimedia objects in the unified domain are represented in the unified feature space using features with similar dimensions and characteristics.
Hence, both multimedia data representation methods convert the multimodal multimedia documents into unified multimedia documents. The unified multimedia documents ease the representation of multimedia documents and improve the efficiency of multimedia mining methods. The proposed multimedia data representation methods are effectively used for knowledge extraction from multimedia documents. In the second methodology, this thesis presents two multimedia document representations, Multimedia Suffix Tree Document (MSTD) and Multimedia Feature Pattern Tree (MFPT), to represent the unified multimedia documents. The MSTD represents the unified multimedia documents based on shared similar multimedia objects among the documents. The similarity between the multimedia objects depends on the similarity of the features. The MFPT represents the documents based on shared similar feature patterns of the multimedia objects. Both representations are compact and provide the complete information of the documents. They function as the platform for the multimedia knowledge extraction methods. In the third methodology, this thesis explores multimedia mining methods based on the MSTD and MFPT representations. The MSTD and MFPT based classification algorithms effectively classify the multimedia documents. The multimedia documents are partitioned into clusters of the same multimedia concepts using the MSTD and MFPT based clustering algorithms. The MSTD representation extracts the frequent multimedia patterns to generate the multimedia class association rules for classifying the multimedia documents. The MFPT representation extracts the sequential multimedia feature patterns to derive the multimedia class sequential rules that support the classification of multimedia documents based on the object characteristics. The efficacy of the proposed methods is evaluated by conducting experiments with four datasets of multimodal multimedia documents.
Experimental results demonstrate that the proposed multimedia data representation methods benefit the multimedia document representation and multimedia mining methods by representing the multimodal multimedia objects in a unified feature space. The proposed multimedia document representations are effectively used to enhance the performance of multimedia mining methods in discovering knowledge from multimedia documents.