Please use this identifier to cite or link to this item: https://idr.nitk.ac.in/jspui/handle/123456789/17152
Title: An Efficient Framework for Information Retrieval from Linked Data
Authors: R, Sakthi Murugan.
Supervisors: V. S, Ananthanarayana
Keywords: Department of Information Technology;Semantic Web;Linked Data;RDF;SPARQL;Information Retrieval;Storing;Indexing;Partitioning;Clustering;Ranking;Word Encoding;Text Encoding
Issue Date: 2021
Publisher: National Institute of Technology Karnataka, Surathkal
Abstract: Linked data is a method of publishing machine-processable data over the web. The resource description framework (RDF) is a standard model for publishing linked data. Currently, distributed compendiums of linked data are available over multiple SPARQL endpoints that can be queried using SPARQL queries. The size of the linked data is steadily increasing as many companies and government organizations have begun adopting linked data technologies. Many frameworks have been proposed for information retrieval from linked data, and the volume of data is a common challenge. Due to the huge volume, many existing linked data search engines have not indexed the latest data. The major components of information retrieval from linked data include storing, partitioning, indexing and ranking. This thesis presents a novel framework for information retrieval from a distributed compendium of linked data called the ‘Linked Data Search Framework’, abbreviated as LDSF. The significant contributions of LDSF include the method of storage, partitioning, indexing and ranking of linked data. The storage cost of RDF data is one of the primary concerns for searching linked data. The main objective of linked data is to represent the data as URIs in a format that is both human understandable and machine processable. This intermediate URI form of representation is difficult for humans to understand and consumes massive storage in the case of machines. Humans read, think and speak text data as words and not as characters, but computers use character-based encoding such as Unicode to handle text data (including linked data). This thesis presents an approach named ‘WordCode’, a word-based encoding of text data (including linked data), that enables computers to store and process text data as words. A trie-based code page named ‘WordTrie’ is proposed to store words for rapid encoding and decoding.
Experimental results from encoding text files from the standard corpus using WordCode show a reduction of up to 19.9% in file size compared to character-based encoding. The proposed WordCode method of encoding words in a machine-processable format used less storage space, resulting in faster processing and communication of text data (including linked data). Query processing in massive linked data is performed by distributing the storage across multiple partitions. Considerable research has been conducted to partition linked data based on clusters. Additionally, substantial research on hash-based partitioning, cloud-based partitioning, and graph-based partitioning has been reported. However, these sophisticated partitioning algorithms are not based on the semantic relatedness of the data and suffer from high preprocessing cost. In this thesis, a semantic-based partitioning method using a novel nexus clustering algorithm is discussed. For every concept, the core properties of the linked data are identified, and bilevel, nexus-based hierarchical agglomerative clustering is used to partition the linked data. The proposed method is evaluated using the gold standard test data sampled from DBpedia across eight closely related categories. The proposed clustering technique partitions the linked data with a precision of 98.7% on the gold standard dataset. Multiple indexing strategies have been proposed to search and access linked data easily at any given time. All these extensive indexing schemes involve substantial redundant data, which greatly increases the storage and computational resources needed to update the index of the dynamically growing linked data. This thesis introduces ‘trist’, a hybrid data structure combining a tree and doubly linked list to index linked data. The linked data contain URIs and values. The URIs and values are separately indexed using ‘URI trist’ and ‘Value trist’, respectively.
Compared to the existing indexing strategies, this indexing approach reduces the storage consumption. The experimental results using the sampled DBpedia dataset demonstrate that trist-based indexing achieves a space-saving of 60% compared to regular graph-based storage of linked data. Also, the proposed trist-based indexing is 6000% faster in accessing the linked data from the graph than the regular graph without indexing. The ultimate goal of an information retrieval system is to rank the linked data that will be appealing to the end user. The existing approaches for ranking linked data are all atomistic. Often, the problem with ranking linked data is that the data are of various kinds from multiple sources. This thesis presents a holistic approach to rank linked data from multiple SPARQL endpoints and present the integrated results. The holistic rank is computed based on four subranks: endpoint rank, concept rank, predicate rank and value rank. LDSF also provides an approach to represent the URI form of linked data to the user in an easily understandable manner. The ordering of Wikipedia is agreed to be readable by its users over the web, and the ranking in this thesis is evaluated based on the ordering of Wikipedia: the proposed ranking correlates up to 99% with the ordering of Wikipedia with the DBpedia sampled dataset. Overall, the proposed LDSF is an efficient framework for storing, partitioning, indexing and ranking linked data that produces a more satisfactory query result than that of existing systems.
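Illustrative note (not part of the thesis): the abstract describes WordCode as a word-based encoding backed by a trie-based code page. The Python sketch below shows the general idea only. The class name `WordTrie` is taken from the abstract, but the methods `code_for` and `word_for`, the helper functions, the integer code assignment, and the whitespace tokenization are all assumptions made for illustration. The thesis's actual code format and data layout are not given in this record.

```python
class WordTrie:
    """Toy code page: a trie that assigns each distinct word an integer code.

    Assumption: codes are handed out in order of first appearance.
    """

    def __init__(self):
        self.root = {}    # nested dicts: character -> child node
        self.words = []   # code -> word (a plain list, used for decoding)

    def code_for(self, word):
        """Walk (and extend) the trie along the word, returning its code."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        # The empty-string key marks "a word ends here"; it cannot
        # collide with a single-character edge label.
        if "" not in node:
            node[""] = len(self.words)
            self.words.append(word)
        return node[""]

    def word_for(self, code):
        """Return the word that was assigned the given code."""
        return self.words[code]


def encode(text, trie):
    """Encode text as a list of word codes (whitespace tokenization)."""
    return [trie.code_for(w) for w in text.split()]


def decode(codes, trie):
    """Rebuild text from word codes, joining words with single spaces."""
    return " ".join(trie.word_for(c) for c in codes)


trie = WordTrie()
codes = encode("linked data is linked", trie)
print(codes)                 # one integer per word; repeated words share a code
print(decode(codes, trie))
```

In this sketch a repeated word costs one integer rather than one code unit per character, which is the intuition behind the storage saving the abstract reports. The sketch does not preserve original whitespace or punctuation handling, and it makes no claim about how the thesis serializes codes to bytes.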
URI: https://idr.nitk.ac.in/jspui/handle/123456789/17152
Appears in Collections: 1. Ph.D Theses

Files in This Item:
File: SAKTHI MURUGAN R 145012-IT14P01.pdf
Size: 18.56 MB
Format: Adobe PDF

