An Efficient Framework for Information Retrieval from Linked Data
Date
2021
Authors
R, Sakthi Murugan.
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
Linked data is a method of publishing machine-processable data over the web. The
resource description framework (RDF) is a standard model for publishing linked data.
Currently, distributed compendiums of linked data are available over multiple SPARQL
endpoints that can be queried using SPARQL queries. The size of the linked data
is steadily increasing as many companies and government organizations have begun
adopting linked data technologies. Many frameworks have been proposed for information
retrieval from linked data, and the volume of data is a common challenge. Due
to the huge volume, many existing linked data search engines have not indexed the
latest data. The major components of information retrieval from linked data include
storing, partitioning, indexing and ranking. This thesis presents a novel framework for
information retrieval from a distributed compendium of linked data called the ‘Linked
Data Search Framework’, abbreviated as LDSF. The significant contributions of LDSF
include the method of storage, partitioning, indexing and ranking of linked data.
The storage cost of RDF data is one of the primary concerns for searching linked
data. The main objective of linked data is to represent the data as URIs in a format that
is both human-understandable and machine-processable. In practice, however, this intermediate URI form
of representation is difficult for humans to understand and consumes substantial storage on
machines. Humans read, think and speak text data as words and not as characters,
but computers use character-based encoding such as Unicode to handle text data
(including linked data). This thesis presents an approach named ‘WordCode’, a word-based
encoding of text data (including linked data), that enables computers to store and
process text data as words. A trie-based code page named ‘WordTrie’ is proposed to
store words for rapid encoding and decoding. Experimental results from encoding text
files from the standard corpus using WordCode show a reduction of up to 19.9% in file
size compared to character-based encoding. The proposed WordCode
method of encoding words in a machine-processable format uses less storage
space, resulting in faster processing and communication of text data (including linked
data).
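The idea of encoding text word by word through a trie can be illustrated with a minimal sketch. The class below and its integer code assignment are hypothetical and are not the WordCode or WordTrie format defined in the thesis; the sketch only assumes that each distinct word is stored once in a trie and replaced in the text by a short code.

```python
class ToyWordTrie:
    """Toy trie that assigns each distinct word an integer code (illustrative only)."""

    def __init__(self):
        self.root = {}    # nested dicts: character -> child node
        self.words = []   # code -> word, used for decoding

    def code_for(self, word):
        # Walk (and extend) the trie one character at a time.
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        # The empty-string key marks the end of a word and stores its code.
        if "" not in node:
            node[""] = len(self.words)
            self.words.append(word)
        return node[""]

    def encode(self, text):
        # Assumption: words are separated by whitespace.
        return [self.code_for(w) for w in text.split()]

    def decode(self, codes):
        return " ".join(self.words[c] for c in codes)


trie = ToyWordTrie()
codes = trie.encode("linked data is linked data")
restored = trie.decode(codes)
```

In this toy version a repeated word costs one integer instead of a run of characters, which is the intuition behind the storage saving reported above; how the thesis sizes and serializes its codes is not stated in the abstract.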
Query processing in massive linked data is performed by distributing the storage
across multiple partitions. Considerable research has been conducted to partition linked
data based on clusters. Additionally, substantial research on hash-based partitioning,
cloud-based partitioning, and graph-based partitioning has been reported. However,
these sophisticated partitioning algorithms are not based on the semantic relatedness
of the data and suffer from high preprocessing cost. In this thesis, a semantic-based
partitioning method using a novel nexus clustering algorithm is discussed. For every
concept, the core properties of the linked data are identified, and bilevel, nexus-based
hierarchical agglomerative clustering is used to partition the linked data. The proposed
method is evaluated using the gold standard test data sampled from DBpedia across
eight closely related categories. The proposed clustering technique partitions the linked
data with a precision of 98.7% on the gold standard dataset.
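As a rough illustration of property-based agglomerative clustering, the sketch below merges resources whose property sets overlap. It is not the nexus clustering algorithm of the thesis: the Jaccard similarity, the average-link merge rule, the threshold and the sample resources are all assumptions made for the example.

```python
def jaccard(a, b):
    """Overlap between two property sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(items, threshold):
    """Merge clusters bottom-up until no pair is similar enough.

    items: dict mapping a resource name to its set of properties.
    threshold: minimum average-link similarity required to merge (assumed value).
    """
    clusters = [[name] for name in items]

    def similarity(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard(items[x], items[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        score, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical resources and properties, for illustration only.
resources = {
    "Paris": {"country", "population", "mayor"},
    "Berlin": {"country", "population", "mayor"},
    "Einstein": {"birthDate", "field"},
    "Curie": {"birthDate", "field", "award"},
}
partitions = agglomerate(resources, threshold=0.5)
```

With these sample inputs the two cities end up in one partition and the two scientists in another, because resources that share most of their properties are merged first.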
Multiple indexing strategies have been proposed to search and access linked data
easily at any given time. All these extensive indexing schemes involve substantial redundant
data, which greatly increases the storage and computational resources
needed to update the index of the dynamically growing linked data. This thesis introduces
‘trist’, a hybrid data structure combining a tree and doubly linked list to index
linked data. The linked data contain URIs and values. The URIs and values are separately
indexed using ‘URI trist’ and ‘Value trist’, respectively. Compared to the existing
indexing strategies, this indexing approach reduces the storage consumption. The
experimental results using the sampled DBpedia dataset demonstrate that trist-based
indexing achieves a space saving of 60% compared to regular graph-based storage of
linked data. In addition, the proposed trist-based indexing accesses linked data
6000% faster than a regular graph without indexing.
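The abstract describes trist only as a combination of a tree and a doubly linked list, so the following is one possible reading rather than the structure defined in the thesis. It assumes a prefix tree over URI path segments in which every stored entry is also threaded onto a doubly linked list in insertion order; all class and method names are hypothetical.

```python
class TristNode:
    def __init__(self):
        self.children = {}   # path segment -> TristNode (the tree part)
        self.key = None      # full URI if an entry ends at this node
        self.prev = None     # doubly linked list pointers (the list part)
        self.next = None


class ToyTrist:
    def __init__(self):
        self.root = TristNode()
        self.head = None
        self.tail = None

    def insert(self, uri):
        # Descend the tree, sharing nodes for common prefixes.
        node = self.root
        for segment in uri.split("/"):
            if segment not in node.children:
                node.children[segment] = TristNode()
            node = node.children[segment]
        # Link a new entry at the tail of the list; duplicates are ignored.
        if node.key is None:
            node.key = uri
            node.prev = self.tail
            if self.tail is None:
                self.head = node
            else:
                self.tail.next = node
            self.tail = node
        return node

    def contains(self, uri):
        node = self.root
        for segment in uri.split("/"):
            node = node.children.get(segment)
            if node is None:
                return False
        return node.key is not None

    def entries(self):
        # Sequential scan over the linked list, without walking the tree.
        node = self.head
        while node is not None:
            yield node.key
            node = node.next


index = ToyTrist()
for uri in ["dbr/Paris", "dbr/Berlin", "dbo/City", "dbr/Paris"]:
    index.insert(uri)
```

Shared prefixes such as `dbr` are stored once in the tree, while the list gives direct sequential access to entries; this is the kind of trade-off a tree-plus-list hybrid can offer, under the assumptions stated above.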
The ultimate goal of an information retrieval system is to rank the linked data in an
order that appeals to the end user. The existing approaches for ranking linked data are
all atomistic. Often, the problem with ranking linked data is that the data are of various
kinds from multiple sources. This thesis presents a holistic approach to rank linked data
from multiple SPARQL endpoints and present the integrated results. The holistic
rank is computed based on four subranks: endpoint rank, concept rank, predicate rank
and value rank. LDSF also provides an approach to represent the URI form of linked
data to the user in an easily understandable manner. The ordering of Wikipedia is widely
regarded as readable by its users over the web, and the ranking in this thesis is evaluated
based on the ordering of Wikipedia: the proposed ranking correlates up to 99% with the
ordering of Wikipedia on the sampled DBpedia dataset.
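The abstract names the four subranks but does not state how they are combined. The sketch below assumes a simple weighted sum with equal weights and scores normalized to the range 0 to 1; the weights, the score scale and the sample results are hypothetical and serve only to show how one holistic score could order integrated results.

```python
from dataclasses import dataclass


@dataclass
class Subranks:
    endpoint: float
    concept: float
    predicate: float
    value: float


def holistic_score(s, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four subranks (equal weights are an assumption)."""
    parts = (s.endpoint, s.concept, s.predicate, s.value)
    return sum(w * p for w, p in zip(weights, parts))


def rank(results):
    """Order result names from highest to lowest holistic score."""
    return sorted(results, key=lambda name: holistic_score(results[name]), reverse=True)


# Hypothetical results gathered from several endpoints.
results = {
    "a": Subranks(1.0, 1.0, 1.0, 1.0),
    "b": Subranks(0.0, 0.0, 0.0, 0.0),
    "c": Subranks(0.5, 0.5, 0.5, 0.5),
}
ordering = rank(results)
```

Passing different weights to `holistic_score` would let one subrank dominate; which weighting, if any, the thesis uses is not given in the abstract.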
Overall, the proposed LDSF is an efficient framework for storing, partitioning, indexing
and ranking linked data that produces a more satisfactory query result than that
of existing systems.
Keywords
Department of Information Technology, Semantic Web, Linked Data, RDF, SPARQL, Information Retrieval, Storing, Indexing, Partitioning, Clustering, Ranking, Word Encoding, Text Encoding