Schema-Aware Indexes for JSON Document Collections
Date
2023
Authors
D, Uma Priya
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
Web applications, IoT devices, and other real-time applications generate an abundance
of multi-structured data every day, increasing the complexity of data storage and
management. Large organizations such as Amazon, Google, and Facebook use NoSQL
databases to store these large sets of diverse data. Compared to relational databases,
NoSQL databases offer an efficient architecture for meeting the performance and scale
requirements of big data. NoSQL document stores adopt the JSON format as the de facto
standard for storing multi-structured data. The "data first, schema later" approach of
document stores greatly enhances the use of the JSON data format in modern applications.
However, this flexibility poses several challenges for data management and knowledge
discovery tasks.
A JSON collection does not have an explicit schema to describe the internal
structures of documents; instead, the schema is implicit in the data, allowing the
documents to have various structures. Therefore, knowledge of the implicit schemas is
essential to understand the data stored in the collection. This schema information can
be helpful for efficient data retrieval, data integration, query formulation, etc. In
this direction, existing research extracts schemas from JSON documents using their
structural relatedness and generates either a global schema or schema variants. The
global schema is the structural representation of the whole collection that summarises
the unique attributes in a collection. This information is generally used for JSON
document validation, query formulation, etc. As the global schema does not capture the
different sets of attributes available in each document, it does not support data
management tasks such as data integration and query optimization. To overcome this
limitation, a few studies focus on extracting schema variants from the collections.
Schema variants represent the schema versions or distinct schemas of JSON collections
and support these data management tasks effectively. Most literature focuses on extracting the
schema versions from a collection using schema class types (entities) manually
embedded in the documents. Due to the dynamic nature and sheer size of JSON documents,
the manual embedding of class types in each document is not feasible in a real-time
scenario. To address this issue, researchers employ clustering approaches to
automatically identify the class types of a JSON collection in two steps: first, extract
the schemas from the collection; then, cluster the documents using the structural
similarity of the extracted schemas. However, differently annotated JSON schemas
are not only structurally heterogeneous but also semantically heterogeneous. The
literature shows that the automatic identification of class types of JSON documents
based on both structural and semantic similarity of JSON schemas is still in its infancy.
To address these research gaps, this research employs both syntactic and semantic
relationships of JSON schemas to capture the contextual information. In this work, we
propose (i) the Schema Embeddings for JSON Documents (SchemaEmbed) model to capture
contextually similar JSON schemas, (ii) an Embedding-based Clustering approach to group
contextually similar JSON documents, and (iii) the Schema Variants Tree (SVTree) to
represent the schema variants of each cluster. As SVTree contains information about
the core (common) and schema-specific attributes in a cluster, it supports efficient data
retrieval. The proposed approach is evaluated with real-world and synthetic datasets.
The results and findings demonstrate that the proposed approach outperforms the
current approaches significantly in grouping contextually similar JSON documents. In
addition, the impact of clustering in constructing a compact SVTree is also studied.
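The distinction between a global schema, schema variants, and core versus schema-specific attributes can be illustrated with a minimal Python sketch. It is not the SchemaEmbed model or the SVTree itself: it treats a schema simply as the set of dot-separated attribute paths in a document, and the names `attribute_paths` and `summarise` as well as the sample documents are hypothetical.

```python
def attribute_paths(value, prefix=""):
    """Return the set of dot-separated attribute paths in a parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        # Array elements share the path of the array attribute (an assumption).
        for item in value:
            paths |= attribute_paths(item, prefix)
    return paths


def summarise(docs):
    """Compute schema variants, the global schema, and core/specific attributes."""
    # One variant per distinct set of attribute paths.
    variants = {frozenset(attribute_paths(doc)) for doc in docs}
    # Global schema: union of all attributes seen in the collection.
    global_schema = set().union(*variants)
    # Core attributes: those present in every variant.
    core = set(global_schema)
    for variant in variants:
        core &= variant
    # Schema-specific attributes: what each variant adds beyond the core.
    specific = sorted(sorted(variant - core) for variant in variants)
    return variants, global_schema, core, specific


# Hypothetical sample collection with two distinct structures.
docs = [
    {"title": "A", "author": {"name": "X"}, "year": 2020},
    {"title": "B", "author": {"name": "Y"}, "pages": 10},
    {"title": "C", "author": {"name": "Z"}, "year": 2021},
]
variants, global_schema, core, specific = summarise(docs)
```

Here the first and third documents share one variant, so the collection has two variants in total, while the global schema alone would not reveal which attributes occur together.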
The heterogeneous nature of JSON documents increases the complexity of efficient
data retrieval. Indexes have traditionally been used to improve the speed of
data retrieval. Existing indexing techniques for JSON data use a global schema to identify
the unique attributes in a collection and support exact (lexical) matching of path-based
queries. However, they suffer from large index sizes and long data retrieval times. As
JSON schemas are annotated differently, providing semantic support improves search
relevance. Existing work on the semantic search of JSON documents uses knowledge bases
such as WordNet. However, such knowledge bases capture the abstract meaning of JSON
attributes rather than their context. To bridge these research gaps, this research
proposes efficient and compact index structures, namely JSON Index (JIndex) and
Embedding-based JIndex (EJIndex), to support both lexical and semantic matching of
path-based queries. With the help of core and schema-specific attributes of schema
variants stored in SVTree, the proposed indexes reduce the index size by storing only a
subset of attributes rather than all the attributes in a collection. Experimental results
demonstrate that the proposed indexes outperform the existing approaches in retrieving
both lexically and semantically relevant results, significantly reducing index size and
data retrieval time.
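To make the idea of a path-based index over schema variants concrete, the following Python sketch shows one possible arrangement. It is an illustrative assumption, not the JIndex or EJIndex design: each attribute path is indexed once per schema variant rather than once per document, and only exact (lexical) matching is covered. The class `VariantPathIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


def attribute_paths(value, prefix=""):
    """Return the set of dot-separated attribute paths in a parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            paths |= attribute_paths(item, prefix)
    return paths


class VariantPathIndex:
    """Toy index: path -> schema variant ids -> document ids."""

    def __init__(self):
        self.variant_ids = {}                      # frozenset of paths -> variant id
        self.docs_by_variant = defaultdict(list)   # variant id -> document ids
        self.variants_by_path = defaultdict(set)   # path -> variant ids

    def add(self, doc_id, doc):
        paths = frozenset(attribute_paths(doc))
        variant_id = self.variant_ids.get(paths)
        if variant_id is None:
            # New variant: its paths are indexed once, however many documents follow.
            variant_id = len(self.variant_ids)
            self.variant_ids[paths] = variant_id
            for path in paths:
                self.variants_by_path[path].add(variant_id)
        self.docs_by_variant[variant_id].append(doc_id)

    def lookup(self, path):
        """Return ids of documents whose schema variant contains the exact path."""
        result = []
        for variant_id in self.variants_by_path.get(path, ()):
            result.extend(self.docs_by_variant[variant_id])
        return sorted(result)


# Hypothetical usage with three small documents.
index = VariantPathIndex()
index.add(0, {"title": "A", "author": {"name": "X"}, "year": 2020})
index.add(1, {"title": "B", "author": {"name": "Y"}, "pages": 10})
index.add(2, {"title": "C", "author": {"name": "Z"}, "year": 2021})
with_year = index.lookup("year")
```

Because documents 0 and 2 share a variant, the path entries for that variant are stored only once; semantic matching of differently named attributes would need an additional similarity step that this sketch does not attempt.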
As JSON documents evolve and change over time, the implicit schemas must be
extracted and updated in the database to support dynamic data retrieval. Existing
approaches focus either on maintaining the history of schema versions in data lakes or
on updating the global schema. Nevertheless, the schema variants must also be updated
so that user queries return the latest documents. In this work, we propose an
Incremental SchemaEmbed model to generate schema embeddings for new schema variants
of the latest documents while preserving the knowledge of old schema variants. The
Incremental Embedding-based Clustering approach assigns the latest documents to the
respective clusters based on the contextual similarity of their schema variants.
Consequently, the JIndex and EJIndex are updated incrementally to support the retrieval
of the latest documents for user queries. The experimental results on diverse datasets
show that the proposed work is efficient in updating the schema variants and the indexes.
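The incremental assignment step can be sketched in Python as follows. This is a simplified stand-in, not the Incremental SchemaEmbed model: Jaccard similarity over attribute sets replaces the contextual embedding similarity described above, and the function names, the threshold value, and the sample clusters are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute sets (stand-in for embedding similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign_incrementally(clusters, new_variant, threshold=0.5):
    """Assign a new schema variant to the most similar cluster, or open a new one.

    clusters: list of clusters, each a non-empty list of frozensets of attributes.
    Returns the index of the cluster that now holds the variant.
    """
    new_variant = frozenset(new_variant)
    best_index, best_score = None, 0.0
    for i, variants in enumerate(clusters):
        score = max(jaccard(new_variant, v) for v in variants)
        if score > best_score:
            best_index, best_score = i, score
    if best_index is None or best_score < threshold:
        # Nothing similar enough: start a new cluster, leaving old ones untouched.
        clusters.append([new_variant])
        return len(clusters) - 1
    if new_variant not in clusters[best_index]:
        clusters[best_index].append(new_variant)
    return best_index


# Hypothetical existing clusters and two newly arrived schema variants.
clusters = [
    [frozenset({"title", "author", "year"})],
    [frozenset({"sensor", "reading", "ts"})],
]
first = assign_incrementally(clusters, {"title", "author", "year", "pages"})
second = assign_incrementally(clusters, {"lat", "lon"})
```

Existing clusters are only extended, never recomputed, which mirrors the goal of preserving knowledge of old schema variants while absorbing new ones.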
Keywords
JSON, Schema extraction, Schema variants, JSON Indexing