ClustVariants: An Approach for Schema Variants Extraction from JSON Document Collections

Date

2022

Publisher

Institute of Electrical and Electronics Engineers Inc.

Abstract

The use of NoSQL document stores has grown in recent years because they offer the potential for increased scalability, flexibility, and consistency when storing massive collections of variably structured data in JSON format. Although document stores do not impose any structural constraints on the data, the lack of schema information hampers efficient data processing, data management, and data integration. Hence, existing research has focused on identifying a global schema for a collection. However, this comes at the cost of losing essential benefits of a schema, such as a detailed structural description of the data and query optimization. To address this research gap, we propose ClustVariants, a novel approach for discovering the exact schema variants present in a collection. Because the complex structure of large heterogeneous JSON data cannot be analyzed directly, we resolve this limitation by systematically extracting the structure of the data, analyzing the fields, and clustering the homogeneous documents. We apply a distributed Formal Concept Analysis algorithm, using Apache Spark, to identify the schema variants from a large cluster of JSON documents. The experimental study on real datasets shows that ClustVariants is efficient in inferring the exact schema variants of JSON document collections. © 2022 IEEE.
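
As a rough illustration of what "schema variants" means in the abstract, the following single-machine sketch extracts the structure of each document as a set of (field path, type) pairs and groups documents whose structures are identical. It is not the authors' implementation: it replaces the distributed Formal Concept Analysis step on Apache Spark with plain grouping, and the names field_paths and schema_variants, as well as the sample documents, are hypothetical.

```python
import json
from collections import defaultdict


def field_paths(value, prefix=""):
    """Return the set of (path, type name) pairs describing a JSON value."""
    if isinstance(value, dict):
        if not value:
            return {(prefix, "object")}
        paths = set()
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths |= field_paths(child, child_prefix)
        return paths
    if isinstance(value, list):
        if not value:
            return {(prefix, "array")}
        paths = set()
        for item in value:
            # All array elements share one path, marked with "[]".
            paths |= field_paths(item, prefix + "[]")
        return paths
    # Scalar leaf: record the Python type name (int, str, bool, ...).
    return {(prefix, type(value).__name__)}


def schema_variants(documents):
    """Group document indexes by their exact structure (assumed variant notion)."""
    groups = defaultdict(list)
    for index, doc in enumerate(documents):
        groups[frozenset(field_paths(doc))].append(index)
    return dict(groups)


# Hypothetical sample collection with two distinct structures.
docs = [json.loads(text) for text in [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b"}',
    '{"id": 3, "address": {"city": "x"}}',
]]
variants = schema_variants(docs)
for structure, members in variants.items():
    print(sorted(structure), members)
```

Each key of the resulting dictionary is one structure and each value lists the documents that share it, so the first two sample documents fall into one variant and the third into another.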

Keywords

Clustering, Formal Concept Analysis, JSON, NoSQL, Schema Extraction

Citation

2022 IEEE IAS Global Conference on Emerging Technologies, GlobConET 2022, 2022, p. 515-520
