Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

Search Results

Now showing 1 - 4 of 4
  • Item
    JSON Document Clustering Based on Structural Similarity and Semantic Fusion
    (Springer Science and Business Media Deutschland GmbH, 2023) Uma Priya, D.; Santhi Thilagam, P.S.
    The shift toward real-time applications generates exponentially increasing amounts of JSON data on the web. Dealing with the heterogeneous structures of JSON document collections is challenging for efficient data management and knowledge discovery. Clustering JSON documents has become a significant issue in organizing large data collections. Existing research has focused on clustering JSON documents using structural or semantic similarity measures. However, differently annotated JSON structures are also related by the context of the JSON attributes. As a result, existing work cannot identify the context hidden in the schemas, which underscores the importance of leveraging the syntactic, semantic, and contextual properties of heterogeneous JSON schemas. To address this research gap, this work proposes JSON Similarity (JSim), a novel approach for clustering JSON documents by combining the structural and semantic similarity scores of JSON schemas. To capture more semantics, a semantic fusion method is proposed, which correlates schemas using semantic as well as contextual similarity measures. The JSON documents are clustered based on the weighted similarity matrix. The results show that the proposed approach significantly outperforms current approaches. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
  • Item
    ClustVariants: An Approach for Schema Variants Extraction from JSON Document Collections
    (Institute of Electrical and Electronics Engineers Inc., 2022) Uma Priya, D.; Santhi Thilagam, P.S.
    The use of NoSQL document stores has grown in recent years, as they offer the potential for increased scalability, flexibility, and consistency when storing massive collections of varied structured data in JSON format. Although document stores do not impose any structural constraint on the data, the lack of schema information challenges efficient data processing, data management, and data integration. Hence, extant research has focused on identifying the global schema for a collection. Nevertheless, this comes at the cost of losing essential benefits of a schema, such as a detailed structural description of the data, query optimization, etc. To address this research gap, we propose ClustVariants, a novel approach for discovering the exact schema variants available in a collection. Because the complex structure of large heterogeneous JSON data cannot be analyzed directly, we resolve this limitation by systematically extracting the structure of the data, analyzing the fields, and clustering the homogeneous documents. We apply a distributed Formal Concept Analysis algorithm, using Apache Spark, to identify the schema variants from a large cluster of JSON documents. The experimental study on real datasets proves that ClustVariants is efficient in inferring exact schema variants of JSON document collections. © 2022 IEEE.
  • Item
    Leveraging Structural and Semantic Measures for JSON Document Clustering
    (IICM, 2023) Uma Priya, D.; Santhi Thilagam, P.S.
    In recent years, the increased use of smart devices and digital business opportunities has generated massive heterogeneous JSON data daily, making efficient data storage and management more difficult. Existing research uses different similarity metrics and clusters the documents to support the above tasks effectively. However, extant approaches have focused on either structural or semantic similarity of schemas. As JSON documents are application-specific, differently annotated JSON schemas are not only structurally heterogeneous but also differ by the context of the JSON attributes. Therefore, there is a need to consider the structural, semantic, and contextual properties of JSON schemas to perform meaningful clustering of JSON documents. This work proposes an approach to cluster heterogeneous JSON documents using the similarity fusion method. The similarity fusion matrix is constructed using structural, semantic, and contextual measures of JSON schemas. The experimental results demonstrate that the proposed approach outperforms the existing approaches significantly. © 2023, IICM. All rights reserved.
  • Item
    JSON document clustering based on schema embeddings
    (SAGE Publications Ltd, 2024) Uma Priya, D.U.; Santhi Thilagam, P.S.
    The growing popularity of JSON as a data storage and interchange format increases the availability of massive multi-structured data collections. Clustering JSON documents has become a significant issue in organising large data collections. Existing research uses various structural similarity measures to perform clustering. However, differently annotated JSON structures may also encode semantic relatedness, necessitating the use of both syntactic and semantic properties of heterogeneous JSON schemas. Using the SchemaEmbed model, this paper proposes an embedding-based clustering approach for grouping contextually similar JSON documents. The SchemaEmbed model is designed using the pre-trained Word2Vec model and a deep autoencoder that considers both syntactic and semantic information of JSON schemas for clustering the documents. The Word2Vec model learns the attribute embeddings, and a deep autoencoder is designed to generate context-aware schema embeddings. Finally, contextually similar JSON documents are grouped using a clustering algorithm. The effectiveness of the proposed work is evaluated using both real and synthetic datasets. The results show that the proposed approach improves clustering quality significantly, with a high NMI score of 75%. In addition, we demonstrate that clustering results obtained by contextual similarity are superior to those obtained by traditional semantic similarity models. © The Author(s) 2022.
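
Several of the abstracts listed above describe clustering JSON documents from a weighted combination of structural and semantic similarity scores. The following is a minimal sketch of that general idea only, not the authors' JSim, similarity fusion, or SchemaEmbed implementations. All function names and the weight `w_struct` are hypothetical. Structural similarity is assumed here to be Jaccard overlap of attribute paths, and the "semantic" score is a crude stand-in (overlap of lowercased attribute-name tokens), where the papers use richer semantic and contextual measures.

```python
import re


def schema_paths(value, prefix=""):
    """Collect the dotted attribute paths of a parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= schema_paths(child, path)
    elif isinstance(value, list):
        # Array elements share the path of the array itself.
        for child in value:
            paths |= schema_paths(child, prefix)
    return paths


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def name_tokens(paths):
    """Split attribute names on separators and camelCase, lowercased.

    Stand-in for a real semantic measure (assumption, not from the papers).
    """
    tokens = set()
    for path in paths:
        tokens.update(t.lower() for t in re.findall(r"[A-Za-z][a-z]*|\d+", path))
    return tokens


def fused_similarity_matrix(docs, w_struct=0.5):
    """Weighted sum of structural and token-based similarity for each pair."""
    schemas = [schema_paths(doc) for doc in docs]
    tokens = [name_tokens(schema) for schema in schemas]
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            structural = jaccard(schemas[i], schemas[j])
            semantic = jaccard(tokens[i], tokens[j])
            matrix[i][j] = w_struct * structural + (1 - w_struct) * semantic
    return matrix


docs = [
    {"author": {"first_name": "A"}, "title": "x"},
    {"author": {"firstName": "B"}, "title": "y"},
    {"sensor_id": 1, "reading": 2.0},
]
matrix = fused_similarity_matrix(docs)
```

In this sketch the first two documents differ structurally (`first_name` versus `firstName`) but share the same name tokens, so the fused score pulls them together, while the sensor document stays apart. The resulting matrix could then be passed to any clustering algorithm that accepts precomputed similarities.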