JSON document clustering based on schema embeddings

dc.contributor.author: Uma Priya, D.U.
dc.contributor.author: Santhi Thilagam, P.S.
dc.date.accessioned: 2026-02-04T12:24:18Z
dc.date.issued: 2024
dc.description.abstract: The growing popularity of JSON as the data storage and interchange format increases the availability of massive multi-structured data collections. Clustering JSON documents has become a significant issue in organising large data collections. Existing research uses various structural similarity measures to perform clustering. However, differently annotated JSON structures may also encode semantic relatedness, necessitating the use of both syntactic and semantic properties of heterogeneous JSON schemas. Using the SchemaEmbed model, this paper proposes an embedding-based clustering approach for grouping contextually similar JSON documents. The SchemaEmbed model is designed using the pre-trained Word2Vec model and a deep autoencoder that considers both syntactic and semantic information of JSON schemas for clustering the documents. The Word2Vec model learns the attribute embeddings, and a deep autoencoder is designed to generate context-aware schema embeddings. Finally, the context-based similar JSON documents are grouped using a clustering algorithm. The effectiveness of the proposed work is evaluated using both real and synthetic datasets. The results and findings show that the proposed approach improves clustering quality significantly, with a high NMI score of 75%. In addition, we demonstrate that clustering results obtained by contextual similarity are superior to those obtained by traditional semantic similarity models. © The Author(s) 2022.
dc.identifier.citation: Journal of Information Science, 2024, 50, 5, pp. 1112-1130
dc.identifier.issn: 0165-5515
dc.identifier.uri: https://doi.org/10.1177/01655515221116522
dc.identifier.uri: https://idr.nitk.ac.in/handle/123456789/20913
dc.publisher: SAGE Publications Ltd
dc.subject: Clustering algorithms
dc.subject: Data acquisition
dc.subject: Digital storage
dc.subject: Semantics
dc.subject: Syntactics
dc.subject: Auto encoders
dc.subject: Clusterings
dc.subject: Contextual similarity
dc.subject: Data collection
dc.subject: Data interchange
dc.subject: Data storage
dc.subject: Deep autoencoder
dc.subject: Document Clustering
dc.subject: Embeddings
dc.subject: JSON
dc.title: JSON document clustering based on schema embeddings
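
The abstract reports clustering quality as a normalized mutual information (NMI) score. As a reading aid, the following is a minimal sketch of how NMI between ground-truth classes and predicted clusters can be computed. It assumes the arithmetic-mean normalization, NMI = I(T; P) / ((H(T) + H(P)) / 2); the record does not say which normalization the authors used, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(truth, pred):
    """Mutual information between two labelings of the same items."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    truth_counts = Counter(truth)
    pred_counts = Counter(pred)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log( p(t, p) / (p(t) * p(p)) ), written with raw counts
        mi += (c / n) * log(c * n / (truth_counts[t] * pred_counts[p]))
    return mi


def nmi(truth, pred):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("labelings must be non-empty and of equal length")
    denom = (entropy(truth) + entropy(pred)) / 2
    if denom == 0:
        # Both labelings put every item in a single group: treat as identical.
        return 1.0
    return mutual_information(truth, pred) / denom


# Hypothetical labels for six JSON documents (not data from the paper).
true_classes = ["book", "book", "movie", "movie", "song", "song"]
found_clusters = [0, 0, 1, 1, 1, 2]
score = nmi(true_classes, found_clusters)
print(f"NMI = {score:.3f}")
```

NMI is invariant to how cluster IDs are named, so a clustering that matches the ground truth up to relabeling scores 1, while a clustering that is statistically independent of the ground truth scores 0.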
