Detecting Semantic Similarity of Documents Using Natural Language Processing
Date
2021
Publisher
Elsevier B.V.
Abstract
The similarity of natural-language documents can be judged by how similar the embeddings of their textual content are. Embeddings capture the lexical and semantic information of texts, and they can be obtained either through bag-of-words approaches that combine the embeddings of the constituent words or through pre-trained encoders. This paper examines various existing approaches to obtaining embeddings from texts, which are then used to detect similarity between them. A novel model that builds upon the Universal Sentence Encoder is also developed for the same purpose. The explored models are tested on the SICK dataset, and the correlation between the ground-truth values given in the dataset and the predicted similarity is computed using the Pearson, Spearman and Kendall's Tau correlation metrics. Experimental results demonstrate that the novel model outperforms the existing approaches. Finally, an application is developed using the novel model to detect semantic similarity between a set of documents. © 2021 Elsevier B.V. All rights reserved.
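The pipeline outlined in the abstract (embed each text, compare embeddings, correlate predictions with ground truth) can be illustrated with a minimal sketch. This is not the paper's model: the word vectors in `TOY_VECTORS`, the sentence pairs, and the `gold` scores are invented for illustration, the embedding is a plain average of word vectors standing in for the bag-of-words approach, cosine similarity is assumed as the comparison measure, and only the Pearson metric is shown.

```python
import math

# Toy 3-dimensional word vectors, made up for illustration only.
# A real system would load pre-trained embeddings or use an encoder.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "runs": [0.1, 0.9, 0.0],
    "plays": [0.2, 0.8, 0.1],
    "car": [0.0, 0.1, 0.9],
    "stops": [0.1, 0.3, 0.8],
}


def embed(text):
    """Average the vectors of known words (a simple bag-of-words embedding)."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical sentence pairs with hypothetical human relatedness scores.
pairs = [
    ("dog runs", "puppy plays"),
    ("dog runs", "car stops"),
    ("car stops", "car stops"),
]
gold = [4.0, 1.5, 5.0]

predicted = [cosine_similarity(embed(a), embed(b)) for a, b in pairs]
score = pearson(predicted, gold)
print(predicted, score)
```

Averaging word vectors discards word order, which is one reason pre-trained sentence encoders are considered alongside bag-of-words approaches. Rank-based metrics such as Spearman and Kendall's Tau would be computed over the same `predicted` and `gold` lists.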
Keywords
Computational Linguistics, Deep Learning, Embeddings, Natural Language Processing, Semantic Similarity
Citation
Procedia CIRP, 2021, Vol. 189, p. 128-135
