Context Sensitive Tamil Language Spellchecker Using RoBERTa

Date

2023

Publisher

Springer Science and Business Media Deutschland GmbH

Abstract

A spellchecker is a tool that identifies spelling errors in a piece of text and lists possible suggestions for each misspelled word. Many spellcheckers are available for languages such as English, but only a limited number of spell-checking tools exist for low-resource languages like Tamil. In this paper, we present an approach to developing a Tamil spellchecker using the RoBERTa (xlm-roberta-base) model. We also propose an algorithm that generates the test dataset by introducing errors into a piece of text. The spellchecker detects mistakes in a given text using a corpus of unique Tamil words collected from different sources such as Wikipedia and Tamil conversations, and uses the proposed model to list suggestions that are potential contextual replacements for each misspelled word. When a few errors were introduced into a piece of text collected from a Wikipedia article and the text was tested on our model, an accuracy of 91.14% was achieved for error detection. Contextually correct words were then suggested for the detected erroneous words. Our spellchecker performed better than some of the existing Tamil spellcheckers, with both higher accuracy and fewer false positives. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
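
The two mechanical steps named in the abstract, flagging words that are absent from a corpus of unique Tamil words and generating test data by introducing errors, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-word `LEXICON`, the function names `detect_errors` and `inject_errors`, and the single-code-point deletion used as the error type are hypothetical and are not taken from the paper, whose actual error-generation algorithm and corpus are not described in this record. The contextual suggestion step with xlm-roberta-base is not shown.

```python
import random

# Hypothetical toy lexicon. The paper's corpus is built from sources such as
# Wikipedia and Tamil conversations; these three words are only placeholders.
LEXICON = {"நான்", "பள்ளிக்கு", "சென்றேன்"}


def detect_errors(text, lexicon):
    """Return (index, token) pairs for whitespace tokens missing from the lexicon."""
    return [(i, tok) for i, tok in enumerate(text.split()) if tok not in lexicon]


def inject_errors(text, n_errors, seed=0):
    """Corrupt up to n_errors randomly chosen tokens.

    Assumed error type: delete one Unicode code point from the token. This is a
    simplification; it may leave an orphaned Tamil vowel sign or pulli, and it
    is not the error model proposed in the paper.
    """
    rng = random.Random(seed)
    tokens = text.split()
    candidates = [i for i, tok in enumerate(tokens) if len(tok) > 1]
    chosen = rng.sample(candidates, min(n_errors, len(candidates)))
    for i in chosen:
        tok = tokens[i]
        pos = rng.randrange(len(tok))
        tokens[i] = tok[:pos] + tok[pos + 1:]
    return " ".join(tokens), sorted(chosen)


clean = "நான் பள்ளிக்கு சென்றேன்"
noisy, corrupted_positions = inject_errors(clean, n_errors=1, seed=0)
flagged_positions = [i for i, _ in detect_errors(noisy, LEXICON)]
print(corrupted_positions, flagged_positions)
```

Because the positions of the injected errors are known, comparing them with the flagged positions gives a detection accuracy and a false-positive count for a test text, which is the kind of evaluation the abstract reports.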

Keywords

Error Correction, Error Detection, Spellchecker, Tamil, XLM-RoBERTa

Citation

Communications in Computer and Information Science, 2023, Vol. 1802 CCIS, p. 51-61
