HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming

Rajalakshmi, R.; Selvaraj, S.; Faerie Mattins, R.; Vasudevan, P.; Anand Kumar, A.K.

Date

2023

Publisher

Academic Press

Abstract

Offensive content or hate speech is defined as any form of communication that aims to annoy, harass, disturb, or anger an individual or community based on factors such as faith, ethnicity, appearance, or sexuality. Offensive content posted in regional languages has increased with the popularity of social networks and other apps among the general public. This work proposes a method to detect and identify hate speech or offensive content in Tamil. We used the HASOC 2021 data set, which contains YouTube comments written in the Tamil language and in Tamil script. We attempt to find suitable embedding techniques for Tamil text representation by applying TF-IDF and pre-trained transformer models such as BERT, XLM-RoBERTa, IndicBERT, mBERT, TaMillion, and MuRIL. As Tamil is a morphologically rich language, we analyze in detail how enhanced stemming algorithms affect the performance of hate speech detection in Tamil. An extensive experimental study was performed with different classifiers, such as logistic regression, SVM, stochastic gradient descent, decision tree, and ensemble learning models, in combination with the above techniques. The results show that stop word removal produces mixed results and does not guarantee an improvement in classifier performance when detecting offensive content in Tamil data. However, performance on stemmed data shows a significant improvement over un-stemmed data in Tamil texts. As the data is highly imbalanced, we also applied an oversampling/downsampling technique to analyze its role in designing the best offensive content classifier for Tamil text. The highest performance was achieved by combining stemming of the text data, embedding with the multilingual model MuRIL, and a majority voting ensemble as the downstream classifier. We achieved an F1-score of 84% and an accuracy of 86% for detecting offensive content in Tamil YouTube comments. © 2022 Elsevier Ltd
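Two of the steps named in the abstract, rebalancing an imbalanced data set and combining classifiers by majority vote, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_oversample` and `majority_vote`, the `OFF`/`NOT` labels, and the toy inputs are all hypothetical, and plain random duplication stands in for whichever oversampling technique the paper used.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class (illustrative only)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def majority_vote(predictions):
    """Hard voting. `predictions` holds one list of labels per classifier;
    for each example the most frequent label wins, and a tie goes to the
    label that appears first among the votes."""
    voted = []
    for votes in zip(*predictions):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Hypothetical toy data: three non-offensive comments, one offensive.
texts = ["c1", "c2", "c3", "c4"]
labels = ["NOT", "NOT", "NOT", "OFF"]
balanced_texts, balanced_labels = random_oversample(texts, labels)

# Hypothetical outputs of three classifiers on two test comments.
final = majority_vote([["OFF", "NOT"], ["OFF", "OFF"], ["NOT", "NOT"]])
```

In a real pipeline, oversampling would be applied to the training split only, and the per-classifier predictions would come from models trained on the stemmed, embedded text.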

Keywords

Data handling, Decision trees, Deep learning, Gradient methods, Natural language processing systems, Speech communication, Speech recognition, Stochastic models, Content identifications, Data preprocessing, Hate speech in Tamil YouTube comment, MuRIL, Offensive content identification in Tamil, Stop word, Stop word removal and enhanced stemming for Tamil text, Tamil text data pre-processing, Text data, Transformer for Tamil, Word removals, YouTube, Stochastic systems

Citation

Computer Speech and Language, 2023, vol. 78

URI

https://doi.org/10.1016/j.csl.2022.101464
https://idr.nitk.ac.in/handle/123456789/22007

Collections

Journal Articles
