HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming

Rajalakshmi, R.; Selvaraj, S.; Faerie Mattins, R.; Vasudevan, P.; Anand Kumar, A.K.

dc.contributor.author	Rajalakshmi, R.
dc.contributor.author	Selvaraj, S.
dc.contributor.author	Faerie Mattins, R.
dc.contributor.author	Vasudevan, P.
dc.contributor.author	Anand Kumar, A.K.
dc.date.accessioned	2026-02-04T12:26:50Z
dc.date.issued	2023
dc.description.abstract	Offensive content or hate speech is defined as any form of communication that aims to annoy, harass, disturb, or anger an individual or community based on factors such as faith, ethnicity, appearance, or sexuality. Offensive content posted in regional languages has increased with the popularity of social networks and other apps among the general public. This work proposes a method to detect and identify hate speech or offensive content in Tamil. We use the HASOC 2021 data set, which contains YouTube comments written in the Tamil language and in Tamil script. We look for suitable embedding techniques for Tamil text representation by applying TF-IDF and pre-trained transformer models such as BERT, XLM-RoBERTa, IndicBERT, mBERT, TaMillion, and MuRIL. Because Tamil is a morphologically rich language, we analyze in detail how enhanced stemming algorithms affect the performance of hate speech detection in Tamil. An extensive experimental study was performed with different classifiers such as logistic regression, SVM, stochastic gradient descent, decision tree, and ensemble learning models in combination with the above techniques. The results show that stop word removal produces mixed results and does not guarantee an improvement in classifier performance for detecting offensive content in Tamil data. However, performance on stemmed data shows a significant improvement over unstemmed data in Tamil texts. As the data is highly imbalanced, we also applied an oversampling/downsampling technique to analyze its role in designing the best offensive-content classifier for Tamil text. The highest performance was achieved by stemming the text data, embedding it with the multilingual model MuRIL, and using a majority voting ensemble as the downstream classifier. We achieved an F1-score of 84% and an accuracy of 86% for detecting offensive content in Tamil YouTube comments. © 2022 Elsevier Ltd
dc.identifier.citation	Computer Speech and Language, 2023, 78
dc.identifier.issn	0885-2308
dc.identifier.uri	https://doi.org/10.1016/j.csl.2022.101464
dc.identifier.uri	https://idr.nitk.ac.in/handle/123456789/22007
dc.publisher	Academic Press
dc.subject	Data handling
dc.subject	Decision trees
dc.subject	Deep learning
dc.subject	Gradient methods
dc.subject	Natural language processing systems
dc.subject	Speech communication
dc.subject	Speech recognition
dc.subject	Stochastic models
dc.subject	Content identifications
dc.subject	Data preprocessing
dc.subject	Hate speech in Tamil YouTube comment
dc.subject	MuRIL
dc.subject	Offensive content identification in Tamil
dc.subject	Stop word
dc.subject	Stop word removal and enhanced stemming for Tamil text
dc.subject	Tamil text data pre-processing
dc.subject	Text data
dc.subject	Transformer for Tamil
dc.subject	Word removals
dc.subject	YouTube
dc.subject	Stochastic systems
dc.title	HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming
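
The abstract names a majority voting ensemble as the best-performing downstream classifier. The following is a minimal sketch of hard (majority) voting only, not the authors' implementation: the function name `majority_vote`, the labels "OFF" and "NOT", the example predictions, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier label lists into one list by hard voting.

    predictions[i][j] is the label that classifier i assigns to comment j.
    Tie-breaking is an assumption of this sketch: among tied labels, the one
    voted for by the earliest classifier in the list wins.
    """
    if not predictions:
        return []
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all classifiers must label the same number of items")

    combined = []
    for votes in zip(*predictions):  # votes = one label per classifier
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first label, in classifier order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical predictions from three base classifiers on four comments.
lr_preds = ["OFF", "NOT", "OFF", "NOT"]
svm_preds = ["OFF", "OFF", "NOT", "NOT"]
sgd_preds = ["NOT", "OFF", "OFF", "NOT"]

print(majority_vote([lr_preds, svm_preds, sgd_preds]))
```

With an odd number of base classifiers and two labels, no ties can occur, so the tie-breaking rule only matters for even-sized ensembles or multi-class settings.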

Collections

Journal Articles