HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming

dc.contributor.authorRajalakshmi, R.
dc.contributor.authorSelvaraj, S.
dc.contributor.authorFaerie Mattins, R.
dc.contributor.authorVasudevan, P.
dc.contributor.authorAnand Kumar, A.K.
dc.date.accessioned2026-02-04T12:26:50Z
dc.date.issued2023
dc.description.abstractOffensive content or hate speech is defined as any form of communication that aims to annoy, harass, disturb, or anger an individual or community based on factors such as faith, ethnicity, appearance, or sexuality. The volume of offensive content posted in regional languages has increased with the popularity of social networks and other apps among the general public. This work proposes a method to detect and identify hate speech or offensive content in Tamil. We used the HASOC 2021 data set, which contains YouTube comments in the Tamil language written in Tamil script. We seek suitable embedding techniques for Tamil text representation by applying TF-IDF and pre-trained transformer models such as BERT, XLM-RoBERTa, IndicBERT, mBERT, TaMillion, and MuRIL. As Tamil is a morphologically rich language, we analyze in detail how enhanced stemming algorithms affect hate speech detection performance in Tamil. An extensive experimental study was performed with different classifiers, such as logistic regression, SVM, stochastic gradient descent, decision tree, and ensemble learning models, in combination with the above techniques. The results show that stop word removal produces mixed results and does not guarantee improved classifier performance in detecting offensive content in Tamil data. However, performance on stemmed data shows a significant improvement over un-stemmed data in Tamil texts. As the data is highly imbalanced, we also applied oversampling/downsampling to analyze its role in designing the best offensive content classifier for Tamil text. The highest performance was achieved by combining stemming of the text data, embedding with the multilingual model MuRIL, and a majority voting ensemble as the downstream classifier. We achieved an F1-score of 84% and an accuracy of 86% for detecting offensive content in Tamil YouTube comments. © 2022 Elsevier Ltd
dc.identifier.citationComputer Speech and Language, 2023, 78
dc.identifier.issn0885-2308
dc.identifier.urihttps://doi.org/10.1016/j.csl.2022.101464
dc.identifier.urihttps://idr.nitk.ac.in/handle/123456789/22007
dc.publisherAcademic Press
dc.subjectData handling
dc.subjectDecision trees
dc.subjectDeep learning
dc.subjectGradient methods
dc.subjectNatural language processing systems
dc.subjectSpeech communication
dc.subjectSpeech recognition
dc.subjectStochastic models
dc.subjectContent identifications
dc.subjectData preprocessing
dc.subjectHate speech in tamil youtube comment
dc.subjectMuRIL
dc.subjectOffensive content identification in tamil
dc.subjectStop word
dc.subjectStop word removal and enhanced stemming for tamil text
dc.subjectTamil text data pre-processing
dc.subjectText data
dc.subjectTransformer for tamil
dc.subjectWord removals
dc.subjectYouTube
dc.subjectStochastic systems
dc.titleHOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming
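
The abstract mentions two steps that are easy to illustrate in isolation: rebalancing an imbalanced data set and combining classifiers by majority voting. The following is a minimal sketch of both, not the authors' implementation. The function names `random_oversample` and `majority_vote`, the label strings "OFF" and "NOT", and the choice of plain random oversampling are assumptions made for illustration; the record does not say which resampling variant or which voters were used.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until every
    class has as many examples as the largest class (assumed variant)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def majority_vote(predictions_per_classifier):
    """Hard voting: for each example, return the label predicted by the
    most classifiers. Use an odd number of voters to avoid ties in the
    two-class case."""
    combined = []
    for votes in zip(*predictions_per_classifier):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


if __name__ == "__main__":
    # Hypothetical placeholder comments, three "NOT" and one "OFF".
    texts = ["c1", "c2", "c3", "c4"]
    labels = ["NOT", "NOT", "NOT", "OFF"]
    balanced_texts, balanced_labels = random_oversample(texts, labels)
    print(Counter(balanced_labels))

    # Hypothetical predictions from three classifiers on three comments.
    predictions = [
        ["OFF", "NOT", "OFF"],
        ["OFF", "OFF", "NOT"],
        ["NOT", "NOT", "NOT"],
    ]
    print(majority_vote(predictions))
```

In a full pipeline, each inner list passed to `majority_vote` would come from one trained classifier applied to the same embedded, stemmed comments, and oversampling would be applied to the training split only.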
