HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming
Date
2023
Publisher
Academic Press
Abstract
Offensive content or hate speech is defined as any form of communication that aims to annoy, harass, disturb, or anger an individual or community based on factors such as faith, ethnicity, appearance, or sexuality. Offensive content posted in regional languages has increased with the growing popularity of social networks and other apps among the general public. This work proposes a method to detect and identify hate speech or offensive content in Tamil. We used the HASOC 2021 data set, which contains YouTube comments in the Tamil language written in Tamil script. We sought suitable embedding techniques for Tamil text representation by applying TF-IDF and pre-trained transformer models such as BERT, XLM-RoBERTa, IndicBERT, mBERT, TaMillion, and MuRIL. As Tamil is a morphologically rich language, we also analyzed in detail how enhanced stemming algorithms affect hate speech detection in Tamil. An extensive experimental study was performed with different classifiers, such as logistic regression, SVM, stochastic gradient descent, decision tree, and ensemble learning models, in combination with the above techniques. The results show that stop word removal produces mixed results and does not guarantee an improvement in classifier performance when detecting offensive content in Tamil data. However, performance on stemmed data shows a significant improvement over un-stemmed data for Tamil texts. As the data is highly imbalanced, we also applied an oversampling/downsampling technique to analyze its role in designing the best offensive content classifier for Tamil text. The highest performance was achieved by a combination of stemming the text data, embedding it with the multilingual model MuRIL, and using a majority voting ensemble as the downstream classifier. We achieved an F1-score of 84% and an accuracy of 86% for detecting offensive content in Tamil YouTube comments. © 2022 Elsevier Ltd
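As an illustration only, two of the ideas named in the abstract (rebalancing an imbalanced data set by oversampling, and combining several classifiers by majority voting) can be sketched in plain Python. This is not the authors' implementation: the function names `random_oversample` and `majority_vote`, the placeholder labels `"OFF"` and `"NOT"`, and the choice of simple random duplication as the oversampling strategy are all assumptions made for the sketch. The abstract does not specify which oversampling method was used.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until every
    class has as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw (with replacement) the extra copies needed to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def majority_vote(predictions):
    """Combine per-classifier predictions by hard (majority) voting.

    `predictions` is a list with one label list per classifier, all of
    equal length. With an odd number of binary classifiers no ties occur.
    """
    voted = []
    for votes in zip(*predictions):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    # Toy data: three non-offensive comments and one offensive comment.
    texts = ["c1", "c2", "c3", "c4"]
    labels = ["NOT", "NOT", "NOT", "OFF"]
    balanced_texts, balanced_labels = random_oversample(texts, labels)
    print(Counter(balanced_labels))

    # Three hypothetical classifiers voting on two comments.
    print(majority_vote([["OFF", "NOT"], ["OFF", "OFF"], ["NOT", "NOT"]]))
```

In a real pipeline, oversampling would be applied to the training split only, and the votes would come from classifiers trained on the stemmed, embedded text rather than from hand-written lists.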
Keywords
Data handling, Decision trees, Deep learning, Gradient methods, Natural language processing systems, Speech communication, Speech recognition, Stochastic models, Content identifications, Data preprocessing, Hate speech in tamil youtube comment, MuRIL, Offensive content identification in tamil, Stop word, Stop word removal and enhanced stemming for tamil text, Tamil text data pre-processing, Text data, Transformer for tamil, Word removals, YouTube, Stochastic systems
Citation
Computer Speech and Language, 2023, 78
