Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

Search Results

Now showing 1 - 3 of 3
  • Item
    HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming
    (Academic Press, 2023) Rajalakshmi, R.; Selvaraj, S.; Faerie Mattins, R.; Vasudevan, P.; Anand Kumar, A.K.
    Offensive content or hate speech is defined as any form of communication that aims to annoy, harass, disturb, or anger an individual or community based on factors such as faith, ethnicity, appearance, or sexuality. Offensive content posted in regional languages has increased with the growing popularity of social networks and other apps among the general public. This work proposes a method to detect and identify hate speech or offensive content in Tamil. We used the HASOC 2021 data set, which contains YouTube comments in the Tamil language written in Tamil script. We attempt to find suitable embedding techniques for Tamil text representation by applying TF-IDF and pre-trained transformer models such as BERT, XLM-RoBERTa, IndicBERT, mBERT, TaMillion, and MuRIL. As Tamil is a morphologically rich language, we analyze in detail how enhanced stemming algorithms affect the performance of hate speech detection in Tamil. An extensive experimental study was performed with different classifiers, such as logistic regression, SVM, stochastic gradient descent, decision tree, and ensemble learning models, in combination with the above techniques. The results show that stop word removal produces mixed results and does not guarantee an improvement in classifier performance when detecting offensive content in Tamil data. However, performance on stemmed data shows a significant improvement over un-stemmed data in Tamil texts. As the data is highly imbalanced, we also applied an oversampling/downsampling technique to analyze its role in designing the best offensive-content classifier for Tamil text. The highest performance was achieved by combining stemming of the text data, embedding with the multilingual model MuRIL, and a majority voting ensemble as the downstream classifier. We achieved an F1-score of 84% and an accuracy of 86% for detecting offensive content in Tamil YouTube comments. © 2022 Elsevier Ltd
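The majority voting step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the labels `"OFF"` and `"NOT"`, and the tie-breaking rule are all assumptions made for illustration, and the abstract does not say which classifiers fed the ensemble.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several classifiers, one vote each.

    predictions: list of lists, one inner list per classifier, all the
    same length (one label per sample).
    Ties go to the label seen first among the classifiers; this is an
    arbitrary choice for the sketch, not something the abstract specifies.
    """
    if not predictions:
        raise ValueError("need predictions from at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")

    combined = []
    # zip(*predictions) yields, for each sample, the tuple of votes cast
    # by every classifier.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three classifiers on three comments.
clf_a = ["OFF", "NOT", "OFF"]
clf_b = ["OFF", "OFF", "NOT"]
clf_c = ["NOT", "OFF", "OFF"]
print(majority_vote([clf_a, clf_b, clf_c]))  # ['OFF', 'OFF', 'OFF']
```

Each sample receives the label chosen by most classifiers, so a single classifier's error on a comment is outvoted when the other two agree.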
  • Item
    Overlapping word removal is all you need: revisiting data imbalance in hope speech detection
    (Taylor and Francis Ltd., 2024) RamakrishnaIyer LekshmiAmmal, H.; Ravikiran, M.; Nisha, G.; Balamuralidhar, N.; Madhusoodanan, A.; Anand Kumar, A.K.; Chakravarthi, B.R.
    Hope speech detection is a new task for finding and highlighting positive comments or supportive content in user-generated social media comments. For this task, we used the shared-task multilingual dataset on Hope Speech Detection for Equality, Diversity, and Inclusion (HopeEDI), which covers three languages: English, code-switched Tamil, and Malayalam. In this paper, we present deep learning techniques using context-aware string embeddings for word representations and Recurrent Neural Network (RNN) and pooled document embeddings for text representation. We evaluated and compared the three models for each language with different approaches. Our proposed methodology achieved higher performance than the baselines. The highest weighted average F-scores of 0.93, 0.58, and 0.84 were obtained on the task organisers' final evaluation test set. The proposed models outperform the baselines by 3%, 2%, and 11% in absolute terms for English, Tamil, and Malayalam, respectively. © 2023 Informa UK Limited, trading as Taylor & Francis Group.
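For readers unfamiliar with the weighted average F-score reported in this abstract, the sketch below shows one common definition: the mean of per-class F1 scores, weighted by each class's share of the true labels. Whether the task organisers computed it exactly this way is an assumption. The function name `weighted_f1` and the labels `"hope"` and `"not"` are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels for five comments.
y_true = ["hope", "hope", "not", "not", "not"]
y_pred = ["hope", "not", "not", "not", "hope"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6
```

Weighting by support means the majority class dominates the score, which is worth keeping in mind when comparing results on imbalanced data.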
  • Item
    The Effect of Phrase Vector Embedding in Explainable Hierarchical Attention-Based Tamil Code-Mixed Hate Speech and Intent Detection
    (Institute of Electrical and Electronics Engineers Inc., 2024) Sharmila Devi, V.S.; Subramanian, S.; Anand Kumar, A.K.
    The substantial growth in social media users has led to a significant increase in code-mixed content on social media platforms. Millions of users on these platforms upload pictures and videos and post comments regarding their recent or exciting activities. In responding to this uploaded content, a few users occasionally use offensive language to insult others or specific groups. Social media platforms face challenges in identifying and removing hate speech and objectionable content in various languages. Hate speech, in its general sense, refers to harmful posts directed at individuals or groups based on factors such as their sexuality, religion, community affiliation, disability, and others. Typically, offensive language is used directly or indirectly in hate speech posts to insult someone, causing psychological distress to users. In light of this, we propose developing a system to automatically block, remove, or report posts written in code-mixed Tamil that contain hate speech. We gathered code-mixed Tamil comments from Twitter and the Helo App, categorizing them as hate speech and classifying their intent. We identified three categories of hate speech intent, namely Targeted Individual (TI), Targeted Group (TG), and Others (O). The Targeted Individual (TI) class encompasses posts aimed at a specific individual. The Targeted Group (TG) category primarily covers posts that target people based on their religion, community, gender, and other characteristics. The Others (O) category encompasses untargeted offensive posts and other posts containing offensive language. In this context, we propose using a phrase-based, Explainable Hierarchical Attention model for hate speech detection. The results demonstrate that the proposed method is more effective in identifying and explaining hate speech and offensive language in social media posts. © 2013 IEEE.