Conference Papers

Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506

  • Item
    Conversational Hate-Offensive detection in Code-Mixed Hindi-English Tweets
    (CEUR-WS, 2021) Rajalakshmi, R.; Srivarshan, S.; Mattins, F.; Kaarthik, E.; Seshadri, P.; Anand Kumar, M.
    Hate speech on social media has increased with the growing use of online forums for sharing opinions. In particular, people often prefer to express their views in their native language when posting such objectionable content on social media platforms. Building an automated system to identify hate and offensive tweets in many regional languages is challenging because of their linguistic richness. The problem has recently become more complicated because of multilingual and code-mixed tweets. Code-mixed data mixes two languages at a granular level, and may contain words that belong to neither language. To address these challenges in Hindi-English tweets, we propose an efficient method that combines IndicBERT with an effective ensemble-based method. We applied different methodologies to accurately classify whether a given tweet in a code-mixed Hinglish dataset is Hate Speech or Not. Three models, namely IndicBERT, XLM-RoBERTa and Masked LM, were used to embed the tweet data. Various classification methods, such as Logistic Regression, Support Vector Machine, Ensembling and Neural Network based methods, were then applied to perform classification. In extensive experiments on the dataset, embedding the code-mixed data with IndicBERT and classifying with Ensembling was the best method, with a macro F1-score of 62.53%. This work was submitted to the shared task of the HASOC 2021 [1] [2] Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages competition by team TNLP. © 2021 Copyright for this paper by its authors.
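    The voting step of an ensemble like the one mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the ensemble combines its members, so hard majority voting is assumed, and the function name `majority_vote` and the labels `HATE`/`NOT` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine label lists from several classifiers by hard voting.

    predictions[i][j] is the label classifier i assigns to tweet j.
    Ties go to the label voted by the earliest classifier (an assumption).
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all classifiers must label the same tweets")

    combined = []
    for votes in zip(*predictions):          # votes for one tweet
        counts = Counter(votes)
        top = max(counts.values())
        # First vote (in classifier order) that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical outputs of three classifiers on three tweets.
logreg = ["HATE", "NOT", "HATE"]
svm = ["HATE", "HATE", "NOT"]
neural = ["NOT", "NOT", "NOT"]
print(majority_vote([logreg, svm, neural]))
```

    Hard voting needs only predicted labels; averaging class probabilities (soft voting) is a common alternative when every member can output probabilities.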
  • Item
    Hate Speech and Offensive Content Identification in Hindi and Marathi Language Tweets using Ensemble Techniques
    (CEUR-WS, 2021) Rajalakshmi, R.; Mattins, F.; Srivarshan, S.; Reddy, L.P.; Anand Kumar, M.
    Hate Speech is described as any form of speech in which speakers attempt to ridicule, humiliate, or inculcate hatred in someone else's mind based on characteristics such as religion, the colour of skin, race, or sexual preference. In recent years, social networking sites have been a major source of excessive amounts of hate speech. If unaddressed, this might cause anxiety and despair in the affected individuals or groups. As a result, social networks use an assortment of algorithms to identify such hate speech. Detecting Hate Speech in English texts has been one of the hottest topics in recent years, with many studies published. However, in regional and indigenous languages, hate speech detection is a recent area with little research conducted so far. Hate speech detection on regional-language data is difficult because of a lack of sufficiently large training data and a lack of resources for that domain. The HASOC [1] 2021 Hate Speech Detection Task addresses one of these problems. It provides a dataset containing Tweet data in English, Hindi [2] and Marathi [3]. The main task comprised two subtasks. One subtask was to classify the hate speech and offensive texts in the Hindi and Marathi tweet datasets as Hate Speech (HATE), Offensive (OFFN) or Profane (PRF). This work compares the performance of different models on both subtasks and identifies the best performing model. The Random Forest classifier performs best on the first subtask, with macro F1 scores of 75.19% and 73.12% on the Marathi and Hindi tweet datasets, respectively. The XGBoost algorithm performs best on the second subtask, with a 46.5% macro F1 score. Overall, any of these models can achieve satisfactory results in hate speech detection for regional languages. This work has been submitted to the FIRE2021 shared task, Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC-2021) by team DLRG. © 2021 Copyright for this paper by its authors.
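    Macro F1, the metric reported above, is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. A minimal sketch in plain Python follows; the function name `macro_f1` and the toy labels are illustrative assumptions, and the shared task's official scorer may differ in details.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("no labels given")

    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with the three class names used in the abstract.
gold = ["HATE", "HATE", "OFFN", "PRF"]
pred = ["HATE", "OFFN", "OFFN", "PRF"]
print(round(macro_f1(gold, pred), 4))
```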
  • Item
    Findings of the Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
    (Association for Computational Linguistics (ACL), 2022) Ravikiran, M.; Chakravarthi, B.R.; Anand Kumar, M.; Sangeetha, S.; Rajalakshmi, R.; Thavareesan, S.; Ponnusamy, R.; Mahadevan, S.
    Offensive content moderation is vital on social media platforms to support healthy online discussions. However, its use in code-mixed Dravidian languages is limited to classifying whole comments, without identifying the parts that contribute to offensiveness. This limitation is primarily due to the lack of annotated data for offensive spans. Accordingly, in this shared task, we provide Tamil-English code-mixed social comments with offensive spans. This paper outlines the released dataset, the methods, and the results of the submitted systems. © 2022 Association for Computational Linguistics.
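    To make the difference between comment-level and span-level labels concrete, the sketch below groups character offsets into contiguous spans. The annotation format is an assumption (the abstract does not describe it), and the function name `offsets_to_spans` and the placeholder comment are hypothetical.

```python
from typing import List, Tuple


def offsets_to_spans(offsets: List[int]) -> List[Tuple[int, int]]:
    """Group character offsets into half-open (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    for pos in sorted(set(offsets)):
        if spans and pos == spans[-1][1]:
            # Offset continues the current span: extend its end.
            spans[-1] = (spans[-1][0], pos + 1)
        else:
            # Gap before this offset: start a new span.
            spans.append((pos, pos + 1))
    return spans


# Placeholder comment; "BADWORD" stands in for an offensive segment.
comment = "xx BADWORD yy"
marked = list(range(3, 10))          # assumed character-level annotation
for start, end in offsets_to_spans(marked):
    print(start, end, comment[start:end])
```

    A comment-level classifier would label the whole string, whereas a span-level system returns only the marked range.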
  • Item
    Context Sensitive Tamil Language Spellchecker Using RoBERTa
    (Springer Science and Business Media Deutschland GmbH, 2023) Rajalakshmi, R.; Sharma, V.; Anand Kumar, M.
    A spellchecker is a tool that identifies spelling errors in a piece of text and lists possible suggestions for each misspelled word. Many spellcheckers are available for languages such as English, but only a limited number of spell-checking tools exist for low-resource languages like Tamil. In this paper, we present an approach to develop a Tamil spellchecker using the RoBERTa (xlm-roberta-base) model. We also propose an algorithm to generate the test dataset by introducing errors into a piece of text. The spellchecker detects mistakes in a given text using a corpus of unique Tamil words collected from different sources, such as Wikipedia and Tamil conversations, and uses the proposed model to list suggestions that could be contextual replacements for the misspelled word. When a few errors were introduced into a piece of text collected from a Wikipedia article and the text was tested on our model, an accuracy of 91.14% was achieved for error detection. Contextually correct words were then suggested for the detected erroneous words. Our spellchecker performed better than some of the existing Tamil spellcheckers, with both higher accuracy and fewer false positives. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
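    The two steps described above, injecting errors to build a test set and flagging words that are missing from a word list, can be sketched as follows. This is a simplified illustration under stated assumptions: the paper's actual error-generation algorithm is not given in the abstract, so a single random character deletion is used instead, and the function names and the tiny vocabulary are hypothetical. The RoBERTa-based suggestion step is not shown.

```python
import random
from typing import List, Set, Tuple


def inject_errors(words: List[str], rate: float, seed: int = 0) -> Tuple[List[str], List[int]]:
    """Delete one character from roughly `rate` of the words.

    Returns the corrupted words and the indices that were changed.
    Deletion works on Unicode code points, which may split a Tamil
    grapheme; good enough for a sketch.
    """
    rng = random.Random(seed)
    corrupted: List[str] = []
    changed: List[int] = []
    for i, word in enumerate(words):
        if len(word) > 1 and rng.random() < rate:
            k = rng.randrange(len(word))
            corrupted.append(word[:k] + word[k + 1:])
            changed.append(i)
        else:
            corrupted.append(word)
    return corrupted, changed


def detect_errors(words: List[str], vocabulary: Set[str]) -> List[int]:
    """Flag the index of every word that is absent from the vocabulary."""
    return [i for i, word in enumerate(words) if word not in vocabulary]


# Tiny hypothetical vocabulary and sentence.
vocabulary = {"தமிழ்", "மொழி"}
sentence = ["தமிழ்", "மொழி"]
noisy, changed = inject_errors(sentence, rate=0.5, seed=1)
print(changed, detect_errors(noisy, vocabulary))
```

    Comparing the flagged indices with the injected ones gives detection accuracy; a corrupted word that happens to be another valid word will go undetected by lookup alone, which is one reason a context-aware model is useful.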
  • Item
    Tamil NLP Technologies: Challenges, State of the Art, Trends and Future Scope
    (Springer Science and Business Media Deutschland GmbH, 2023) Rajendran, S.; Anand Kumar, M.; Rajalakshmi, R.; Dhanalakshmi, V.; Balasubramanian, P.; Padannayil, K.P.
    This paper aims to summarize the NLP-based technological development of the Tamil language. Tamil is one of the Dravidian languages that are serious about technological development. This is reflected in its activities in developing language technology tools and the resources made for technological development. Tamil has successfully developed tools or systems for speech synthesis and recognition, analysis of grammar, semantics and social media text, along with machine translation. Many types of research have been undertaken towards this achievement. Similarly, many activities are developing resources to facilitate technological development. These activities include preparing monolingual, parallel and lexical text corpora, along with speech corpora, lexical resources and grammars. What is needed now is to take stock of the achievements made so far, find out where Tamil stands in the arena of technological development, and look forward to its rapid technological development. Computational linguistics in Tamil NLP is gaining more traction, and various datasets available for research are highlighted in this work for further exploration. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
  • Item
    Findings of the Second Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
    (Incoma Ltd, 2023) Ravikiran, M.; Ganesh, A.; Anand Kumar, M.; Rajalakshmi, R.; Chakravarthi, B.R.
    Maintaining effective control over offensive content is essential on social media platforms to foster constructive online discussions. Yet, when it comes to code-mixed Dravidian languages, the current prevalence of offensive content moderation is restricted to categorizing entire comments, failing to identify specific portions that contribute to the offensiveness. Such limitation is primarily due to the lack of annotated data and open source systems for offensive spans. To alleviate this issue, in this shared task, we offer a collection of Tamil-English code-mixed social comments that include offensive comments. This paper provides an overview of the released dataset, the algorithms employed, and the outcomes achieved by the systems submitted for this task. © DravidianLangTech 2023 - 3rd Workshop on Speech and Language Technologies for Dravidian Languages, associated with 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023 - Proceedings.
  • Item
    Overview of Shared Task on Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes
    (Association for Computational Linguistics (ACL), 2024) Chakravarthi, B.; Rajiakodi, S.; Ponnusamy, R.; Pannerselvam, K.; Anand Kumar, M.A.; Rajalakshmi, R.; LekshmiAmmal, H.R.; Kizhakkeparambil, A.; Kumar, S.S.; Sivagnanam, B.; Rajkumar, C.
    This paper offers a detailed overview of the first shared task on "Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes," organized as part of the LT-EDI@EACL 2024 conference. The task was to classify misogynistic content and troll memes on online platforms, focusing specifically on memes in Tamil and Malayalam. A total of 52 teams registered for the competition, with four submitting systems for the Tamil meme classification task and three for the Malayalam task. The outcomes of this shared task are significant, providing insights into the current state of misogynistic content in digital memes and highlighting the effectiveness of various computational approaches in identifying such detrimental content. The top-performing model achieved a macro F1 score of 0.73 in Tamil and 0.87 in Malayalam. © 2024 Association for Computational Linguistics.
  • Item
    Findings of the First Shared Task on Offensive Span Identification from Code-Mixed Kannada-English Comments
    (Association for Computational Linguistics (ACL), 2024) Ravikiran, M.; Rajalakshmi, R.; Chakravarthi, B.; Anand Kumar, M.A.; Thavareesan, S.
    Effectively managing offensive content is crucial on social media platforms to encourage positive online interactions. However, addressing offensive content in code-mixed Dravidian languages is challenging, as current moderation methods focus on flagging entire comments rather than pinpointing specific offensive segments. This limitation stems from a lack of annotated data and accessible systems designed to identify offensive language sections. To address this, our shared task presents a dataset comprising Kannada-English code-mixed social comments, encompassing offensive comments. This paper outlines the dataset, the algorithms used, and the results obtained by systems participating in this shared task. © 2024 Association for Computational Linguistics.
  • Item
    Sarcasm Detection in Tamil Code-Mixed Data Using Transformers
    (Springer Science and Business Media Deutschland GmbH, 2024) Rajalakshmi, R.; Joshua, R.G.; Varsini, S.R.; Anand Kumar, M.
    Social media analytics has been gaining popularity due to the extensive amount of customer data it offers, benefiting businesses of all sizes, from local ventures to global brands. Analysing textual content aids context understanding and also enables content moderation to maintain a positive user experience. Sarcasm detection in social media is essential to maintain constructive and respectful online communication, preventing misunderstandings, minimizing conflicts, and fostering a positive and inclusive digital environment. We propose a Transformer-based model for sarcasm detection in Tamil code-mixed text. The model consists of two custom-designed layers: an Encoder layer and an Embedding layer. It incorporates a multi-head self-attention layer and feed-forward neural networks, followed by normalisation and dropout layers. The proposed model outperformed other state-of-the-art models for sarcasm detection, achieving a weighted F1 score of 0.77. The model effectively addressed the unique challenges posed by Tamil code-mixed text. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
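    The self-attention operation at the core of such an encoder can be sketched in plain Python. This is a simplified, single-head version with identity query/key/value projections and no feed-forward, normalisation or dropout layers; it illustrates scaled dot-product attention in general and is not the authors' model. The names `softmax` and `self_attention` are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x.

    x holds one embedding vector per token; the output has the same shape.
    """
    dim = len(x[0])
    output: Matrix = []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(dim)])
    return output


# Two toy token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0]]
for row in self_attention(tokens):
    print([round(v, 3) for v in row])
```

    A multi-head layer runs several such operations in parallel on learned projections of the input and concatenates the results.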