Conference Papers

Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506

  • Item
    Overview of the track on HASOC-offensive Language Identification-DravidianCodeMix
    (CEUR-WS, 2020) Chakravarthi, B.R.; Anand Kumar, M.; Mccrae, J.P.; Premjith, B.; Padannayil, K.P.; Mandl, T.
    We present the results and main findings of the HASOC-Offensive Language Identification track on code-mixed Dravidian languages. The track featured two tasks. Task 1 is about offensive language identification in Malayalam, where the comments were written in both native script and Latin script. Task 2 is about offensive language identification in Tamil and Malayalam, where the comments were written in Latin script (non-native script). For both tasks, given a comment, participants had to develop a system to classify the text as offensive or not-offensive. In total, 96 participants took part and 12 submitted papers. In this paper, we present the task, the data, and the results, and discuss the submitted systems and the methods used by participants. © 2020 Copyright for this paper by its authors.
  • Item
    Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German
    (Association for Computing Machinery, 2020) Mandl, T.; Modha, S.; Anand Kumar, M.; Chakravarthi, B.R.
    This paper presents the HASOC track and its two parts. HASOC is dedicated to evaluating technology for finding Offensive Language and Hate Speech. HASOC is creating test collections for languages with few resources, and for English for comparison. The first track within HASOC continued work from 2019 and provided a testbed of Twitter posts for Hindi, German and English. The second track within HASOC created test resources for Tamil and Malayalam in native and Latin script. Posts were extracted mainly from YouTube and Twitter. Both tracks attracted much interest, and over 40 research groups participated and described their approaches in papers. In this overview, we present the tasks, the data and the main results. © 2020 ACM.
  • Item
    Findings of the Shared Task on Machine Translation in Dravidian languages
    (Association for Computational Linguistics (ACL), 2021) Chakravarthi, B.R.; Priyadharshini, R.; Banerjee, S.; Saldanha, R.; Mccrae, J.P.; Anand Kumar, M.; Krishnamurthy, P.; Johnson, M.
    This paper presents an overview of the shared task on machine translation of Dravidian languages. We presented the shared task results at the EACL 2021 workshop on Speech and Language Technologies for Dravidian Languages. This paper describes the datasets used, the methodology used for the evaluation of participants, and the experiments’ overall results. As a part of this shared task, we organized four sub-tasks corresponding to machine translation of the following language pairs: English to Tamil, English to Malayalam, English to Telugu, and Tamil to Telugu. The sub-tasks are available at https://competitions.codalab.org/competitions/27650. We provided the participants with training and development datasets to perform experiments, and the results were evaluated on unseen test data. In total, 46 research groups participated in the shared task and 7 experimental runs were submitted for evaluation. We used BLEU scores for assessment of the translations. © 2021 Association for Computational Linguistics.
  • Item
    Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
    (Association for Computational Linguistics (ACL), 2021) Chakravarthi, B.R.; Priyadharshini, R.; Jose, N.; Anand Kumar, M.; Mandl, T.; Kumaresan, P.K.; Ponnusamy, R.; LekshmiAmmal, R.L.; Mccrae, J.P.; Sherly, E.
    Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification for under-resourced languages like Tamil, Malayalam and Kannada is of essential importance. As user-generated content is often code-mixed and not well studied for under-resourced languages, it is imperative to create resources and conduct benchmark studies to encourage research in under-resourced Dravidian languages. We created a shared task on offensive language detection in Dravidian languages. We summarize the datasets for this challenge, which are openly available at https://competitions.codalab.org/competitions/27654, and present an overview of the methods and the results of the competing systems. © 2021 Association for Computational Linguistics.
  • Item
    Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam
    (CEUR-WS, 2021) Chakravarthi, B.R.; Kumaresan, P.K.; Sakuntharaj, R.; Anand Kumar, M.; Thavareesan, S.; Premjith, B.; Sreelakshmi, K.; Subalalitha, S.C.; Mccrae, J.P.; Mandl, T.
    In this paper, we present the results of the HASOC-Dravidian-CodeMix shared task held at FIRE 2021, a track on offensive language identification for Dravidian languages in code-mixed text. This paper details the task, its organisation, and the submitted systems. The identification of offensive language was viewed as a classification task. For this, 16 teams participated in identifying offensive language from Tamil-English code-mixed data, 11 teams for Malayalam-English code-mixed data and 14 teams for Tamil data. The teams detected offensive language using various machine learning and deep learning classification models. This paper analyses those benchmark systems to find out how well they accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam. © 2021 Copyright for this paper by its authors.
  • Item
    Findings of Shared Task on Offensive Language Identification in Tamil and Malayalam
    (Association for Computing Machinery, 2021) Kumaresan, P.K.; Premjith; Sakuntharaj, R.; Thavareesan, S.; Subalalitha, S.; Anand Kumar, M.; Chakravarthi, B.R.; Mccrae, J.P.
    In this paper, we present the results of the HASOC-Dravidian-CodeMix shared task held at FIRE 2021, a track on offensive language identification for Dravidian languages in code-mixed text. This paper details the task, its organisation, and the submitted systems. The identification of offensive language was viewed as a classification task. For this, 16 teams participated in identifying offensive language from Tamil-English code-mixed data, 11 teams for Malayalam-English code-mixed data and 14 teams for Tamil data. The teams detected offensive language using various machine learning and deep learning classification models. This paper analyses those benchmark systems to find out how well they accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam. © 2021 Owner/Author.
  • Item
    Findings of the Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
    (Association for Computational Linguistics (ACL), 2022) Ravikiran, M.; Chakravarthi, B.R.; Anand Kumar, M.; Sangeetha, S.; Rajalakshmi, R.; Thavareesan, S.; Ponnusamy, R.; Mahadevan, S.
    Offensive content moderation is vital on social media platforms to support healthy online discussions. However, for code-mixed Dravidian languages, existing work is limited to classifying whole comments without identifying the parts that contribute to offensiveness. This limitation is primarily due to the lack of annotated data for offensive spans. Accordingly, in this shared task, we provide Tamil-English code-mixed social comments with offensive spans. This paper outlines the released dataset, the methods, and the results of the submitted systems. © 2022 Association for Computational Linguistics.
  • Item
    Overview of the Shared Task on Machine Translation in Dravidian Languages
    (Association for Computational Linguistics (ACL), 2022) Anand Kumar, A.M.; Hegde, A.; Banerjee, S.; Chakravarthi, B.R.; Priyadarshini, R.; Shashirekha, H.L.; Mccrae, J.P.
    This paper presents an outline of the shared task on translation of under-resourced Dravidian languages at the DravidianLangTech-2022 workshop, to be held jointly with ACL 2022. It describes the datasets used, the approach taken for analysis of submissions, and the results. The five sub-tasks organized as a part of the shared task cover the following translation pairs: Kannada to Tamil, Kannada to Telugu, Kannada to Sanskrit, Kannada to Malayalam and Kannada to Tulu. Training, development and test datasets were provided to all participants, and results were evaluated on the gold standard datasets. A total of 16 research groups participated in the shared task and a total of 12 submission runs were made for evaluation. The Bilingual Evaluation Understudy (BLEU) score was used for evaluation of the translations. © 2022 Association for Computational Linguistics.
  • Item
    A Study of Machine Translation Models for Kannada-Tulu
    (Springer Science and Business Media Deutschland GmbH, 2023) Hegde, A.; Shashirekha, H.L.; Anand Kumar, M.; Chakravarthi, B.R.
    Over the past ten years, neural machine translation (NMT) has seen tremendous growth and is now entering a phase of maturity. Despite being the most popular solution for machine translation (MT), it performs sub-optimally on under-resourced language pairs due to the lack of parallel corpora as compared to high-resourced language pairs. The implementation of NMT techniques for under-resourced language pairs is receiving the attention of researchers and has resulted in a significant amount of research for many under-resourced language pairs. In view of the growth of MT, this paper describes a set of practical approaches for investigating MT between Kannada and Tulu. These two languages belong to the family of Dravidian languages and are under-resourced due to the lack of tools and resources, particularly parallel corpora for MT. Since there are no parallel corpora for the Kannada-Tulu language pair for MT, this work aims to construct a parallel corpus for this language pair. As manual construction of a parallel corpus is laborious, data augmentation is introduced to enhance the size of the parallel corpus along with suitable preprocessing techniques. Different NMT schemes such as recurrent neural network (RNN) baseline, bidirectional recurrent neural network (BiRNN), transformer-based NMT with and without subword tokenization, and statistical machine translation (SMT) models are implemented for MT of Kannada-Tulu and Tulu-Kannada language pairs. Empirical results reveal that data augmentation increases the bilingual evaluation understudy (BLEU) scores of the proposed models. Transformer-based models with subword tokenization outperformed the other models, with BLEU scores of 41.82 and 40.91 for Kannada-Tulu and Tulu-Kannada MT, respectively. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
  • Item
    Findings of the Second Shared Task on Offensive Span Identification from Code-Mixed Tamil-English Comments
    (Incoma Ltd, 2023) Ravikiran, M.; Ganesh, A.; Anand Kumar, M.; Rajalakshmi, R.; Chakravarthi, B.R.
    Maintaining effective control over offensive content is essential on social media platforms to foster constructive online discussions. Yet, when it comes to code-mixed Dravidian languages, current offensive content moderation is restricted to categorizing entire comments, failing to identify the specific portions that contribute to the offensiveness. This limitation is primarily due to the lack of annotated data and open source systems for offensive spans. To alleviate this issue, in this shared task, we offer a collection of Tamil-English code-mixed social comments that include offensive comments. This paper provides an overview of the released dataset, the algorithms employed, and the outcomes achieved by the systems submitted for this task. © DravidianLangTech 2023 - 3rd Workshop on Speech and Language Technologies for Dravidian Languages, associated with 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023 - Proceedings.