Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
7 results
Search Results
Item Advancing Human-Like Summarization: Approaches to Text Summarization (CEUR-WS, 2023) Gowhar, S.; Sharma, B.; Gupta, A.K.; Anand Kumar, A.K.
Text summarization, a well-explored domain within Natural Language Processing, has witnessed significant progress. The ILSUM shared task, encompassing various languages such as English, Hindi, Gujarati, and Bengali, concentrates on text summarization. The proposed research focuses on leveraging pretrained sequence-to-sequence models for abstractive summarization, specifically in the context of the English language. This paper provides an extensive exposition of our model and approach. Notably, we achieved the top ranking in the English Language subtask. Furthermore, this paper analyzes various techniques for extractive summarization, presenting their outcomes and drawing comparisons with abstractive summarization. © 2023 Copyright for this paper by its authors.

Item Prediction of High-Resolution Atmospheric CO2 Concentration from OCO-2 using Machine Learning (Association for Computing Machinery, 2023) Pais, S.M.; Bhattacharjee, S.; Anand Kumar, A.K.
Carbon dioxide (CO2) is a greenhouse gas (GHG) emitted by anthropogenic activities. The satellite measurement of atmospheric column-averaged CO2 concentration (XCO2) provides an excellent opportunity to understand the global carbon cycle over a large, comprehensive temporal range. The Orbiting Carbon Observatory-2 (OCO-2) satellite provides highly accurate data with a spatial resolution of approximately 3 km². However, OCO-2 measures any one location on the Earth's surface only about once a fortnight, and clouds and aerosols cause missing data. In this work, the OCO-2 measurements, along with the Open-Source Data Inventory for Anthropogenic CO2 (ODIAC) emission estimate, are considered. Spatial upscaling followed by different machine learning methods is used to predict a high-resolution, continuous mapping of XCO2.
The prediction models are evaluated using the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) for Germany, considering a temporal range of November 2018 to December 2019. The lowest error is attained by the monthly model, which achieves an MAE and RMSE of 0.707 ppm and 1.187 ppm, respectively, using the extremely randomized trees (ERT) method. The predictions are also externally validated using Total Carbon Column Observing Network (TCCON) ground-based measurements. © 2023 ACM.

Item HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced STemming (Academic Press, 2023) Rajalakshmi, R.; Selvaraj, S.; Faerie Mattins, R.; Vasudevan, P.; Anand Kumar, A.K.
Offensive content or hate speech is defined as any form of communication that aims to annoy, harass, disturb, or anger an individual or community based on factors such as faith, ethnicity, appearance, or sexuality. Nowadays, offensive content posted in regional languages has increased owing to the popularity of social networks and other apps among the general public. This work proposes a method to detect and identify hate speech or offensive content in Tamil. We have used the HASOC 2021 data set, which contains YouTube comments in the Tamil language written in Tamil script. In this research work, an attempt is made to find suitable embedding techniques for Tamil text representation by applying TF-IDF and pre-trained transformer models such as BERT, XLM-RoBERTa, IndicBERT, mBERT, TaMillion, and MuRIL. As Tamil is a morphologically rich language, a detailed analysis is made to study the performance of hate speech detection in Tamil when enhanced stemming algorithms are applied. An extensive experimental study was performed with different classifiers such as logistic regression, SVM, stochastic gradient descent, decision tree, and ensemble learning models in combination with the above techniques.
The results of this detailed experimental study show that stop word removal produces mixed results and does not guarantee improvement in the performance of the classifier in detecting offensive content in Tamil data. However, performance on stemmed data shows a significant improvement over unstemmed data in Tamil texts. As the data is highly imbalanced, we also applied an oversampling/downsampling technique to analyze its role in designing the best offensive-content classifier for Tamil text. The highest performance was achieved by a combination of stemming the text data, embedding it with the multilingual model MuRIL, and using a majority voting ensemble as the downstream classifier. We achieved an F1-score of 84% and an accuracy of 86% for detecting offensive content in Tamil YouTube comments. © 2022 Elsevier Ltd

Item Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages (Springer Science and Business Media Deutschland GmbH, 2023) Anand Kumar, A.K.; Padannayil, S.K.
Massive amounts of unstructured content are generated every day on social media platforms such as Facebook, Twitter and blogs. Analyzing and extracting useful information from this vast amount of text content is a challenging process. Social media currently provide extensive opportunities for researchers and practitioners to do research in this area. Most of the text content on social media tends to be either in English or in code-mixed regional languages. In a multilingual country like India, code-mixing is common in social media discussions. Multilingual users frequently use Roman script, a convenient mode of expression, instead of the regional language script for posting messages on social media, and often mix English into their native languages. Stylistic and grammatical irregularities are significant challenges in processing code-mixed text using conventional methods.
This paper presents a new word embedding based on character-level representations, used as features for POS tagging of code-mixed text in Indian languages, using the ICON-2015 and ICON-2016 NLP tools contest data sets. The proposed word embedding features are context-appended, and the well-known Support Vector Machine (SVM) classifier has been used to train the system. We have combined the Facebook, Twitter, and WhatsApp code-mixed data of three Indian languages to train a transfer-learning-based, language-independent and source-independent POS tagging system. The experimental results demonstrated that the proposed transfer method achieved state-of-the-art accuracy in 12 out of 18 systems for the ICON data set. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Item Relation Extraction: Hypernymy Discovery Using a Novel Pattern Learning Algorithm (Springer, 2023) Pinto, O.; Gole, S.; Srushti, H.P.; Anand Kumar, A.K.
This paper proposes a semi-supervised relation extraction methodology to extract hypernymy (Is-A) relations. We developed a pattern-learning model based on a "most reliable pattern". After each iteration, the algorithm generates trusted instances of hypernym–hyponym pairs using only a corpus of text and a set of seed instances as the input. Sentences are masked and extracted, and patterns are discovered and ranked. A pattern-matching algorithm generates pairs, and a scoring function filters them. The generated pairs are added to the initial seed set via a bootstrapping approach so that the next iteration of the algorithm can generate a new trusted pair set. The work presented here is a semi-supervised approach, and for the experiments we use two freely available public Wikipedia text corpora to extract hypernyms.
We use Hearst patterns, an extended version of Hearst patterns (adding more patterns), and a dependency-based approach as baselines for comparison with our pattern learning approach. To evaluate the proposed algorithm, the hypernym–hyponym relations obtained are tested against five standard publicly available datasets: BLESS, WBLESS, WEEDS, EVAL, and LEDS. The results on the two Wikipedia text corpora and five evaluation datasets show that the pattern learning approach performs better than the three baseline algorithms. The lack of heavy skew in results across the two datasets also indicates that the algorithms implemented are independent of the corpus used and can be applied to any large corpus. © 2023, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.

Item Overlapping word removal is all you need: revisiting data imbalance in hope speech detection (Taylor and Francis Ltd., 2024) RamakrishnaIyer LekshmiAmmal, H.; Ravikiran, M.; Nisha, G.; Balamuralidhar, N.; Madhusoodanan, A.; Anand Kumar, A.K.; Chakravarthi, B.R.
Hope speech detection is a new task for finding and highlighting positive comments or supporting content from user-generated social media comments. For this task, we have used a Shared Task multilingual dataset on Hope Speech Detection for Equality, Diversity, and Inclusion (HopeEDI) for three languages: English, code-switched Tamil, and Malayalam. In this paper, we present deep learning techniques using context-aware string embeddings for word representations and Recurrent Neural Network (RNN) and pooled document embeddings for text representation. We have evaluated and compared the three models for each language with different approaches. Our proposed methodology achieved higher performance than the baselines. The highest weighted average F-scores of 0.93, 0.58, and 0.84 are obtained on the task organisers' final evaluation test set.
The proposed models outperform the baselines by 3%, 2% and 11% in absolute terms for English, Tamil and Malayalam, respectively. © 2023 Informa UK Limited, trading as Taylor & Francis Group.

Item The Effect of Phrase Vector Embedding in Explainable Hierarchical Attention-Based Tamil Code-Mixed Hate Speech and Intent Detection (Institute of Electrical and Electronics Engineers Inc., 2024) Sharmila Devi, V.S.; Subramanian, S.; Anand Kumar, A.K.
The substantial growth in social media users has led to a significant increase in code-mixed content on social media platforms. Millions of users on these platforms upload pictures and videos and post comments regarding their recent or exciting activities. In responding to this uploaded content, a few users occasionally use offensive language to insult other individuals or specific groups. Social media platforms encounter challenges in identifying and removing hate speech and objectionable content in various languages. Hate speech, in its general sense, refers to harmful posts directed at individuals or groups based on factors such as their sexuality, religion, community affiliation, disability, and others. Typically, offensive language is used directly or indirectly in hate speech posts to insult someone, causing psychological distress to users. In light of this, we propose developing a system to automatically block, remove, or report posts written in code-mixed Tamil that contain hate speech. We have gathered code-mixed Tamil comments from Twitter and the Helo App, categorizing them as hate speech and classifying their intent. We have identified three categories of hate speech intent, namely Targeted Individual (TI), Targeted Group (TG), and Others (O). The Targeted Individual (TI) class encompasses posts aimed at a specific individual, while the Targeted Group (TG) category primarily covers posts targeting people based on their religion, community, gender, and other characteristics.
The Others (O) category encompasses untargeted offensive posts and other posts containing offensive language. In this context, we propose using a phrase-based, Explainable Hierarchical Attention model for hate speech detection. The results demonstrate that the proposed method is more effective in identifying and explaining hate speech and offensive language in social media posts. © 2013 IEEE.
