Conference Papers
Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506
4 results
Search Results
Item
KCe_Dalab@maponsms-Fire2018: Effective word and character-based features for multilingual author profiling (CEUR-WS ceurws@sunsite.informatik.rwth-aachen.de, 2018)
Sharmila Devi, V.; Subramanian, S.; Ravikumar, G.; Anand Kumar, M.
This paper describes our work on identifying gender and age group in the Multilingual Author Profiling on SMS messages (MAPonSMS) shared task conducted at the Forum for Information Retrieval and Evaluation (FIRE 2018). To support development of multilingual author profiling systems, the organizers released a training corpus of multilingual (Roman Urdu and English) SMS messages with their corresponding profiles. For gender identification, a profile is either male or female. The author's age group falls into one of three categories: 15-19, 20-24, 25-xx. We developed our author profiling system using word- and character-based Term Frequency-Inverse Document Frequency (TF-IDF) features, classified with a Support Vector Machine. The proposed system achieved state-of-the-art performance in the multilingual author profiling on SMS task. The accuracy obtained is 65% for age-group identification and 87% for gender identification. When the two are evaluated jointly, the accuracy is 57%. We also experimented with different parameter settings and report the cross-validation accuracy. © 2018 CEUR-WS. All Rights Reserved.

Item
Conversational Hate-Offensive detection in Code-Mixed Hindi-English Tweets (CEUR-WS, 2021)
Rajalakshmi, R.; Srivarshan, S.; Mattins, F.; Kaarthik, E.; Seshadri, P.; Anand Kumar, M.
Hate speech on social media has increased with the growing use of online forums for sharing opinions. In particular, people often prefer to express their views in their native language when posting such objectionable content on social media platforms.
It is challenging to build an automated system that identifies such hate and offensive tweets in many regional languages because of their linguistic richness. Recently, the problem has become more complicated with the use of multilingual and code-mixed tweets. Code-mixed data mixes two languages at a granular level and may contain words that belong to neither language. To address these challenges in Hindi-English tweets, we propose an efficient method that combines IndicBERT with an effective ensemble-based method. We applied different methodologies to classify accurately whether a given tweet in a code-mixed Hinglish dataset is hate speech or not. Three models, namely IndicBERT, XLM-RoBERTa and Masked LM, were used to embed the tweet data. Various classification methods, such as Logistic Regression, Support Vector Machine, ensembling and neural-network-based methods, were then applied. In extensive experiments on the dataset, embedding the code-mixed data with IndicBERT and classifying with ensembling was found to be the best method, yielding a macro F1-score of 62.53%. This work was submitted by team TNLP to the HASOC 2021 [1] [2] shared task (Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages). © 2021 Copyright for this paper by its authors.

Item
Hate Speech Detection Using Audio in Portuguese Language (Springer Science and Business Media Deutschland GmbH, 2024)
Tembe, L.A.; Anand Kumar, M.
This study focuses on hate speech in the Portuguese language using audio and introduces a novel methodology that integrates audio-to-text and self-image technologies to tackle this problem. We use Machine Learning and Deep Learning models to differentiate between hate speech and normal speech. The research used a total of 200 datasets, which were categorized into hate speech and normal speech.
These datasets were collected by me personally for this project. Four distinct models are presented in the analysis: LSTM, SVM, CNN, and Random Forest. The findings highlight the superior performance of the CNN model when applied to spectrogram data, achieving an accuracy rate of 90%. Conversely, the Random Forest model outperforms the others when dealing with text data, achieving an accuracy rate of 73.1%. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.

Item
Measuring the Severity of the Signs of Eating Disorders Using Machine Learning Techniques (CEUR-WS, 2024)
Prasanna, S.; Gulati, A.; Karmakar, S.; Hiranmayi, M.Y.; Anand Kumar, M.
The paper presents the results submitted by Team SCaLAR-NITK for Task 3 of the eRisk Lab at CLEF 2024 [1]. The dataset provided by the task organizers consisted of 74 subjects for training and 18 for testing. We begin by describing the data cleaning and preprocessing steps. Subsequently, we outline various approaches used to address the problem, such as Word2Vec, TF-IDF, Backtranslation and Dimensionality Reduction, among others. Finally, we summarize the results obtained from each approach. Our solutions demonstrated strong performance, achieving the best results in 7 out of the 8 evaluated metrics. © 2024 Copyright for this paper by its authors.
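
Two of the items above mention TF-IDF features: word- and character-based in the first item, and TF-IDF among other approaches in the last. As a rough illustration only, the sketch below computes word-unigram and character-trigram TF-IDF weights in plain Python and merges them into one feature dictionary per message. It is not the authors' implementation: the function names, the trigram size, the smoothed IDF variant, and the sample messages are all assumptions made for the example, and the classifier step (an SVM in the first item) is left out.

```python
import math
from collections import Counter


def word_tokens(text):
    """Lowercased, whitespace-separated word unigrams."""
    return text.lower().split()


def char_ngrams(text, n=3):
    """Overlapping lowercased character n-grams (spaces included)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf(docs, tokenize):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term count times a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # count each term once per document
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def combined_features(docs, n=3):
    """Merge word-level and character-level TF-IDF into one dict per doc."""
    word_vecs = tfidf(docs, word_tokens)
    char_vecs = tfidf(docs, lambda d: char_ngrams(d, n))
    combined = []
    for w, c in zip(word_vecs, char_vecs):
        feats = {("word", t): v for t, v in w.items()}
        feats.update({("char", t): v for t, v in c.items()})
        combined.append(feats)
    return combined


# Made-up sample messages, not taken from any shared-task corpus.
messages = ["kal milte hain", "see you tomorrow", "kal see you"]
features = combined_features(messages)
```

Each entry of `features` is a sparse mapping keyed by `("word", term)` or `("char", ngram)`. Terms that occur in fewer messages receive a larger weight, and a mapping of this kind could be vectorized and passed to any standard classifier.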
