Journal Articles
Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/19884
9 results
Search Results
Item A review on NLP zero-shot and few-shot learning: methods and applications (Springer Nature, 2025)
Ramesh, G.; Sahil, M.; Palan, S.A.; Bhandary, D.; Ashok, T.A.; J, J.; Sowjanya, N.
This comprehensive review of zero-shot and few-shot learning techniques in natural language processing (NLP) traces their evolution from traditional methods to cutting-edge approaches. For zero-shot learning, these include transfer learning and pre-trained language models, semantic embedding, attribute-based approaches, and generative models for data augmentation; for few-shot learning, they include meta-learning, model-agnostic meta-learning (MAML), relationship networks, and prototypical networks. Real-world applications underscore the adaptability and efficacy of these techniques across various NLP tasks in both industry and academia. Acknowledging challenges inherent in zero-shot and few-shot learning, this review identifies limitations and suggests avenues for improvement. It emphasizes theoretical foundations alongside practical considerations such as accuracy and generalization across diverse NLP tasks. By consolidating key insights, this review provides researchers and practitioners with valuable guidance on the current state and future potential of zero-shot and few-shot learning techniques in addressing real-world NLP challenges. Looking ahead, this review aims to stimulate further research, fostering a deeper understanding of the complexities and applicability of zero-shot and few-shot learning techniques in NLP. By offering a roadmap for future exploration, it seeks to contribute to the ongoing advancement and practical implementation of NLP technologies across various domains. © The Author(s) 2025.

Item Ontology-driven Text Feature Modeling for Disease Prediction using Unstructured Radiological Notes (Instituto Politecnico Nacional revista@cic.ipn.mx, 2019)
S. Krishnan, G.S.; Kamath S., S.
Clinical Decision Support Systems (CDSSs) support medical personnel by offering aid in decision-making and timely interventions in patient care. Typically, such systems are built on structured Electronic Health Records (EHRs), which, unfortunately, have a very low adoption rate in developing countries at present. In such situations, clinical notes recorded by medical personnel, though unstructured, can be a significant source of rich patient-related information. However, conversion of unstructured clinical notes to a structured EHR form is a manual and time-consuming task, underscoring a critical need for more efficient, automated methods. In this paper, a generic disease prediction CDSS built on unstructured radiology text reports is proposed. We incorporate word embeddings and clinical ontologies to model the textual features of the patient data for training a feed-forward neural network for ICD9 disease group prediction. The proposed model built on unstructured text outperformed the state-of-the-art model built on structured data by 9% in terms of AUROC and 23% in terms of AUPRC, thus eliminating the dependency on the availability of structured clinical data. © 2019 Instituto Politecnico Nacional. All rights reserved.

Item Diagnostic Performance Evaluation of Deep Learning-Based Medical Text Modelling to Predict Pulmonary Diseases from Unstructured Radiology Free-Text Reports (Prague University of Economics and Business, 2023)
Shetty, S.; Ananthanarayana, V.S.; Mahale, A.
Pulmonary diseases are the third most common cause of death worldwide, making it imperative to diagnose them promptly. Radiology is a medical discipline that utilizes medical imaging to guide treatment. Radiologists prepare reports interpreting details and findings analysed from medical images. Radiology free-text reports are a rich source of textual information that can be exploited to enhance the efficacy of medical prognosis, treatment and research.
Radiology reports exist in an unstructured format and are not suitable by themselves for mathematical computation or machine learning operations. Therefore, natural language processing (NLP) strategies are employed to convert unstructured natural language text into a structured format that can be fed into machine learning (ML) or deep learning (DL) models for information extraction. We propose a DL-based medical text modelling framework incorporating a knowledge base to predict pulmonary diseases from unstructured radiology free-text reports. We make detailed diagnostic performance evaluations of our proposed technique by comparing it with state-of-the-art NLP techniques on radiology free-text reports extracted from two medical institutions. The comprehensive analysis shows that the proposed model achieves superior results compared to existing state-of-the-art text modelling techniques. © 2023 Prague University of Economics and Business. All Rights Reserved.

Item Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages (Springer Science and Business Media Deutschland GmbH, 2023)
Anand Kumar, A.K.; Padannayil, S.K.
Massive amounts of unstructured content are generated every day on social media platforms such as Facebook, Twitter and blogs. Analyzing and extracting useful information from this vast amount of text content is a challenging process. Social media currently provide extensive opportunities for researchers and practitioners to do research in this area. Most of the text content in social media tends to be either in English or in code-mixed regional languages. In a multilingual country like India, code-mixing is common in social media discussions. Multilingual users frequently use Roman script, a convenient mode of expression, instead of the regional language script for posting messages on social media, and often mix English into their native languages.
Stylistic and grammatical irregularities are significant challenges in processing code-mixed text using conventional methods. This paper presents a new word embedding based on character-level representations, used as features for POS tagging of code-mixed text in Indian languages, using the ICON-2015 and ICON-2016 NLP tools contest data sets. The proposed word embedding features are context-appended, and the well-known Support Vector Machine (SVM) classifier has been used to train the system. We have combined the Facebook, Twitter, and WhatsApp code-mixed data of three Indian languages to train a transfer-learning-based, language-independent and source-independent POS tagger. The experimental results demonstrated that the proposed transfer method achieved state-of-the-art accuracy in 12 of the 18 systems for the ICON data set. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Item Open-Domain Long-Form Question–Answering Using Transformer-Based Pipeline (Springer, 2023)
Dash, A.; Awachar, M.; Patel, A.; Rudra, B.
For a long time, question–answering has been a crucial part of natural language processing (NLP). This task refers to fetching accurate and complete answers for a question using certain support documents or knowledge sources. In recent years, much work has been done in this field, especially after the introduction of transformer models. However, analysis reveals that the majority of research in this domain focuses on answering questions curated to have short answers, and fewer works focus on long-form question–answering (LFQA). LFQA systems generate explanatory answers for questions and pose more challenges than the short-form version. This paper investigates the long-form question–answering task by proposing a system in the form of a pipeline consisting of various transformer-based models, enabling the system to give explanatory answers to open-domain long-form questions.
The pipeline mainly consists of a retriever module and a generator module. The retriever module retrieves the relevant support documents containing evidence to answer a question from a comprehensive knowledge source. The generator module then generates the final answer using the relevant documents retrieved by the retriever module. The Explain Like I’m Five (ELI5) dataset is used to train and evaluate the system, and the final results are documented using proper metrics. The system is implemented in the Python programming language using the PyTorch framework. According to the evaluation, the proposed LFQA pipeline outperforms the existing research works when evaluated on the Knowledge-Intensive Language Tasks (KILT) benchmark and is thus effective in question–answering tasks. © 2023, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.

Item Relation Extraction: Hypernymy Discovery Using a Novel Pattern Learning Algorithm (Springer, 2023)
Pinto, O.; Gole, S.; Srushti, H.P.; Anand Kumar, A.K.
This paper proposes a semi-supervised relation extraction methodology to extract hypernymy (Is-A) relations. We developed a pattern learning-based model based on a "most reliable pattern". After each iteration, the algorithm generates trusted instances of hypernym–hyponym pairs using only a corpus of text and a set of seed instances as the input. Sentences are masked and extracted, and patterns are discovered and ranked. A pattern-matching algorithm generates pairs, and a scoring function filters them. The generated pairs are added to the initial seed set via a bootstrapping approach, which helps the iterative algorithm generate a new trusted pair set. The work presented here is a semi-supervised approach, and for the experiments conducted, we use two freely available public Wikipedia text corpora to extract hypernyms.
We use Hearst patterns, an extended version of Hearst patterns (adding more patterns), and a dependency-based approach to form a base for comparison to our developed pattern learning approach. To evaluate the proposed algorithm, the hypernym–hyponym relations obtained are tested against five standard publicly available datasets, namely BLESS, WBLESS, WEEDS, EVAL, and LEDS. The results on the two Wikipedia text corpora and five evaluation datasets show that the pattern learning approach performs better than the three baseline algorithms. The lack of heavy skewness in results across the two datasets also indicates that the algorithms implemented are independent of the corpus used and can be used on any large corpus. © 2023, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.

Item DBNLP: detecting bias in natural language processing system for India-centric languages (Springer Science and Business Media B.V., 2025)
Keerthan Kumar, K.K.; Mendke, S.; Parihar, R.; Mayya, S.; Venkatesh, S.; Koolagudi, S.G.
Natural language processing (NLP) is gaining widespread interest and advancing rapidly due to its attractive and exhilarating applications. NLP models are being developed for real-world scenarios such as search engines, language translation, sentiment analysis, chatbots such as ChatGPT, and auto-completion. These models are trained on a vast corpus of online data, exposing them to harmful biases and stereotypes towards various communities. The models learn these biases, making harmful and undesirable predictions about particular genders, religions, races, and professions. Biases in NLP systems can perpetuate societal biases and discrimination, leading to unfair and unequal treatment of individuals or groups. It is crucial to identify these biases, which will help mitigate them.
Most of the existing work in this area has been Western-centric, focusing on the English language, which makes it difficult to apply to Indian models and languages. In this work, we propose a model called Detecting Bias in Natural Language Processing System for India-Centric Languages (DBNLP), which aims to identify the biases relevant to the Indian context present in text-based language models, particularly for the English and Hindi languages. DBNLP presents three techniques for bias identification based on (1) a Context Association Test (CAT), (2) a template-based perturbation technique for various co-domain associations, and (3) a co-occurrence count-based corpus analysis technique. Further, this work showcases how India-centric models such as IndicBERT and MuRIL, and datasets such as IndicCorp, are biased toward various demographic categories. Detecting bias in natural language processing systems for India-centric languages is essential to creating fair, diverse, and inclusive models that benefit society. © Bharati Vidyapeeth's Institute of Computer Applications and Management 2025.

Item Machine Learning Framework for Classification of COVID-19 Variants Using K-mer Based DNA Sequencing (John Wiley and Sons Inc, 2025)
Kumar, S.; Raju, S.; Bhowmik, B.
Accurate classification of viral DNA sequences is essential for tracking mutations, understanding viral evolution, and enabling timely public health responses. Traditional alignment-based methods are often computationally intensive and less effective for highly mutating viruses. This article presents a machine learning framework for classifying DNA sequences of COVID-19 variants using K-mer-based tokenization and vectorization techniques inspired by Natural Language Processing (NLP). DNA sequences corresponding to Alpha, Beta, Gamma, and Omicron variants are obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database and encoded into feature vectors.
Multiple classifiers, including Extra Trees, Random Forest, Support Vector Classifier (SVC), Decision Tree, Logistic Regression, Naive Bayes, K-Nearest Neighbor (KNN), Ridge Classifier, Stochastic Gradient Descent (SGD), and XGBoost, are evaluated based on accuracy, precision, recall, and F1-score. The Extra Trees model achieved the highest accuracy of 93.10% ± 0.42, followed by Random Forest with 92.60% ± 0.38, both demonstrating robust and balanced performance. Statistical significance tests confirmed the robustness of the results. The results validate the effectiveness of K-mer-based encoding combined with traditional machine learning models in classifying COVID-19 variants, offering a scalable and efficient solution for genomic surveillance. © 2025 Wiley Periodicals LLC.

Item Human-in-the-Loop Data Analytics for Classifying Fatal Mining Accident Causes Using Natural Language Processing and Machine Learning Techniques (Springer Science and Business Media Deutschland GmbH, 2025)
Sharma, A.; Kumar, A.; Vardhan, H.; Mangalpady, A.; Mandal, B.B.; Senapati, A.; Akhil, A.; Saini, S.
Mining remains one of the most hazardous industries globally, marked by frequent fatalities resulting from complex operational risks. While accident investigation reports hold valuable insights for improving safety practices, the manual coding of fatality narratives remains labor-intensive, inconsistent, and impractical for large datasets. Although natural language processing (NLP) and machine learning (ML) techniques have gained traction for automating the analysis of safety narratives in other high-risk industries, their application to mining accident data, particularly within the Indian context, remains limited. Addressing this gap, the present study proposes an ML framework for the semi-automated classification of fatal accident causes from unstructured text narratives reported by the Directorate General of Mines Safety (DGMS) between 2016 and 2022.
A total of 401 fatal accident descriptions were pre-processed and vectorized using Bag-of-Words, TF-IDF, and Word2Vec techniques, followed by model evaluation across multiple algorithms. A semi-automated classification scheme was developed to balance efficiency with expert oversight, where high-confidence predictions were assigned automatically and uncertain cases were flagged for manual review. Logistic regression combined with TF-IDF unigram features achieved the highest performance, with an F1 score of 0.78 and an accuracy of 0.81. Overall, the developed framework successfully auto-coded 68.75% of cases with 94% accuracy, 0.93 recall, and 0.91 precision. Word cloud visualizations were also employed to capture dominant words associated with different cause categories. The proposed framework offers a practical and operationally feasible solution for assigning fatality causes in the mining sector, contributing to active safety management, surveillance, and policy formulation. © Society for Mining, Metallurgy & Exploration Inc. 2025.
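
The semi-automated scheme described for the mining accident study (high-confidence predictions assigned automatically, uncertain cases flagged for manual review) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route_predictions`, the 0.8 threshold, the cause labels, and the probabilities are all hypothetical, since the abstract does not report the threshold or the category names.

```python
from typing import Dict, List, Tuple

# Hypothetical threshold; the abstract does not report the value used.
CONFIDENCE_THRESHOLD = 0.8


def route_predictions(
    probabilities: List[Dict[str, float]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Split cases into auto-coded (index, label) pairs and indices needing review."""
    auto_coded: List[Tuple[int, str]] = []
    needs_review: List[int] = []
    for index, class_probs in enumerate(probabilities):
        # Take the most probable class and its probability for this case.
        label, confidence = max(class_probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            auto_coded.append((index, label))  # confident enough to assign
        else:
            needs_review.append(index)  # defer to a human expert
    return auto_coded, needs_review


# Made-up per-class probabilities, standing in for a probabilistic
# classifier's output; the labels are illustrative only.
example = [
    {"fall of roof": 0.92, "dumper": 0.05, "explosives": 0.03},
    {"fall of roof": 0.45, "dumper": 0.40, "explosives": 0.15},
    {"fall of roof": 0.10, "dumper": 0.85, "explosives": 0.05},
]
auto_coded, needs_review = route_predictions(example)
coverage = len(auto_coded) / len(example)  # share of cases coded automatically
```

Raising the threshold trades coverage for reliability: fewer cases are coded automatically, and more are sent to manual review.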
