DBNLP: detecting bias in natural language processing system for India-centric languages
Date
2025
Publisher
Springer Science and Business Media B.V.
Abstract
Natural language processing (NLP) is gaining widespread interest and advancing rapidly thanks to its compelling applications. NLP models are deployed in real-world settings such as search engines, language translation, sentiment analysis, chatbots such as ChatGPT, and auto-completion. These models are trained on vast corpora of online data, exposing them to harmful biases and stereotypes toward various communities. The models learn these biases and make harmful, undesirable predictions about particular genders, religions, races, and professions. Biases in NLP systems can perpetuate societal biases and discrimination, leading to unfair and unequal treatment of individuals or groups. Identifying these biases is a crucial first step toward mitigating them. Most prior work in this area has been Western-centric and focused on the English language, making it difficult to apply to Indian models and languages. In this work, we propose Detecting Bias in Natural Language Processing System for India-Centric Languages (DBNLP), which aims to identify biases relevant to the Indian context in text-based language models, particularly for English and Hindi. DBNLP presents three bias-identification techniques: (1) a Context Association Test (CAT), (2) a template-based perturbation technique for various co-domain associations, and (3) a co-occurrence count-based corpus analysis. Further, this work shows how India-centric models such as IndicBERT and MuRIL, and datasets such as IndicCorp, exhibit biases with respect to various demographic categories. Detecting bias in natural language processing systems for India-centric languages is essential to building fair, diverse, and inclusive models that benefit society. © Bharati Vidyapeeth's Institute of Computer Applications and Management 2025.
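
The abstract does not spell out the three probes in detail, but two of them lend themselves to a short illustration. The following is a minimal Python sketch, under stated assumptions, of (a) a template-based perturbation probe in the spirit of technique (2), using the Hugging Face masked-LM interface to MuRIL (google/muril-base-cased), and (b) a crude co-occurrence counter in the spirit of technique (3). The template string, group words, attribute words, and scoring rule are hypothetical placeholders, not the paper's actual test sets or metrics.

# Minimal sketch of two bias probes in the spirit of the abstract's techniques (2) and (3).
# All concrete templates and word lists below are illustrative assumptions only.

from collections import Counter

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "google/muril-base-cased"  # MuRIL, one of the India-centric models named above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def masked_fill_probability(template: str, group_word: str, target_word: str) -> float:
    """P(target_word at the [MASK] slot | template with the demographic slot = group_word).

    Template-based perturbation: only the group word changes between runs, so
    probability gaps across groups hint at associations the model has learned.
    """
    target_ids = tokenizer(target_word, add_special_tokens=False)["input_ids"]
    if len(target_ids) != 1:
        raise ValueError(f"{target_word!r} must map to a single wordpiece for this simple probe")
    text = template.format(group=group_word, mask=tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos[0]], dim=-1)
    return probs[target_ids[0]].item()


def cooccurrence_counts(corpus_lines, group_words, attribute_words, window=10):
    """Count how often each (group word, attribute word) pair co-occurs within a token window.

    A crude stand-in for the co-occurrence count-based corpus analysis: heavily
    skewed counts for one group suggest a skewed training corpus.
    """
    counts = Counter()
    for line in corpus_lines:
        tokens = line.lower().split()
        for i, tok in enumerate(tokens):
            if tok in group_words:
                context = tokens[max(0, i - window): i + window + 1]
                for attr in attribute_words:
                    if attr in context:
                        counts[(tok, attr)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical probe: how strongly does the model associate a profession
    # with different religious groups in an otherwise identical sentence?
    template = "The {group} man works as a {mask}."
    for group in ["Hindu", "Muslim", "Sikh", "Christian"]:
        p = masked_fill_probability(template, group, "doctor")
        print(f"{group:10s} P('doctor') = {p:.4f}")

    # Hypothetical corpus slice; in practice this would iterate over text files
    # from a corpus such as IndicCorp.
    sample = ["the hindu shopkeeper was praised as honest",
              "the muslim shopkeeper was praised as honest"]
    print(cooccurrence_counts(sample, {"hindu", "muslim"}, {"honest"}))

Large probability gaps across the perturbed group words, or strongly lopsided co-occurrence counts, are the kind of signal such probes surface; the paper's own CAT, templates, and metrics would replace these placeholders.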
Keywords
Diverse and Inclusive models, Harmful biases and stereotypes, India-centric models, Indian context, Natural language processing
Citation
International Journal of Information Technology (Singapore), 2025, 17(6), pp. 3291-3306
