DBNLP: detecting bias in natural language processing system for India-centric languages

dc.contributor.author: Keerthan Kumar, K.K.
dc.contributor.author: Mendke, S.
dc.contributor.author: Parihar, R.
dc.contributor.author: Mayya, S.
dc.contributor.author: Venkatesh, S.
dc.contributor.author: Koolagudi, S.G.
dc.date.accessioned: 2026-02-03T13:19:47Z
dc.date.issued: 2025
dc.description.abstract: Natural language processing (NLP) is attracting widespread interest and advancing rapidly owing to its compelling applications. NLP models are deployed in real-world settings such as search engines, language translation, sentiment analysis, chatbots such as ChatGPT, and auto-completion. These models are trained on vast corpora of online data, which expose them to harmful biases and stereotypes about various communities. The models learn these biases and make harmful, undesirable predictions about particular genders, religions, races, and professions. Biases in NLP systems can perpetuate societal biases and discrimination, leading to unfair and unequal treatment of individuals or groups. Identifying these biases is a crucial first step toward mitigating them. Most prior work in this area has been Western-centric and focused on the English language, which makes it difficult to apply to Indian models and languages. In this work, we propose Detecting Bias in Natural Language Processing System for India-Centric Languages (DBNLP), which identifies biases relevant to the Indian context in text-based language models, particularly for English and Hindi. DBNLP presents three bias-identification techniques: (1) a Context Association Test (CAT), (2) a template-based perturbation technique for various co-domain associations, and (3) a co-occurrence count-based corpus analysis technique (illustrative sketches of techniques (2) and (3) follow this record). Further, this work shows how India-centric models such as IndicBERT and MuRIL, and datasets such as IndicCorp, are biased toward various demographic categories. Detecting bias in natural language processing systems for India-centric languages is essential to creating fair, diverse, and inclusive models that benefit society. © Bharati Vidyapeeth's Institute of Computer Applications and Management 2025.
dc.identifier.citation: International Journal of Information Technology (Singapore), 2025, 17(6), pp. 3291-3306
dc.identifier.issn: 2511-2104
dc.identifier.uri: https://doi.org/10.1007/s41870-025-02437-9
dc.identifier.uri: https://idr.nitk.ac.in/handle/123456789/20217
dc.publisher: Springer Science and Business Media B.V.
dc.subject: Diverse and Inclusive models
dc.subject: Harmful biases and stereotypes
dc.subject: India-centric models
dc.subject: Indian context
dc.subject: Natural language processing
dc.title: DBNLP: detecting bias in natural language processing system for India-centric languages
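
The abstract enumerates three bias-identification techniques. To make the second concrete, the following minimal Python sketch probes a template-based perturbation: only the demographic term in a fixed template is swapped, and the fill-mask predictions of MuRIL (one of the India-centric models named in the abstract) are compared across groups. The template, the group list, and the use of the Hugging Face fill-mask pipeline are illustrative assumptions, not the authors' DBNLP implementation.

# Hypothetical template-based perturbation probe (technique 2).
# Not the DBNLP code: the template, groups, and scoring are assumptions.
from transformers import pipeline

# MuRIL is a BERT-style India-centric masked language model.
fill = pipeline("fill-mask", model="google/muril-base-cased")

template = "The {group} person works as a [MASK]."
groups = ["Hindu", "Muslim", "Sikh", "Christian"]  # illustrative categories

for group in groups:
    # Perturb only the demographic slot and inspect the top completions;
    # large divergences across groups indicate a learned association.
    preds = fill(template.format(group=group), top_k=3)
    print(group, [(p["token_str"], round(p["score"], 4)) for p in preds])

For the third technique, a minimal co-occurrence sketch over a pre-tokenized corpus follows; the target and attribute word lists are hypothetical placeholders rather than the paper's IndicCorp setup.

# Hypothetical co-occurrence count-based corpus analysis (technique 3).
from collections import Counter

def cooccurrence_counts(tokens, targets, attributes, window=5):
    # Count (target, attribute) pairs that appear within `window` tokens
    # of each other; skewed counts across targets hint at corpus bias.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in attributes:
                counts[(tok, tokens[j])] += 1
    return counts

# Toy usage with made-up word lists (not the paper's lexicons).
corpus = "the nurse said she was tired while the doctor said he was busy".split()
print(cooccurrence_counts(corpus, targets={"she", "he"},
                          attributes={"nurse", "doctor"}, window=3))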
