Contextual Code-Mixed Translations with LLM-Based Data Augmentation
Date
2025
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Code-mixed conversational translation presents unique challenges for neural machine translation (NMT) systems due to its context-dependent nature and linguistic complexity. This paper investigates context-aware translation approaches for Hinglish-to-English code-mixed text, comparing the performance of fine-tuned state-of-the-art (SOTA) language models, and proposes an approach that incorporates contextual information through a novel preprocessing technique together with prompt-based synthetically generated training data. Through comprehensive experiments across distinct configurations of two SOTA models, Mistral-7B-v0.3 and IndicTrans2, the findings demonstrate that Mistral-7B-v0.3 fine-tuned on context-enriched data with synthetic examples achieves state-of-the-art performance, a 14.3% improvement over previous approaches. Furthermore, domain-specific patterns emerge in the optimal model configuration: conversational data benefits most from context-aware models with synthetic data augmentation, while non-conversational translation performs best with synthetic-augmented datasets without contextual enrichment. This research contributes valuable insights into the design of effective translation systems for code-mixed language and establishes new benchmarks for this increasingly important domain of multilingual NLP. © 2025 IEEE.
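The context-enrichment step described in the abstract can be illustrated with a minimal sketch. The paper does not publish its exact input format, so the `<ctx>`/`<src>` separators, the two-turn context window, and the function name below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: build context-enriched source strings for
# conversational code-mixed translation by prepending a window of
# previous turns to each Hinglish utterance. The marker tokens and
# window size are assumptions for illustration only.

def enrich_with_context(turns, window=2, ctx_sep=" <ctx> ", src_sep=" <src> "):
    """Return one context-enriched source string per turn.

    Each output contains up to `window` preceding turns joined by
    `ctx_sep`, followed by `src_sep` and the current turn; the first
    turn (no prior context) is left unchanged.
    """
    enriched = []
    for i, turn in enumerate(turns):
        context = turns[max(0, i - window):i]
        if context:
            enriched.append(ctx_sep.join(context) + src_sep + turn)
        else:
            enriched.append(turn)
    return enriched

# Example Hinglish conversation (romanized Hindi-English mix).
turns = [
    "kal movie dekhne chalein?",
    "haan sure, kaunsi movie?",
    "wahi new wali, sab bol rahe hain achhi hai",
]
for line in enrich_with_context(turns):
    print(line)
```

Under this sketch, each fine-tuning example pairs the enriched string with its English reference, so the model sees the conversational history that disambiguates context-dependent code-mixed phrases.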
Keywords
code-mixed, context-aware translation, multilingual conversational systems, neural machine translation
Citation
3rd IEEE International Conference on Networks, Multimedia and Information Technology (NMITCON 2025), 2025.
