Contextual Code-Mixed Translations with LLM-Based Data Augmentation
Date
2025
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Code-mixed conversational translation presents unique challenges for neural machine translation (NMT) systems due to its context-dependent nature and linguistic complexity. This paper investigates context-aware translation approaches for Hinglish-to-English code-mixed text, comparing the performance of fine-tuned state-of-the-art (SOTA) language models, and proposes an approach that incorporates contextual information through a novel preprocessing technique together with prompt-based synthetically generated training data. Through comprehensive experiments across distinct configurations of two SOTA models, Mistral-7B-v0.3 and IndicTrans2, the findings demonstrate that Mistral-7B-v0.3 fine-tuned on context-enriched data with synthetic examples achieves state-of-the-art performance, a 14.3% improvement over previous approaches. Furthermore, domain-specific patterns emerge in the optimal model configuration: conversational data benefits most from context-aware models with synthetic data augmentation, while non-conversational translation performs best with synthetic-augmented datasets without contextual enrichment. This research contributes valuable insights into the design of effective translation systems for code-mixed language and establishes new benchmarks for this increasingly important domain of multilingual NLP. © 2025 IEEE.
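The context-enrichment step described in the abstract can be illustrated with a minimal sketch. The paper does not publish its exact input format, so the `<ctx>`/`<src>` separators, the two-turn context window, and the function name below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: build context-enriched source strings for
# conversational code-mixed translation by prepending a window of
# previous turns to each Hinglish utterance. The marker tokens and
# window size are assumptions for illustration only.

def enrich_with_context(turns, window=2, ctx_sep=" <ctx> ", src_sep=" <src> "):
    """Return one context-enriched source string per turn.

    Each output contains up to `window` preceding turns joined by
    `ctx_sep`, followed by `src_sep` and the current turn; the first
    turn (no prior context) is left unchanged.
    """
    enriched = []
    for i, turn in enumerate(turns):
        context = turns[max(0, i - window):i]
        if context:
            enriched.append(ctx_sep.join(context) + src_sep + turn)
        else:
            enriched.append(turn)
    return enriched

# Example Hinglish conversation (romanized Hindi-English mix).
turns = [
    "kal movie dekhne chalein?",
    "haan sure, kaunsi movie?",
    "wahi new wali, sab bol rahe hain achhi hai",
]
for line in enrich_with_context(turns):
    print(line)
```

Under this sketch, each fine-tuning example pairs the enriched string with its English reference, so the model sees the conversational history that disambiguates context-dependent code-mixed phrases.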
Keywords
code-mixed, context-aware translation, multilingual conversational systems, neural machine translation
Citation
3rd IEEE International Conference on Networks, Multimedia and Information Technology (NMITCON 2025), 2025.
