Code-Mixed Text Generation and Identification in Low-Resource Indian Languages
Track Overview
Code-mixed text, where speakers naturally switch between multiple languages within the same sentence, is increasingly common in multilingual societies and social media communication. In India, multilingual users frequently mix regional languages with English while writing in Roman script, creating unique linguistic challenges for Natural Language Processing (NLP) systems.
The TriMixGen-Indic track focuses on understanding and processing code-mixed Indic languages written in Roman script, in particular trilingual combinations such as Hindi–Bengali–English and Hindi–Gujarati–English. The track covers both the generative and the analytical sides of multilingual text processing.
Participants will develop AI-based systems for generating natural and contextually meaningful code-mixed text while also identifying the language of each word at the token level. This task reflects real-world multilingual communication patterns observed in conversational platforms and social media.
The track highlights challenges such as script normalization, phonetic spelling variations, linguistic ambiguity, and seamless language switching in Romanized Indic text. It aims to advance research in multilingual NLP, code-mixed language processing, sequence labeling, and low-resource language technologies.
Task Description
This track consists of two subtasks: code-mixed sentence generation and word-level language identification for Indic languages.
Subtask A: Code-Mixed Sentence Generation
Participants are required to generate fluent and contextually appropriate trilingual code-mixed sentences using combinations such as Hindi–Bengali–English and Hindi–Gujarati–English. The generated sentences should reflect realistic language mixing patterns commonly observed in conversational and social media text.
Subtask B: Word-Level Language Identification
Participants are required to identify the language of each word in a generated code-mixed sentence. This is a token-level classification task where every word must be labeled according to its corresponding language category.
| Representation | Sentence |
|---|---|
| Hindi | मुझे आज देर हो गई क्योंकि ट्रैफिक बहुत ज्यादा था। |
| Hindi (Transliterated) | Mujhe aaj der ho gayi kyunki traffic bahut zyada tha. |
| Bengali | আজ আমার দেরি হয়ে গেছে কারণ ট্রাফিক খুব বেশি ছিল। |
| Bengali (Transliterated) | Aaj amar deri hoye geche karon traffic khub beshi chhilo. |
| English | I got late today because the traffic was very heavy. |
| Code-Mixed (CM) | आज मुझे देर हो गई क्योंकि traffic খুব বেশি था। |
| Subtask A: Code-Mixed Generation | Aaj mujhe der ho gayi kyunki traffic khub beshi tha. |
| Subtask B: Language Identification | HIN HIN HIN HIN HIN HIN ENG BEN BEN HIN UNI |
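To make the token-level labeling in the example concrete, the following minimal Python sketch pairs each token of the Subtask A sentence with its Subtask B tag. The tokenization rule (words plus standalone punctuation marks) and the helper names `tokenize` and `align` are assumptions made for illustration; the track's official tokenization and data format are not specified here.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and standalone punctuation marks.

    Assumed scheme for illustration only; the official tokenizer may differ.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


def align(sentence, tag_string):
    """Pair every token with its language tag, failing loudly on a length mismatch."""
    tokens = tokenize(sentence)
    tags = tag_string.split()
    if len(tokens) != len(tags):
        raise ValueError(
            f"{len(tokens)} tokens but {len(tags)} tags; check the tokenization"
        )
    return list(zip(tokens, tags))


# Sentence and tags copied from the example table.
sentence = "Aaj mujhe der ho gayi kyunki traffic khub beshi tha."
tags = "HIN HIN HIN HIN HIN HIN ENG BEN BEN HIN UNI"

pairs = align(sentence, tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Under this tokenization the sentence yields eleven tokens, matching the eleven tags in the table, with the final full stop receiving the `UNI` tag.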
Participants will be provided with parallel multilingual sentences in Hindi, Bengali, Gujarati, and English written in Roman script. The dataset is curated from publicly available resources and designed to support both text generation and token-level language labeling tasks.
Subtask A will be evaluated using human judgment together with automatic metrics such as the Code-Mixing Index (CMI), M-Index, I-Index, SyMCoM, and Pseudo Log Likelihood (PLL). Subtask B will be evaluated using the Macro F1-score.
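As a rough illustration of two of these metrics, the sketch below computes CMI and Macro F1 from tag sequences. It assumes the commonly cited CMI formulation, 100 × (1 − max language count / (n − u)), where n is the number of tokens and u the number of language-independent tokens, and it assumes that `UNI` marks such language-independent tokens. The function names are hypothetical, and the official scoring scripts may differ in label set and edge-case handling.

```python
from collections import Counter


def code_mixing_index(tags, neutral="UNI"):
    """CMI for one utterance: 100 * (1 - dominant / (n - u)).

    Treating `neutral` tokens as language-independent is an assumption.
    """
    lang_counts = Counter(t for t in tags if t != neutral)
    lang_total = sum(lang_counts.values())  # n - u
    if lang_total == 0:
        return 0.0  # no language-tagged tokens, so no mixing
    dominant = max(lang_counts.values())
    return 100.0 * (lang_total - dominant) / lang_total


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn))  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


# Gold tags from the example table; pred is a made-up system output
# that mislabels "khub" (BEN) as HIN.
gold = "HIN HIN HIN HIN HIN HIN ENG BEN BEN HIN UNI".split()
pred = list(gold)
pred[7] = "HIN"

print("CMI:", code_mixing_index(gold))
print("Macro F1:", round(macro_f1(gold, pred), 3))
```

For the example sentence, seven of the ten language-tagged tokens are Hindi, so this formulation gives a CMI of 30. Because Macro F1 weights every label equally, the single Bengali error in the made-up prediction lowers the score more than its share of tokens would suggest.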
Contact Us
For any queries, updates, or discussions, please join our official Google Group: trimixgen-indic@googlegroups.com