Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages


Date

2023

Publisher

Springer Science and Business Media Deutschland GmbH

Abstract

Massive amounts of unstructured content are generated every day on social media platforms such as Facebook, Twitter and blogs. Analyzing and extracting useful information from this vast amount of text is a challenging process. Social media now offer researchers and practitioners extensive opportunities for research in this area. Most of the text content in social media tends to be either in English or in code-mixed regional languages. In a multilingual country like India, code-mixing is common in social media discussions. Multilingual users frequently post messages in Roman script, a convenient mode of expression, instead of the regional language script, and often mix English into their native languages. Stylistic and grammatical irregularities are significant challenges in processing code-mixed text using conventional methods. This paper presents a new word embedding based on character-level representations, used as features for POS tagging of code-mixed text in Indian languages, using the ICON-2015 and ICON-2016 NLP tools contest data sets. The proposed word embedding features are context-appended, and the well-known Support Vector Machine (SVM) classifier has been used to train the system. We have combined the Facebook, Twitter, and WhatsApp code-mixed data of three Indian languages to train a transfer-learning-based, language-independent and source-independent POS tagger. The experimental results demonstrated that the proposed transfer method achieved state-of-the-art accuracy in 12 of 18 systems for the ICON data set. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
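
The abstract does not specify how the character-level embedding or the context appending is computed, so the following is only a rough sketch of the general idea, not the authors' method. It assumes hashed character trigram counts as a stand-in for the learned character-level embedding, a context window of one word on each side, and an invented Romanized example sentence. The names `char_ngrams`, `word_vector`, `context_appended` and the size `DIM` are all hypothetical.

```python
import zlib

DIM = 16  # hypothetical vector size, chosen only for illustration


def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > marking its edges."""
    padded = f"<{word.lower()}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_vector(word, dim=DIM):
    """Count hashed character n-grams into a fixed-size vector.

    This is a simple stand-in for a learned character-level embedding.
    zlib.crc32 is used because it gives the same value on every run.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def context_appended(tokens, window=1, dim=DIM):
    """Concatenate each word vector with the vectors of its neighbours.

    Positions outside the sentence are filled with zero vectors.
    """
    pad = [0.0] * dim
    vectors = [word_vector(t, dim) for t in tokens]
    features = []
    for i in range(len(tokens)):
        row = []
        for j in range(i - window, i + window + 1):
            row.extend(vectors[j] if 0 <= j < len(tokens) else pad)
        features.append(row)
    return features


# Invented Hindi-English code-mixed sentence in Roman script.
tokens = "movie bahut accha tha".split()
features = context_appended(tokens)
```

Each row of `features` could then be passed, together with its POS label, to an SVM classifier. Because the vectors depend only on characters, the same feature extractor can be applied to text from any of the languages or platforms, which is the property a language-independent and source-independent tagger relies on.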

Keywords

Codes (symbols), Computational linguistics, Embeddings, Natural language processing systems, Social networking (online), Syntactics, Character and word embedding, Code-mixed script, ICON-2015, ICON-2016, Indian languages, Language processing, Natural language processing, Natural languages, Part of speech tagging, Parts-of-speech tagging, Social media, Support vectors machine, Transfer learning, Support vector machines

Citation

Journal of Ambient Intelligence and Humanized Computing, 2023, 14, 6, pp. 7207-7218
