Browsing by Author "Padannayil, K.P."

A comparative analysis of machine comprehension using deep learning models in code-mixed Hindi language
(Springer Verlag, 2019) Viswanathan, S.; Anand Kumar, M.; Padannayil, K.P.
The domain of artificial intelligence revolutionizes the way in which humans interact with machines. Machine comprehension is one of the latest fields under natural language processing that holds the capability for huge improvement in artificial intelligence. The machine comprehension technique gives systems the ability to understand a passage given by a user and answer questions asked from it, which is an evolved version of the traditional question answering technique. Machine comprehension is a main technique that falls under the category of natural language understanding, which exposes the amount of understanding required for a model to find the area of interest from a passage. The scope for the implementation of this technique is very high in India due to the availability of different regional languages. This work focuses on the incorporation of the machine comprehension technique in the code-mixed Hindi language. A detailed comparison of the performance of several deep learning approaches on the dataset is carried out, including End to End Memory Network, Dynamic Memory Network, Recurrent Neural Network, Long Short-Term Memory Network and Gated Recurrent Unit. The best suited model for the dataset used is identified from the comparison study. A new architecture is proposed in this work by combining two of the best performing networks. To improve the model with respect to the various ways of answering questions from a passage, the natural language processing technique of distributed word representation was applied to the best model identified. The model was improved by applying pre-trained fastText embeddings for word representations. This is the first implementation of machine comprehension models in the code-mixed Hindi language using deep neural networks. The work analyses the performance of all five models implemented, which will be helpful for future research on the machine comprehension technique in code-mixed Indian languages. © Springer Nature Switzerland AG 2019.
An overview of the shared task on machine translation in Indian languages (MTIL)-2017
(De Gruyter, 2019) Anand Kumar, M.A.; Premjith, B.; Singh, S.; Rajendran, S.; Padannayil, K.P.
In recent years, multilingual content on the internet has grown exponentially together with the evolution of the internet. Regional language users are excluded from using this multilingual content because of the language barrier. So, machine translation between languages is the only possible solution to make this content available to regional language users. Machine translation is the process of translating a text from one language to another. Machine translation has already been investigated well for English and other European languages. However, it is still at a nascent stage for Indian languages. This paper presents an overview of the Machine Translation in Indian Languages shared task conducted on September 7-8, 2017, at Amrita Vishwa Vidyapeetham, Coimbatore, India. This machine translation shared task in Indian languages is mainly focused on the development of English-Tamil, English-Hindi, English-Malayalam and English-Punjabi language pairs. The shared task has the following objectives: (a) to examine the state-of-the-art machine translation systems when translating from English to Indian languages; (b) to investigate the challenges faced in translating from English to Indian languages; (c) to create an open-source parallel corpus for Indian languages, which is lacking. Evaluating machine translation output is another challenging task, especially for Indian languages. In this shared task, we evaluated the participants' outputs with the help of human annotators. As far as we know, this is the first shared task that depends completely on human evaluation. © 2019 Walter de Gruyter GmbH, Berlin/Boston.
Deep learning architecture for big data analytics in detecting intrusions and malicious URL
(Institution of Engineering and Technology, 2019) Harikrishnan, N.B.; Ravi, R.; Padannayil, K.P.; Poornachandran, P.; Annappa, A.; Alazab, M.
Security attacks are one of the major threats in today’s world. These attacks exploit the vulnerabilities in a system or online sites for financial gain. As a result, both government and private firms suffer huge losses in revenue and reputation. These attacks are generally carried out through malware interception, intrusions and phishing uniform resource locators (URLs). There are techniques such as signature-based detection, anomaly detection and stateful protocol analysis to detect intrusions, and blacklisting to detect phishing URLs. Even though these techniques claim to thwart cyberattacks, they often fail to detect new attacks or variants of existing attacks. The second reason why these techniques fail is the dynamic nature of attacks and the lack of annotated data. In such a situation, we need a system that can capture the changing trends of cyberattacks to some extent. For this, we used supervised and unsupervised learning techniques. The growing problem of intrusions and phishing URLs generates a need for a reliable architecture-based solution that can efficiently identify intrusions and phishing URLs. This chapter aims to provide a comprehensive survey of intrusion and phishing URL detection techniques and deep learning. It presents and evaluates a highly effective deep learning architecture to automate intrusion and phishing URL detection. The proposed method is an artificial intelligence (AI)-based hybrid architecture for an organization, which provides supervised and unsupervised solutions to tackle intrusion and phishing URL detection. The prototype model uses various classical machine learning (ML) classifiers and deep learning architectures. The research specifically focuses on detecting and classifying intrusions and phishing URLs. © The Institution of Engineering and Technology 2020.
Dynamic mode-based feature with random mapping for sentiment analysis
(Springer Verlag, 2020) Sachin Kumar, S.; Anand Kumar, M.A.; Padannayil, K.P.; Poornachandran, P.
Sentiment analysis (SA) or polarity identification is a research topic that receives considerable attention. Work in this area attempts to explore the sentiments or opinions in text data related to any event, politics, movies, product reviews, sports, etc. The present article discusses the use of dynamic modes from the dynamic mode decomposition (DMD) method with random mapping for sentiment classification. Random mapping is performed using the random kitchen sink (RKS) method. The present work aims to explore the use of dynamic modes as the feature for the sentiment classification task. For the experiments and analysis, the dataset consists of tweets from the SAIL 2015 shared task (tweets in Tamil, Bengali and Hindi) and tweets in Malayalam. The Malayalam dataset was prepared by us for this work. The evaluations are performed using accuracy, F1-score, recall, and precision. It is observed from the evaluations that the proposed approach provides competitive results. © Springer Nature Singapore Pte Ltd. 2020.
Embedding linguistic features in word embedding for preposition sense disambiguation in English-Malayalam machine translation context
(Springer Verlag, 2019) Premjith, B.; Padannayil, K.P.; Anand Kumar, M.; Jyothi Ratnam, D.
Preposition sense disambiguation has huge significance in natural language processing tasks such as machine translation. Transferring the various senses of a simple preposition in the source language to a set of senses in the target language has high complexity due to these many-to-many relationships, particularly in English-Malayalam machine translation. In order to reduce this complexity in the transfer of senses, in this paper we used linguistic information such as noun class features and verb class features of the respective noun and verb correlated to the target simple preposition. The effect of these linguistic features on the proper classification of the senses (postpositions in Malayalam) is studied with the help of several machine learning algorithms. The study showed that the classification accuracy is higher when both verb and noun class features are taken into consideration. In linguistics, the major factor that decides the sense of the preposition is the noun in the prepositional phrase. The same trend was observed in the study when the training data contained only noun class features; that is, noun class features dominate the verb class features. © Springer Nature Switzerland AG 2019.
Extraction of named entities from social media text in Tamil language using N-gram embedding for disaster management
(Springer Verlag, 2020) Remmiya Devi, G.R.; Anand Kumar, M.A.; Padannayil, K.P.
In the present era, data in any form is considered to be of great importance. More specifically, text data carries richer and more concise information than any other form of data. Extraction and analysis of these data can result in various new findings through text analytics. This has led to applications such as search engines, extraction of product names, sentiment analysis, document classification and a few more. Companies are much focused on sentiment analysis to review the positive, negative and neutral comments on their products. Summarization of text is a notable application of Natural Language Processing (NLP) that reveals the gist of documents. Apart from these, applications based on information extraction can be developed for the welfare of society. Handling an emergency situation requires the collection of vast information. Extraction of such data can be supportive during disaster management. To perform such a task, a system must learn the meaning of human languages. Easing the accessibility of text data across language barriers is the primary motive of NLP systems. The proposed system utilizes a word embedding model, specifically the skip-gram model, to implement the most fundamental task of NLP, entity extraction, in social media text. Implementation of N-gram embedding methods paved the way for the creation of rich context knowledge for the system to handle social media text. Classification of named entities in the proposed system has been carried out using the machine learning classifier Support Vector Machine (SVM). © Springer Nature Switzerland AG 2020.
Indian native language identification - INLI 2018
(Association for Computing Machinery, 2018) Anand Kumar, M.; Barathi Ganesh, H.B.; Padannayil, K.P.; Ajay, S.G.
The growth of digital platforms enables industries to provide user-specific services. Most of the time, information about internet users is not explicitly available, which acts as a constraint in developing personalized applications. Hence the need for author profiling tasks, which aim to predict internet users' characteristics from their texts. Native language identification is one such author profiling task; it predicts an author's native language from texts written in another language. We have proposed the Indian Native Language Identification task, where the internet users' texts are written in English and participants need to determine whether the user's native language is Tamil, Malayalam, Kannada, Telugu, Bengali or Hindi. The corpus is collected from regional newspaper pages available on Facebook, under the hypothesis that a user who belongs to a particular region will read the news from the respective regional newspaper. © 2018 Association for Computing Machinery.
Intrinsic evaluation for English–Tamil bilingual word embeddings
(Springer Verlag, 2020) Jp, J.P.; Krishna Menon, V.K.; Rajendran, S.; Padannayil, K.P.; Anand Kumar, M.A.
Despite the growth of bilingual word embeddings, no work has been done so far on directly evaluating them for the English–Tamil language pair. In this paper, we present a dataset and an evaluation paradigm for the English–Tamil bilingual word vector model. The dataset contains words that cover a range of concepts that occur in natural language. The dataset is scored based on similarity rather than association or relatedness. Hence, word pairs that are associated but not literally similar have a low rating. The measures are quantified further to ensure consistency in the dataset, mimicking the cognitive phenomena. Consequently, the dataset can be used by non-native speakers with minimal effort. We also present some inferences and insights into the semantics captured by word vectors and human cognition. © Springer Nature Singapore Pte Ltd. 2020.
MedNLU: Natural Language Understander for Medical Texts
(Springer Science and Business Media Deutschland GmbH, 2020) Barathi Ganesh, H.B.; Reshma, U.; Padannayil, K.P.; Anand Kumar, M.
Natural Language Understanding is one of the essential tasks for building clinical text-based applications. Understanding of these clinical texts can be achieved through Vector Space Models and Sequential Modelling tasks. This paper focuses on sequential modelling, i.e., Named Entity Recognition and Part of Speech Tagging, attaining a state-of-the-art F1 score of 93.8% on the i2b2 clinical corpus and an F1 score of 97.29% on the GENIA corpus. This paper also reports the performance of feature fusion by integrating word embedding, feature embedding and character embedding for sequential modelling tasks. We also propose a framework based on a sequential modelling architecture, named MedNLU, which is capable of performing Part of Speech Tagging, Chunking, and Entity Recognition on clinical texts. The sequence modeler in MedNLU is an integrated framework of a Convolutional Neural Network, Conditional Random Fields and a Bi-directional Long Short-Term Memory network. © 2020, Springer Nature Switzerland AG.
On developing handwritten character image database for Malayalam language script
(Elsevier B.V., 2019) Manjusha, K.; Anand Kumar, M.A.; Padannayil, K.P.
The objective of this paper is to build a handwritten character image database for Malayalam language script. Standard handwritten document image databases are an essential requirement for the development and objective evaluation of different handwritten text recognition systems for any language script. Considerable research efforts on handwritten Malayalam character recognition are present in the literature. Still, no public domain handwritten image database is available for the Malayalam language. The present work focuses on building an open source handwritten character image database for Malayalam language script. The unique orthographic representations of the Malayalam characters form the different character classes, and the current version of the database contains 85 character classes frequently used in writing Malayalam text. Handwritten data samples were collected from 77 native Malayalam writers. To extract the character images from the handwritten data sheets, an active contour model-based image segmentation algorithm was utilized. Recognition experiments were conducted on the created character image database by employing different feature extraction techniques. Among the considered feature descriptors, scattering convolutional network-based feature descriptors attain the highest recognition accuracy of 91.05%. © 2018 Karabuk University
Ontological Structure-Based Retrieval System for Tamil
(Springer Science and Business Media Deutschland GmbH, 2021) Rajendran, S.; Padannayil, K.P.; Anand Kumar, M.; Sankaralingam, C.
Ontological structure of Tamil (OST) is an outcome of extensive research activity in the field of lexical semantics of Tamil over the last three decades. Rajendran’s (Semantic structure of Tamil vocabulary. Report of the UGC sponsored postdoctoral work (in manuscript). Deccan College Post-Doctoral Research Institute, Pune, 1983) post-doctoral research work went through several stages before culminating in OST. It depicts the journey from Tamil thesaurus to Tamil WordNet and on to OST. OST is a lexical resource which amalgamates all sorts of information available in a dictionary, thesaurus and WordNet. The Dravidian WordNets (of which Tamil WordNet is one of the four components) built under the Indo-WordNet project depended on an ontology developed from the Western conceptualization of the world found in English. This has not taken into consideration the Indian conceptualization of the world depicted in the nikhandu tradition. There are many lexical gaps between English WordNet and Tamil WordNet. Moreover, building a WordNet based on Hindi WordNet, which in turn is built on English WordNet, will take many years to complete and would miss the conceptualization depicted in Indian tradition. Apart from this, the extension approach of building Tamil WordNet using Hindi WordNet cannot capture the Dravidian conceptualization. A merger approach of building separate WordNets and collapsing them into one would have been preferable. The present OST tries to overcome the lacunae found in Tamil WordNet. OST is based on the Indian and Dravidian conceptualization, and the process of building one is comparatively very simple. We plan to make it generic so that all the Dravidian languages can be easily accommodated in it. © 2021, Springer Nature Switzerland AG.
Overview of Arnekt IECSIL at Fire-2018 track on information extraction for conversational systems in Indian languages
(CEUR-WS, 2018) Barathi Ganesh, H.; Padannayil, K.P.; Reshma, U.; Kale, M.; Mankame, P.; Kulkarni, G.; Kale, A.; Anand Kumar, M.
This overview paper describes the first shared task on Information Extractor for Conversational Systems in Indian Languages (IECSIL), which was organized at FIRE 2018. Motivated by the need for an Information Extractor, corpora have been developed to perform Named Entity Recognition (Task A) and Relation Extraction (Task B) for five Indian languages (Hindi, Tamil, Malayalam, Telugu and Kannada). Task A is to identify and classify the named entities into one of the many classes, and Task B is to extract the relations among the entities present in the sentences. Altogether, nearly 100 submissions from 10 different teams were evaluated. In this paper, we give an overview of the approaches and also discuss the results that the participating teams attained. © 2018 CEUR-WS. All Rights Reserved.
Overview of the second shared task on Indian native language identification (INLI)
(CEUR-WS, 2018) Anand Kumar, M.; Barathi Ganesh, H.; Ajay, S.G.; Padannayil, K.P.
This overview paper describes the second shared task on Indian Native Language Identification (INLI), which was organized at FIRE 2018. Given a corpus of comments in English from various Facebook newspaper pages, the objective of the task is to identify the native language among the following six Indian languages: Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. Altogether, 31 approaches from 14 different teams were evaluated. In this paper, we give an overview of the participants' systems and the results of the second INLI shared task. We have also compared the results with those of the first INLI shared task, conducted with FIRE-2017. © 2018 CEUR-WS. All Rights Reserved.
Overview of the track on HASOC-offensive Language Identification-DravidianCodeMix
(CEUR-WS, 2020) Chakravarthi, B.R.; Anand Kumar, M.; Mccrae, J.P.; Premjith, B.; Padannayil, K.P.; Mandl, T.
We present the results and main findings of the HASOC-Offensive Language Identification track on code-mixed Dravidian languages. The track featured two tasks. Task 1 is about offensive language identification in the Malayalam language, where the comments were written in both native script and Latin script. Task 2 is about offensive language identification in the Tamil and Malayalam languages, where the comments were written in Latin script (non-native script). For both tasks, given a comment, the participants should develop a system to classify the text as offensive or not-offensive. In total, 96 participants took part and 12 participants submitted papers. In this paper, we present the task, the data and the results, and discuss the system submissions and the methods used by participants. © 2020 Copyright for this paper by its authors.
Semantic Similarity and Paraphrase Identification for Malayalam Using Deep Autoencoders
(Springer Science and Business Media Deutschland GmbH, 2021) Praveena, R.; Anand Kumar, M.; Padannayil, K.P.
In this chapter, we deal with sentence-level paraphrase identification for the Malayalam language. We use a recursive autoencoder architecture for the unsupervised learning of phrase representations to extract features for paraphrase identification. Sentence features of varying lengths are converted to a fixed-size representation using the convolution method of dynamic pooling. Initially, the Malayalam paraphrase identification system was designed to identify paraphrases and non-paraphrases alone, and it was later extended to identify semi-equivalent paraphrases. Along with semantic features, conventional statistical features are further taken into account, resulting in improved system performance. The proposed system was implemented using word2vec embedding and obtained 77.67% accuracy for the two-class system and 66.07% for the three-class system. This chapter also discusses different experiments done for choosing the best parameters and embedding models. © 2021, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Tamil NLP Technologies: Challenges, State of the Art, Trends and Future Scope
(Springer Science and Business Media Deutschland GmbH, 2023) Rajendran, S.; Anand Kumar, M.; Rajalakshmi, R.; Dhanalakshmi, V.; Balasubramanian, P.; Padannayil, K.P.
This paper aims to summarize the NLP-based technological development of the Tamil language. Tamil is one of the Dravidian languages that are serious about technological development. This is reflected in its activities in developing language technology tools and the resources made for technological development. Tamil has successfully developed tools or systems for speech synthesis and recognition, analysis of grammar, semantics and social media text, along with machine translation. Many types of research have been undertaken towards this achievement. Similarly, many activities are developing resources to facilitate technological development. These activities include preparing monolingual, parallel and lexical text corpora, along with speech corpora, lexical resources and grammars. What is needed now is to take stock of the achievements made so far, find out where Tamil stands in the arena of technological development, and look forward to its rapid further development. Computational linguistics in Tamil NLP is gaining more traction, and the various data sets available for research are highlighted in this work for further exploration. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.