Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

Search Results

Now showing 1 - 10 of 16
  • Item
    A comparative analysis of machine comprehension using deep learning models in code-mixed Hindi language
    (Springer Verlag, 2019) Viswanathan, S.; Anand Kumar, M.; Padannayil, K.P.
    The domain of artificial intelligence is revolutionizing the way humans interact with machines. Machine comprehension is one of the latest fields in natural language processing and holds potential for large improvements in artificial intelligence. It gives systems the ability to understand a passage supplied by a user and answer questions about it, an evolution of traditional question answering. Machine comprehension falls under natural language understanding and exposes how much understanding a model needs in order to locate the region of interest in a passage. The scope for applying this technique in India is very high because of the many regional languages in use. This work focuses on applying machine comprehension to the code-mixed Hindi language. It presents a detailed comparison of the performance of several deep learning approaches on the dataset, including End to End Memory Network, Dynamic Memory Network, Recurrent Neural Network, Long Short-Term Memory Network and Gated Recurrent Unit. The model best suited to the dataset is identified from this comparison. A new architecture is proposed by combining two of the best-performing networks. To make the model more robust to the various ways of answering questions from a passage, distributed word representations were applied to the best model identified; specifically, the model was improved by using pre-trained fastText embeddings. This is the first implementation of machine comprehension models in code-mixed Hindi using deep neural networks. The work analyses the performance of all five models implemented, which will be helpful for future research on machine comprehension in code-mixed Indian languages. © Springer Nature Switzerland AG 2019.
  • Item
    Embedding linguistic features in word embedding for preposition sense disambiguation in English-Malayalam machine translation context
    (Springer Verlag, 2019) Premjith, B.; Padannayil, K.P.; Anand Kumar, M.; Jyothi Ratnam, D.
    Preposition sense disambiguation is highly significant in natural language processing tasks such as machine translation. Transferring the various senses of a simple preposition in the source language to a set of senses in the target language is complex because of the many-to-many relationships involved, particularly in English-Malayalam machine translation. To reduce this complexity, in this paper we use linguistic information, namely noun class features and verb class features of the noun and verb associated with the target simple preposition. The effect of these linguistic features on the classification of the senses (postpositions in Malayalam) is studied with several machine learning algorithms. The study showed that classification accuracy is higher when both verb and noun class features are taken into consideration. In linguistics, the major factor that decides the sense of a preposition is the noun in the prepositional phrase. The same trend was observed in the study when the training data contained only noun class features; that is, noun class features dominate verb class features. © Springer Nature Switzerland AG 2019.
  • Item
    Deep learning architecture for big data analytics in detecting intrusions and malicious URL
    (Institution of Engineering and Technology, 2019) Harikrishnan, N.B.; Ravi, R.; Padannayil, K.P.; Poornachandran, P.; Annappa, A.; Alazab, M.
    Security attacks are one of the major threats in today’s world. These attacks exploit vulnerabilities in systems or online sites for financial gain, causing huge losses in revenue and reputation for both government and private firms. They are generally carried out through malware interception, intrusions and phishing uniform resource locators (URLs). Techniques such as signature-based detection, anomaly detection and stateful protocol analysis are used to detect intrusions, and blacklisting is used to detect phishing URLs. Even though these techniques claim to thwart cyberattacks, they often fail to detect new attacks or variants of existing attacks. A second reason these techniques fail is the dynamic nature of attacks and the lack of annotated data. In such a situation, we need a system that can capture the changing trends of cyberattacks to some extent. For this, we used supervised and unsupervised learning techniques. The growing problem of intrusions and phishing URLs creates a need for a reliable architecture-based solution that can efficiently identify both. This chapter aims to provide a comprehensive survey of intrusion and phishing URL detection techniques and deep learning. It presents and evaluates a highly effective deep learning architecture to automate intrusion and phishing URL detection. The proposed method is an artificial intelligence (AI)-based hybrid architecture for an organization that provides supervised and unsupervised solutions for intrusion and phishing URL detection. The prototype model uses various classical machine learning (ML) classifiers and deep learning architectures. The research specifically focuses on detecting and classifying intrusions and phishing URLs. © The Institution of Engineering and Technology 2020.
  • Item
    Extraction of named entities from social media text in Tamil language using N-gram embedding for disaster management
    (Springer Verlag, 2020) Remmiya Devi, G.R.; Anand Kumar, M.A.; Padannayil, K.P.
    In the present era, data in any form is considered highly important. Text data in particular carries richer and more concise information than other forms of data. Extracting and analysing it can yield new findings through text analytics. This has led to applications such as search engines, extraction of product names, sentiment analysis, document classification and a few more. Companies focus on sentiment analysis to review the positive, negative and neutral comments on their products. Text summarization is a notable application of Natural Language Processing that reveals the gist of documents. Apart from these, applications based on information extraction can be developed for the welfare of society. Handling an emergency requires collecting a vast amount of information, and extracting such data can support disaster management. To perform such a task, a system must learn the meaning of human languages. Easing access to text data across language barriers is the primary motive of Natural Language Processing (NLP) systems. The proposed system uses a word embedding model, specifically the skip-gram model, to implement the most fundamental task of NLP, entity extraction, on social media text. N-gram embedding methods provided rich context knowledge for the system to handle social media text. Named entities were classified using a Support Vector Machine (SVM) classifier. © Springer Nature Switzerland AG 2020.
  • Item
    MedNLU: Natural Language Understander for Medical Texts
    (Springer Science and Business Media Deutschland GmbH, 2020) Barathi Ganesh, H.B.; Reshma, U.; Padannayil, K.P.; Anand Kumar, M.
    Natural Language Understanding is one of the essential tasks for building clinical text-based applications. Understanding of these clinical texts can be achieved through Vector Space Models and Sequential Modelling tasks. This paper focuses on sequential modelling, i.e. Named Entity Recognition and Part of Speech Tagging, attaining state-of-the-art performance with an F1 score of 93.8% on the i2b2 clinical corpus and 97.29% on the GENIA corpus. It also reports the performance of feature fusion, which integrates word embedding, feature embedding and character embedding for sequential modelling tasks. We also propose a framework based on a sequential modelling architecture, named MedNLU, which is capable of performing Part of Speech Tagging, Chunking, and Entity Recognition on clinical texts. The sequence modeler in MedNLU is an integrated framework of Convolutional Neural Network, Conditional Random Fields and Bi-directional Long Short-Term Memory network. © 2020, Springer Nature Switzerland AG.
  • Item
    Ontological Structure-Based Retrieval System for Tamil
    (Springer Science and Business Media Deutschland GmbH, 2021) Rajendran, S.; Padannayil, K.P.; Anand Kumar, M.; Sankaralingam, C.
    Ontological structure of Tamil (OST) is the outcome of extensive research in the lexical semantics of Tamil over the last three decades. Rajendran’s (Semantic structure of Tamil vocabulary. Report of the UGC sponsored postdoctoral work (in manuscript). Deccan College Post-Doctoral Research Institute, Pune, 1983) post-doctoral research work went through several stages before culminating in OST. It traces the journey from Tamil thesaurus to Tamil WordNet and on to OST. OST is a lexical resource that amalgamates all the information available in a dictionary, thesaurus and WordNet. The Dravidian WordNets (of which Tamil WordNet is one of the four components) built under the Indo-WordNet project depended on an ontology developed from the Western conceptualization of the world found in English. This did not take into consideration the Indian conceptualization of the world depicted in the nikhandu tradition. There are many lexical gaps between English WordNet and Tamil WordNet. Moreover, building a WordNet based on Hindi WordNet, which in turn is built on English WordNet, will take many years to complete and would miss the conceptualization depicted in Indian tradition. Apart from this, the extension approach of building Tamil WordNet from Hindi WordNet cannot capture the Dravidian conceptualization. A merger approach, building separate WordNets and collapsing them into one, would have been preferable. The present OST tries to overcome the lacunae found in Tamil WordNet. OST is based on the Indian and Dravidian conceptualization, and the process of building it is comparatively simple. We plan to make it generic so that all the Dravidian languages can be easily accommodated in it. © 2021, Springer Nature Switzerland AG.
  • Item
    Semantic Similarity and Paraphrase Identification for Malayalam Using Deep Autoencoders
    (Springer Science and Business Media Deutschland GmbH, 2021) Praveena, R.; Anand Kumar, M.; Padannayil, K.P.
    In this chapter, we deal with sentence-level paraphrase identification for the Malayalam language. We use a recursive autoencoder architecture for the unsupervised learning of phrase representations to extract features for paraphrase identification. Sentence features of varying lengths are converted to a fixed-size representation using the convolution method of dynamic pooling. Initially, the Malayalam paraphrase identification system was designed to identify paraphrases and non-paraphrases alone; it was later extended to identify semi-equivalent paraphrases. Along with semantic features, conventional statistical features are also taken into account, resulting in improved system performance. The proposed system was implemented using word2vec embedding and obtained 77.67% accuracy for the two-class system and 66.07% for the three-class system. This chapter also discusses the experiments done to choose the best parameters and embedding models. © 2021, The Author(s), under exclusive license to Springer Nature Switzerland AG.
  • Item
    Overview of Arnekt IECSIL at FIRE-2018 track on information extraction for conversational systems in Indian languages
    (CEUR-WS, 2018) Barathi Ganesh, H.; Padannayil, K.P.; Reshma, U.; Kale, M.; Mankame, P.; Kulkarni, G.; Kale, A.; Anand Kumar, M.
    This overview paper describes the first shared task on Information Extractor for Conversational Systems in Indian Languages (IECSIL), which was organized by FIRE 2018. Motivated by the need for an information extractor, corpora have been developed for Named Entity Recognition (Task A) and Relation Extraction (Task B) in five Indian languages (Hindi, Tamil, Malayalam, Telugu and Kannada). Task A is to identify the named entities and classify them into one of many classes, and Task B is to extract the relations among the entities present in the sentences. Altogether, nearly 100 submissions from 10 different teams were evaluated. In this paper, we give an overview of the approaches and discuss the results that the participating teams attained. © 2018 CEUR-WS. All Rights Reserved.
  • Item
    Overview of the second shared task on Indian native language identification (INLI)
    (CEUR-WS, 2018) Anand Kumar, M.; Barathi Ganesh, H.; Ajay, S.G.; Padannayil, K.P.
    This overview paper describes the second shared task on Indian Native Language Identification (INLI) that was organized by FIRE 2018. Given a corpus of comments in English from various Facebook newspaper pages, the objective of the task is to identify the commenter's native language among the following six Indian languages: Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. Altogether, 31 approaches from 14 different teams were evaluated. In this paper, we give an overview of the participants’ systems and the results of the second INLI shared task. We also compare the results with those of the first INLI shared task, conducted at FIRE-2017. © 2018 CEUR-WS. All Rights Reserved.
  • Item
    Indian native language identification - INLI 2018
    (Association for Computing Machinery, 2018) Anand Kumar, M.; Barathi Ganesh, H.B.; Padannayil, K.P.; Ajay, S.G.
    The growth of digital platforms enables industries to offer user-specific services. Most of the time, information about internet users is not explicitly available, which acts as a constraint in developing personalized applications. Hence the need for author profiling tasks, which aim to predict internet users’ characteristics from their texts. Native language identification is one such author profiling task: it predicts an author’s native language from texts written in another language. We proposed the Indian Native Language Identification task, in which the internet users’ texts are written in English and participants need to determine whether the user’s native language is Tamil, Malayalam, Kannada, Telugu, Bengali or Hindi. The corpus was collected from regional newspaper pages on Facebook, under the hypothesis that a user belonging to a particular region will read news from the respective regional newspaper. © 2018 Association for Computing Machinery.
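
Illustrative note on the Malayalam paraphrase identification entry above: that abstract says sentence features of varying lengths are converted to a fixed-size representation by dynamic pooling. The sketch below shows one common way this can be done, and it is not the authors' implementation. It assumes the variable-size input is a matrix of pairwise distances between the phrases of two sentences and that each region of the grid is reduced with `min`; the names `dynamic_pool` and `_spans` are hypothetical, as is the sample data.

```python
from typing import Callable, Iterable, List, Sequence


def _spans(length: int, bins: int) -> List[range]:
    """Split indices 0..length-1 into `bins` contiguous, non-empty spans.

    When length < bins, some indices are reused so no span is empty.
    """
    spans = []
    for i in range(bins):
        start = (i * length) // bins
        stop = max(((i + 1) * length) // bins, start + 1)
        spans.append(range(start, stop))
    return spans


def dynamic_pool(
    matrix: Sequence[Sequence[float]],
    size: int,
    reduce: Callable[[Iterable[float]], float] = min,
) -> List[List[float]]:
    """Pool a rows x cols matrix down (or up) to a size x size matrix.

    Assumption: `min` is used as the reducer because the entries are
    treated as distances; pass `max` for similarity scores instead.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    if not matrix or not matrix[0]:
        raise ValueError("matrix must be non-empty")
    row_spans = _spans(len(matrix), size)
    col_spans = _spans(len(matrix[0]), size)
    return [
        [reduce(matrix[r][c] for r in rs for c in cs) for cs in col_spans]
        for rs in row_spans
    ]


# Hypothetical 3 x 4 distance matrix for a 3-phrase and a 4-phrase sentence.
distances = [
    [0.1, 0.9, 0.8, 0.7],
    [0.6, 0.2, 0.5, 0.9],
    [0.8, 0.7, 0.3, 0.4],
]
pooled = dynamic_pool(distances, 2)
print(pooled)  # [[0.1, 0.7], [0.2, 0.3]]
```

Whatever the lengths of the two sentences, the output always has `size` rows and `size` columns, so it can be fed to a classifier that expects a fixed-length input.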