Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

Search Results

Now showing 1 - 10 of 23

Gaining Actionable Insights in COVID-19 Dataset Using Word Embeddings
(Springer Science and Business Media Deutschland GmbH, 2022) Jha, R.A.; Ananthanarayana, V.S.
The field of unsupervised natural language processing (NLP) is gradually growing in prominence and popularity due to the overwhelming amount of scientific and medical data available as text, such as published journals and papers. To make use of this data, several techniques are used to extract information from these texts. In this paper, we make use of the COVID-19 corpus (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) related to the deadly coronavirus, SARS-CoV-2, to extract useful information which can be invaluable in finding a cure for the disease. We make use of two word-embedding models, Word2Vec and global vectors for word representation (GloVe), to efficiently encode all the information available in the corpus. We then follow some simple steps to find possible cures for the disease. We obtained useful results using these word-embedding models, and we also observed that the Word2Vec model performed better than the GloVe model on the dataset used. Another point highlighted by this work is that latent information about potential future discoveries is significantly contained in past papers and publications. © 2022, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Automatic identification and ranking of emergency AIDS in social media macro community
(CEUR-WS, 2017) Gautam, B.; Annappa, B.
Online social microblogging platforms, including Twitter, are increasingly used for aiding relief operations during disaster events. During most calamities, whether natural disasters or armed attacks, non-governmental organizations look for critical information about resources to support affected people. Despite the recent advancement of natural language processing with deep neural networks, retrieval and ranking of short text remains a challenging task because a lot of conversational and sympathy content is merged with the critical information. In this paper, we address the problem of categorical information retrieval and ranking of the most relevant information while considering the presence of short-text and multilingual content that arises during such events. Our proposed model is based on the formation of embedding vectors with the help of textual and statistical preprocessing, and finally, the entire training set of 2,100,000 vectors was normalized using a feed-forward neural network for need and availability tweets. Another important contribution of this paper lies in a novel weighted Ranking Key algorithm based on the top five general terms to rank the classified tweets in order of relevance to their classification. Lastly, we test our model on the Nepal Earthquake dataset (which contains short-text and multilingual tweets) and achieve 6.81% mean average precision on 5,250,000 unlabeled embedding vectors of disaster relief tweets.
When and where?: Behavior dominant location forecasting with micro-blog streams
(IEEE Computer Society, 2018) Gautam, B.; Annappa, B.; Singh, A.; Agrawal, A.
The proliferation of smartphones and wearable devices has increased the availability of large amounts of geospatial streams to provide significant automated discovery of knowledge in pervasive environments, but the most prominent information related to altering interests has not yet been adequately capitalized on. In this paper, we provide a novel algorithm to exploit the dynamic fluctuations in a user's point-of-interest while forecasting the future place of visit with fine granularity. Our proposed algorithm is based on the dynamic formation of collective personality communities using different languages, opinions, geographical and temporal distributions for finding optimized equivalent content. We performed extensive empirical experiments involving real-time streams derived from 0.6 million stream tuples of micro-blog comprising 1945 social person fusion with graph algorithm and feed-forward neural network model as a predictive classification model. Lastly, the framework achieves 62.10% mean average precision on 1,20,000 embeddings on unlabeled users and, surprisingly, an 85.92% increment over the state-of-the-art approach. © 2018 IEEE.
Detecting Semantic Similarity of Documents Using Natural Language Processing
(Elsevier B.V., 2021) Agarwala, S.; Anagawadi, A.; Reddy Guddeti, R.M.
The similarity of documents in natural languages can be judged based on how similar the embeddings corresponding to their textual content are. Embeddings capture the lexical and semantic information of texts, and they can be obtained through bag-of-words approaches using the embeddings of constituent words or through pre-trained encoders. This paper examines various existing approaches to obtain embeddings from texts, which are then used to detect similarity between them. A novel model which builds upon the Universal Sentence Encoder is also developed to do the same. The explored models are tested on the SICK dataset, and the correlation between the ground truth values given in the dataset and the predicted similarity is computed using the Pearson, Spearman and Kendall's Tau correlation metrics. Experimental results demonstrate that the novel model outperforms the existing approaches. Finally, an application is developed using the novel model to detect semantic similarity between a set of documents. © 2021 Elsevier B.V. All rights reserved.
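As an illustration of the bag-of-words route this abstract mentions, the following is a minimal sketch, not the authors' model: a document vector is taken as the mean of its word vectors, and two documents are compared by cosine similarity. The names `mean_vector`, `cosine_similarity`, `document_similarity` and the two-dimensional `TOY_VECTORS` table are hypothetical, chosen only for this example; a real system would load pre-trained embeddings instead.

```python
import math

# Toy 2-dimensional word vectors (an assumption for illustration only).
TOY_VECTORS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}


def mean_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def document_similarity(doc_a, doc_b, word_vectors=TOY_VECTORS):
    """Compare two whitespace-tokenized documents via mean word vectors."""
    vec_a = mean_vector(doc_a.lower().split(), word_vectors)
    vec_b = mean_vector(doc_b.lower().split(), word_vectors)
    return cosine_similarity(vec_a, vec_b)
```

With these toy vectors, `document_similarity("cat", "dog")` is higher than `document_similarity("cat", "car")`, which is the ordering one would expect from embeddings that place related words close together.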
Impact of Vector Embeddings on the Performance of Tolerance Near Sets-based Sentiment Classifier for Text Classification
(Elsevier B.V., 2023) Hegde, T.; Sanjay, K.S.; Thomas, S.M.; Kambhammettu, R.; Anand Kumar, M.; Ramanna, S.
In recent years, Natural Language Processing (NLP) has gained significant attention, and sentiment analysis is an essential subfield of NLP that deals with identifying the sentiment or emotion conveyed in the text. Tolerance near sets (TNS) is a mathematical framework that has shown promising results in sentiment analysis tasks. However, the choice of word embeddings can significantly impact the performance of TNS-based classifiers. This paper investigates the impact of using different embeddings on the performance of tolerance near sets-based sentiment classifiers. This paper compares the use of different embeddings, including DistilBERT, MiniLM, and Word Embeddings, and their combinations, to understand their impact on TNS-based sentiment analysis. The TSC 2.0 model proposed in this paper achieves a weighted F1 score of 92.1% in one of the datasets, an improvement due to the sentence embeddings used. Experimental results have led to the observation that tie-breaking and variance-based classification may have led to a noticeable improvement in cases with more than three. © 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
fastText-Based Siamese Network for Hindi Semantic Textual Similarity
(Springer Science and Business Media Deutschland GmbH, 2025) Chandrashekar, A.; Rushad, M.; Nambiar, A.; Rashmi, V.; Koolagudi, S.G.
Semantic textual similarity is a measurement of the degree of semantic similarity or equivalence between two sentences. Semantically similar sentence pairs can substitute text from each other and retain their meaning. Various rule-based and machine learning models have gained quick prominence in the field, especially in a language like English, where there is an abundance of lexical tools and resources. However, other languages like Hindi have not seen much improvement in state-of-the-art methods and models to evaluate semantic similarity of text data. This paper proposes a fastText-based Siamese neural network architecture to evaluate the semantic equivalence between a Hindi sentence pair. The pair is scored on a scale of 0–5, where 0 indicates least similar and 5 indicates most similar. The corpus contains a combination of two datasets containing manually scored sentence pairs. The performance parameters used to evaluate this approach are model accuracy and model loss over a training period of multiple epochs. The proposed architecture incorporates a fastText-based embedding layer and a bi-directional Long Short Term Memory layer to achieve a similarity score. The proposed architecture can extract semantic and various global features of the text to determine a similarity score. This model achieves an accuracy of 85.5% on a compiled Hindi-Hindi sentence pair dataset, which is a considerable improvement over existing rule-based and supervised systems. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
Antibiofouling hollow-fiber membranes for dye rejection by embedding chitosan and silver-loaded chitosan nanoparticles
(Springer Verlag, 2019) Kolangare, I.M.; Isloor, A.M.; Zulhairun, Z.A.; Kulal, A.; A.F., A.F.; Siddique, I.; Asiri, A.M.
The removal of toxic dyes from wastewater and industrial effluents is a major environmental challenge. Various techniques have been employed for the removal of dyes, including the application of nano-sized adsorbents, nanocomposite membranes and photodegradation. Membrane filtration is an alternative but suffers from drawbacks such as fouling. Here we present a simple approach for the development of antibiofouling membranes based on chitosan. The application of chitosan-based nanoparticles as additives for wastewater treatment is poorly explored. The chitosan and silver-loaded chitosan nanoparticles were synthesized by the ionic gelation method and incorporated to fabricate hollow-fiber membranes by the dry–wet spinning technique. The prepared membranes were characterized by morphological study, permeability test, antibiofouling study and dye rejection study. The nanocomposite hollow-fiber membranes displayed superior performance to their pristine form. The incorporation of 0.30 weight percent of the chitosan and silver-loaded chitosan nanoparticles into the hollow-fiber membranes enhanced the antifouling property with flux recovery ratios of 81.21 and 86.13%, respectively. The dye rejection results showed maximum rejection of 89.27 and 86.04% for Reactive Black 5 and Reactive Orange 16, respectively. Hence, it can be concluded that hollow-fiber membranes with silver-loaded chitosan nanoparticles are pertinent in developing antibiofouling membranes for the treatment of industrial dye effluents. © 2018, Springer Nature Switzerland AG.
Enhanced protein structural class prediction using effective feature modeling and ensemble of classifiers
(Institute of Electrical and Electronics Engineers Inc., 2021) Bankapur, S.; Patil, N.
Protein Secondary Structural Class (PSSC) information is important in investigating further challenges of protein sequences like protein fold recognition, protein tertiary structure prediction, and analysis of protein functions for drug discovery. Identification of PSSC using biological methods is time-consuming and cost-intensive. Several computational models have been developed to predict the structural class; however, they lack generalization. Hence, predicting PSSC based on protein sequences is still proving to be an uphill task. In this article, we proposed an effective, novel and generalized prediction model consisting of feature modeling and an ensemble of classifiers. The proposed feature modeling extracts discriminating information (features) by leveraging three techniques: (i) Embedding – features are extracted on the basis of spatial residue arrangements of the sequences using word embedding approaches; (ii) SkipXGram Bi-gram – various sets of skipped bi-gram features are extracted from the sequences; and (iii) General Statistical (GS) based features are extracted which cover the global information of structural sequences. The combined effective sets of features are trained and classified using an ensemble of three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM). The proposed model, when assessed on five benchmark datasets (high and low sequence similarity), viz. z277, z498, 25PDB, 1189, and FC699, reported an overall accuracy of 93.55, 97.58, 81.82, 81.11, and 93.93 percent respectively. The proposed model is further validated on a large-scale updated low similarity (≤25%) dataset, where it achieved an overall accuracy of 81.11 percent. The proposed generalized model is robust and consistently outperformed several state-of-the-art models on all the five benchmark datasets. © 2021 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
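The skipped bi-gram idea named in this abstract can be sketched as follows. This is an illustrative reading, not the paper's exact SkipXGram definition, which the abstract does not give: a skipped bi-gram with skip `k` is assumed here to pair each symbol with the symbol `k + 1` positions later. The function name `skip_bigram_counts` and the example string are hypothetical.

```python
from collections import Counter


def skip_bigram_counts(sequence, skip):
    """Count pairs (sequence[i], sequence[i + skip + 1]) over a string.

    skip=0 gives ordinary adjacent bi-grams; skip=1 ignores one symbol
    between the two members of each pair, and so on.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    return Counter(
        sequence[i] + sequence[i + step]
        for i in range(len(sequence) - step)
    )


# Example on a made-up four-symbol sequence.
adjacent = skip_bigram_counts("HEHC", 0)   # pairs: HE, EH, HC
skipped = skip_bigram_counts("HEHC", 1)    # pairs: HH, EC
```

Counting several skip values and concatenating the resulting count vectors is one simple way to obtain "various sets" of such features for a downstream classifier.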
Application of word embedding and machine learning in detecting phishing websites
(Springer, 2022) Rao, R.S.; Umarekar, A.; Pais, A.R.
Phishing is an attack whose aim is to gain personal information such as passwords, credit card details etc. from online users by deceiving them through fake websites, emails or any legitimate internet service. There exist many techniques to detect phishing sites, such as third-party based techniques, source code based methods and URL based methods, but users are still being trapped into revealing their sensitive information. In this paper, we propose a new technique which detects phishing sites with word embeddings using plain text and domain specific text extracted from the source code. We applied various word embeddings for the evaluation of our model using ensemble and multimodal approaches. From the experimental evaluation, we observed that multimodal with domain specific text achieved a significant accuracy of 99.34% with TPR of 99.59%, FPR of 0.93%, and MCC of 98.68%. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Classification of Phishing Email Using Word Embedding and Machine Learning Techniques
(River Publishers, 2022) Somesha, M.; Pais, A.R.
Email phishing is a cyber-attack, bringing substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spamming, used to trick the user into disclosing personal information that gives access to his digital assets. A phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that the existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence there is a need for an efficient mechanism to detect phishing emails to provide better security against such attacks to the common user. The existing open-source data sets are limited in diversity, hence they do not capture the real picture of the attack. Hence there is a need for a real-time input data set to design accurate email anti-phishing solutions. In the current work, a real-time in-house corpus of phishing and legitimate emails has been created, and efficient techniques are proposed to detect phishing emails using word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics for the classification of emails. The proposed word embedding cum machine learning framework comprises six word embedding techniques with five machine learning classifiers to evaluate the best performing combination. Among all six combinations, Random Forest consistently performed the best with FastText (CBOW) by achieving an accuracy of 99.50% with a false positive rate of 0.053%, TF-IDF achieved an accuracy of 99.39% with a false positive rate of 0.4% and Count Vectorizer achieved an accuracy of 99.18% with a false positive rate of 0.98% respectively for three datasets used. © 2022 River Publishers.
