Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
15 results
Search Results
Item Attention based Image Captioning using Depth-wise Separable Convolution (Institute of Electrical and Electronics Engineers Inc., 2021) Mallick, V.R.; Naik, D.
Automatically generating descriptions for an image has been one of the trending topics in the field of Computer Vision, since various real-life applications such as self-driving cars and Google image search depend on it. The backbone of this work is the encoder-decoder architecture of deep learning. The basic image captioning model has a CNN as the encoder and an RNN as the decoder. Various deep CNNs such as VGG-16, VGG-19, ResNet, and Inception have been explored, but despite its comparatively better performance, Xception remains uncommon in this field. Similarly, for the decoder, the GRU has not been used much, despite being comparatively faster than the LSTM. Keeping these points in mind, and motivated by the accuracy of Xception and the efficiency of the GRU, we propose an architecture for the image captioning task with Xception as the encoder and a GRU with an attention mechanism as the decoder. © 2021 IEEE.

Item Comparitive Study of GRU and LSTM Cells Based Video Captioning Models (Institute of Electrical and Electronics Engineers Inc., 2021) Maru, H.; Chandana, T.S.S.; Naik, D.
The video captioning task involves generating descriptive text for the events and objects in videos. It takes a video, which is a sequence of frames, as input and produces one or more sentences (sequences of words) as output. A lot of research has been done in the area of video captioning, most of it based on Long Short-Term Memory (LSTM) units, which avoid the vanishing gradient problem. In this work, we propose a video captioning model using Gated Recurrent Units (GRUs), an attention mechanism, and word embeddings, and compare its functionality and results with traditional models that use LSTMs or Recurrent Neural Networks (RNNs).
We train and test our model on the standard MSVD (Microsoft Research Video Description Corpus) dataset. We use a wide range of performance metrics, including the BLEU score, METEOR score, ROUGE-1, ROUGE-2, and ROUGE-L, to evaluate performance. © 2021 IEEE.

Item Effect of Batch Normalization and Stacked LSTMs on Video Captioning (Institute of Electrical and Electronics Engineers Inc., 2021) Sarathi, V.; Mujumdar, A.; Naik, D.
Integrating visual content with natural language to generate image or video descriptions has been a challenging task for many years. Recent research in image captioning using Long Short-Term Memory (LSTM) has motivated its application to video captioning, where a video is converted into an array of frames, or images, and this array, along with the captions for the video, is used to train the LSTM network to associate the video with sentences. However, very little is known about fine-tuning techniques such as batch normalization or stacked LSTM models in video captioning and how they affect the performance of the model. For this project, we compare the performance of the base model described in [1] against variants with batch normalization and stacked LSTMs, using the base model as our reference. © 2021 IEEE.

Item Image Captioning with Attention Based Model (Institute of Electrical and Electronics Engineers Inc., 2021) Yv, S.S.; Choubey, Y.; Naik, D.
Automatically describing the content of an image is a fundamental problem in Artificial Intelligence that connects computer vision and NLP (Natural Language Processing). In the proposed work, a generative model is presented that combines recent developments in machine learning and computer vision in a deep recurrent architecture that describes the image in natural language phrases. Given a training image, the trained model maximizes the likelihood of the target description sentence.
The efficiency of the model, its accuracy, and the language it learns depend only on the image descriptions, as demonstrated by experiments on several datasets. © 2021 IEEE.

Item Describing Image with Attention based GRU (Institute of Electrical and Electronics Engineers Inc., 2021) Mallick, V.R.; Naik, D.
Generating descriptions for images is a popular research topic. In the encoder-decoder model, a CNN works as the encoder, encoding the image and passing the encoding to an RNN decoder, which generates the image description as natural language sentences. The LSTM is widely used as the RNN decoder. The attention mechanism has also played an important role in this field by enhancing object detection. Inspired by these recent advances in computer vision, we use a GRU in place of an LSTM as the decoder for our image captioning model. We incorporate an attention mechanism with the GRU decoder to improve the precision of the generated captions. The GRU has fewer tensor operations than the LSTM, and hence trains faster. © 2021 IEEE.

Item Multi-stream Multi-attention Deep Neural Network for Context-Aware Human Action Recognition (Institute of Electrical and Electronics Engineers Inc., 2022) Rashmi, M.; Guddeti, R.M.R.
Technological innovations in deep learning models have enabled reasonably close solutions to a wide variety of computer vision tasks such as object detection and face recognition. On the other hand, Human Action Recognition (HAR) is still far from human-level ability due to several challenges, such as diversity in how actions are performed. Because data are available in multiple modalities, HAR using video data recorded by RGB-D cameras is frequently used in current research. This paper proposes an approach for recognizing human actions using depth and skeleton data captured with the Kinect depth sensor.
Attention modules have been introduced in recent years to help focus on the most important features in computer vision tasks. This paper proposes a multi-stream deep learning model with multiple attention blocks for HAR. First, the action data from the depth and skeletal modalities are represented using two distinct action descriptors, each of which generates an image from the action data gathered across numerous frames. The proposed deep learning model is trained using these descriptors. Additionally, we propose a set of score fusion techniques for accurate HAR using all the features and trained CNN + LSTM streams. The proposed method is evaluated on two benchmark datasets using the well-known cross-subject evaluation protocol, achieving 89.83% and 90.7% accuracy on the MSRAction3D and UTD-MHAD datasets, respectively. The experimental results establish the validity and effectiveness of the proposed model. © 2022 IEEE.

Item LSTM-Attention Architecture for Online Bilingual Sexism Detection (CEUR-WS, 2023) Ravi, S.; Kelkar, S.; Anand Kumar, M.
This paper describes the results submitted by ‘Team-SMS’ at EXIST 2023. The task organizers provided a dataset of 6920 tweets for training, 1038 for validation, and 2076 for testing. Our models include LSTM models with and without attention layers. To calculate the soft scores required by the task, we mimic human performance by averaging the predictions of different machine learning models: Multinomial Naive Bayes, Linear Support Vector Classifier, Multi-Layer Perceptron, XGBoost, an LSTM using GloVe embeddings, and an LSTM using fastText embeddings. We discuss our approach to removing ambiguity in the labeling process and give a detailed description of our work.
© 2023 Copyright for this paper by its authors.

Item Multi-Res-Attention UNet: A CNN Model for the Segmentation of Focal Cortical Dysplasia Lesions from Magnetic Resonance Images (Institute of Electrical and Electronics Engineers Inc., 2021) Thomas, E.; Pawan, S.J.; Kumar, S.; Horo, A.; Niyas, S.; Vinayagamani, S.; Kesavadas, C.; Rajan, J.
In this work, we focus on the segmentation of Focal Cortical Dysplasia (FCD) regions from MRI images. FCD is a congenital malformation of brain development that is considered the most common cause of intractable epilepsy in adults and children. To our knowledge, the latest work on the automatic segmentation of FCD used a fully convolutional neural network (FCN) model based on UNet. While the model outperformed conventional image processing techniques by a considerable margin, it suffers from several pitfalls. First, it does not account for the large semantic gap between the feature maps passed from the encoder to the decoder layer through the long skip connections. Second, it fails to leverage the salient features that represent complex FCD lesions and to suppress the irrelevant features in the input sample. We propose Multi-Res-Attention UNet, a novel hybrid skip-connection-based FCN architecture that addresses these drawbacks. We trained it from scratch for the detection of FCD from 3 T MRI 3D FLAIR images and conducted 5-fold cross-validation to evaluate the model. An FCD detection rate (recall) of 92% was achieved in patient-wise analysis. © 2013 IEEE.

Item Video summarization and captioning using dynamic mode decomposition for surveillance (Springer Science and Business Media B.V., 2021) Radarapu, R.; Gopal, A.S.S.; Nh, M.; Anand Kumar, M.
Video surveillance has become a major tool in security maintenance.
However, analyzing a playback to detect motion can be tedious, because motion typically occurs in only a short portion of the video. Much time is wasted analyzing the footage, and it is nearly impossible to find the exact frame where a transition occurs. There is therefore a need for a summary video that captures any changes or motion. With advancements in image processing using OpenCV and deep learning, video summarization is no longer an intractable task. Captions are generated for the summarized videos using an encoder–decoder captioning model. With the help of large, well-labeled video datasets such as Common Objects in Context and the Microsoft Video Description corpus, video captioning is a feasible task. Encoder–decoder models have been used extensively to generate text from visual features since the advent of long short-term memory (LSTM), and the attention mechanism has been widely used in the decoder for video captioning. Keyframes are obtained from very long videos using methods such as dynamic mode decomposition, an algorithm from fluid dynamics, and OpenCV’s absdiff(). We propose these tools for motion detection and video/image captioning for the very long videos that are common in video surveillance. © 2021, Bharati Vidyapeeth's Institute of Computer Applications and Management.

Item Spatiotemporal Assessment of Satellite Image Time Series for Land Cover Classification Using Deep Learning Techniques: A Case Study of Reunion Island, France (MDPI, 2022) Navnath, N.N.; Chandrasekaran, K.; Stateczny, A.; Sundaram, V.M.; Prabhavathy, P.
Current Earth observation systems generate massive amounts of satellite image time series (SITS) to keep track of geographical areas over time and to monitor and identify environmental and climate change. Efficiently analyzing such data remains an unresolved issue in remote sensing.
In classifying land cover, utilizing SITS rather than a single image can help differentiate classes because of their varied temporal patterns. The aim was to forecast the land cover class of a group of pixels, given their time series gathered from satellite images, as a multi-class single-label classification problem. In this article, we exploit SITS to assess the capability of several spatial and temporal deep learning models within the proposed architecture. The models implemented are the bidirectional gated recurrent unit (GRU), temporal convolutional neural networks (TCNN), GRU + TCNN, attention on TCNN, and attention on GRU + TCNN. The proposed architecture integrates univariate and multivariate inputs and pixel coordinates for the land cover classification (LCC) of Reunion Island. The evaluation of the proposed architecture with deep neural networks on the test dataset determined that blending univariate and multivariate inputs with a recurrent neural network and pixel coordinates achieved increased accuracy, with higher F1 scores for each class label. The results suggest that the models also performed exceptionally well when executed in a partitioned manner for the LCC task compared to the temporal models. This study demonstrates that deep learning approaches paired with spatiotemporal SITS data address the difficult task of cost-effectively classifying land cover, contributing to a sustainable environment. © 2022 by the authors.
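A recurring ingredient across the captioning, action recognition, and land cover works listed above is an attention mechanism that weights encoder features against the decoder's current state. As a rough illustration only (not taken from any of these papers — all shapes, weight matrices, and names here are invented for the example), a minimal NumPy sketch of Bahdanau-style additive attention:

```python
import numpy as np

def additive_attention(features, hidden, W1, W2, v):
    """Score each encoder feature vector against the current decoder
    hidden state, then return the softmax-weighted context vector."""
    # features: (num_regions, feat_dim); hidden: (hid_dim,)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    weights = np.exp(scores - scores.max())             # stable softmax
    weights /= weights.sum()
    context = weights @ features                        # (feat_dim,)
    return context, weights

# Toy, randomly initialized setup purely for illustration.
rng = np.random.default_rng(0)
feat_dim, hid_dim, attn_dim, num_regions = 8, 6, 5, 4
features = rng.standard_normal((num_regions, feat_dim))
hidden = rng.standard_normal(hid_dim)
W1 = rng.standard_normal((feat_dim, attn_dim))
W2 = rng.standard_normal((hid_dim, attn_dim))
v = rng.standard_normal(attn_dim)

context, weights = additive_attention(features, hidden, W1, W2, v)
print(weights)  # attention weights over the encoder regions; they sum to 1
```

In a full captioning decoder the resulting context vector would be concatenated with the word embedding fed to the GRU or LSTM cell at each decoding step; this sketch shows only the scoring-and-pooling step common to the attention variants mentioned in the abstracts.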
