Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

  • Item
    Attention based Image Captioning using Depth-wise Separable Convolution
    (Institute of Electrical and Electronics Engineers Inc., 2021) Mallick, V.R.; Naik, D.
    Automatically generating descriptions for an image has been one of the trending topics in Computer Vision, since various real-life applications such as self-driving cars and Google image search depend on it. The backbone of this work is the encoder-decoder architecture of deep learning: the basic image captioning model uses a CNN as the encoder and an RNN as the decoder. Various deep CNNs such as VGG-16, VGG-19, ResNet, and Inception have been explored, but despite its comparatively better performance, Xception remains uncommon in this field. Similarly, the GRU has seen little use as a decoder, despite being comparatively faster than the LSTM. Motivated by the accuracy of Xception and the efficiency of the GRU, we propose an architecture for the image captioning task with Xception as the encoder and a GRU with an attention mechanism as the decoder. © 2021 IEEE.
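The attention mechanism the abstract describes can be sketched as a soft-attention step over the encoder's spatial features. This is a minimal NumPy sketch of generic additive (Bahdanau-style) attention, not the authors' exact implementation; all dimensions and weight names (`W1`, `W2`, `v`) are illustrative assumptions.

```python
import numpy as np

def additive_attention(features, hidden, W1, W2, v):
    """Additive attention over encoder feature vectors.

    features: (num_regions, feat_dim) spatial features from the CNN encoder
    hidden:   (hid_dim,) current decoder (GRU) hidden state
    Returns the context vector fed to the decoder and the attention weights.
    """
    # score each spatial region against the current decoder state
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over regions
    context = weights @ features                        # (feat_dim,)
    return context, weights

# illustrative dimensions only
rng = np.random.default_rng(0)
num_regions, feat_dim, hid_dim, att_dim = 64, 2048, 512, 256
features = rng.standard_normal((num_regions, feat_dim))
hidden = rng.standard_normal(hid_dim)
W1 = rng.standard_normal((feat_dim, att_dim)) * 0.01
W2 = rng.standard_normal((hid_dim, att_dim)) * 0.01
v = rng.standard_normal(att_dim) * 0.01
context, weights = additive_attention(features, hidden, W1, W2, v)
```

At each decoding step the context vector re-weights the image regions, letting the GRU focus on different parts of the image per generated word.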
  • Item
    Comparitive Study of GRU and LSTM Cells Based Video Captioning Models
    (Institute of Electrical and Electronics Engineers Inc., 2021) Maru, H.; Chandana, T.S.S.; Naik, D.
    The video captioning task involves generating descriptive text for the events and objects in a video: it takes a video, which is a sequence of frames, as input and returns one or more sentences (sequences of words) to the user. A great deal of research has been done in this area, most of it based on Long Short-Term Memory (LSTM) units, which avoid the vanishing gradient problem. In this work, we propose a video captioning model using Gated Recurrent Units (GRUs), an attention mechanism, and word embeddings, and compare its behaviour and results with traditional models that use LSTMs or Recurrent Neural Networks (RNNs). We train and test our model on the standard MSVD (Microsoft Research Video Description Corpus) dataset and evaluate performance with a wide range of metrics: BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L. © 2021 IEEE.
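One concrete reason a GRU-based model can train faster than an LSTM-based one, as the comparison above examines, is parameter count: a GRU cell has three gating transformations where an LSTM has four. A quick back-of-the-envelope check (standard cell definitions, not the paper's specific models):

```python
def lstm_params(input_dim, hidden_dim):
    # 4 transformations (input, forget, cell candidate, output gates),
    # each with input weights, recurrent weights, and a bias
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def gru_params(input_dim, hidden_dim):
    # 3 transformations (update gate, reset gate, candidate state)
    return 3 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

# for a typical 512-unit cell over 512-dim inputs
lstm_n = lstm_params(512, 512)   # 2,099,200 parameters
gru_n = gru_params(512, 512)     # 1,574,400 parameters
```

The GRU is exactly 3/4 the size of the corresponding LSTM cell, which translates into fewer tensor operations per step.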
  • Item
    Effect of Batch Normalization and Stacked LSTMs on Video Captioning
    (Institute of Electrical and Electronics Engineers Inc., 2021) Sarathi, V.; Mujumdar, A.; Naik, D.
    Integrating visual content with natural language to generate image or video descriptions has been a challenging task for many years. Recent research in image captioning using Long Short-Term Memory (LSTM) has motivated its possible application in video captioning, where a video is converted into an array of frames, or images, and this array, along with the captions for the video, is used to train the LSTM network to associate the video with sentences. However, very little is known about how fine-tuning techniques such as batch normalization or stacked LSTM models affect performance in video captioning. For this project, we compare the performance of the base model described in [1] against the same model augmented with batch normalization and with stacked LSTMs. © 2021 IEEE.
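The batch-normalization technique studied above standardizes each feature dimension across a batch before it enters the recurrent stack (stacking itself just feeds layer 1's hidden states into layer 2). A minimal NumPy sketch of the normalization step, with illustrative shapes, assuming the common training-time formulation:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature dimension over the batch axis,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# a hypothetical batch of 32 frame-feature vectors with skewed statistics
rng = np.random.default_rng(1)
frames = rng.standard_normal((32, 4096)) * 5.0 + 3.0
normed = batch_norm(frames)
```

After normalization the features are zero-mean and unit-variance per dimension, which typically stabilizes training of the LSTM stack downstream.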
  • Item
    Image Captioning with Attention Based Model
    (Institute of Electrical and Electronics Engineers Inc., 2021) Yv, S.S.; Choubey, Y.; Naik, D.
    Automatically describing the content of an image is a fundamental problem in Artificial Intelligence that connects computer vision and Natural Language Processing (NLP). In the proposed work, a generative model is presented that combines recent developments in machine learning and computer vision in a deep recurrent architecture describing the image with natural language phrases. Given a training image, the trained model maximizes the likelihood of the target description sentence. The model's efficiency, its accuracy, and the language it learns depend only on the image descriptions, as demonstrated by experiments performed on several datasets. © 2021 IEEE.
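"Maximizing the likelihood of the target description sentence" factorizes, by the chain rule, into a sum of per-word log-probabilities conditioned on the image and the preceding words. A trivial sketch of that objective (the per-step probabilities here are made-up numbers, not model outputs):

```python
import math

def sentence_log_likelihood(step_probs):
    """Chain-rule objective a captioner maximizes:
    log p(S | I) = sum over t of log p(w_t | I, w_1..w_{t-1})."""
    return sum(math.log(p) for p in step_probs)

# hypothetical per-word probabilities for a 4-word caption
ll = sentence_log_likelihood([0.9, 0.5, 0.8, 0.6])
```

Training adjusts the network so that these per-step probabilities, and hence the sum, are as large as possible for the ground-truth captions.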
  • Item
    Describing Image with Attention based GRU
    (Institute of Electrical and Electronics Engineers Inc., 2021) Mallick, V.R.; Naik, D.
    Generating descriptions for images is a popular research topic today. In the encoder-decoder model, a CNN works as the encoder to encode the image and passes the result to an RNN decoder, which generates the image description as natural language sentences; LSTM is widely used as the RNN decoder. The attention mechanism has also played an important role in this field by enhancing object detection. Inspired by these recent advances in computer vision, we used a GRU in place of an LSTM as the decoder for our image captioning model, and incorporated an attention mechanism with the GRU decoder to improve the precision of the generated captions. The GRU needs fewer tensor operations than the LSTM, and hence trains faster. © 2021 IEEE.
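The "fewer tensor operations" claim comes from the GRU's structure: two gates and no separate cell state, versus the LSTM's three gates plus a cell state. A single GRU decoder step, written out in NumPy with the standard cell equations (weight names and dimensions are illustrative, not the paper's):

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate hidden state
    return (1.0 - z) * h + z * h_tilde         # interpolate old and new

# illustrative dimensions: 256-dim input, 128-unit hidden state
rng = np.random.default_rng(0)
in_dim, hid_dim = 256, 128
params = [rng.standard_normal(shape) * 0.1
          for shape in [(in_dim, hid_dim), (hid_dim, hid_dim)] * 3]
h1 = gru_step(rng.standard_normal(in_dim), np.zeros(hid_dim), *params)
```

In a captioning decoder, `x` would be the concatenation of the previous word's embedding and the attention context, and `h1` feeds the output projection over the vocabulary.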
  • Item
    Semantic context driven language descriptions of videos using deep neural network
    (Springer Science and Business Media Deutschland GmbH, 2022) Naik, D.; Jaidhar, C.D.
    The massive addition of data to the internet in text, images, and videos has made computer vision-based tasks challenging in the big data domain. Recent exploration of video data and progress in visual information captioning has been an arduous task in computer vision. Visual captioning is attributable to integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework with a 2D-Convolutional Neural Network (CNN) model and layered Long Short-Term Memory (LSTM) as the encoder and an LSTM model integrated with an attention mechanism working as the decoder with a hybrid loss function. Visual feature vectors extracted from the video frames using a 2D-CNN model capture spatial features. Specifically, the visual feature vectors are fed into the layered LSTM to capture the temporal information. The attention mechanism enables the decoder to perceive and focus on relevant objects and correlate the visual context and language content to produce semantically correct captions. The visual features and GloVe word embeddings are input into the decoder to generate natural semantic descriptions for the videos. The performance of the proposed framework is evaluated on the video captioning benchmark dataset Microsoft Video Description (MSVD) using various well-known evaluation metrics. The experimental findings indicate that the suggested framework outperforms state-of-the-art techniques: compared to state-of-the-art research methods, the proposed model improved all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The progression in all scores indicates a better grasp of the context of the inputs, which results in more accurate caption prediction. © 2022, The Author(s).
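The decoder in such a framework is typically trained with teacher forcing, minimizing the per-token negative log-likelihood of the ground-truth caption. The abstract does not specify the components of its hybrid loss, so this NumPy sketch shows only the standard cross-entropy term that captioning objectives are built on:

```python
import numpy as np

def caption_nll(logits, targets):
    """Mean negative log-likelihood of the target caption tokens.

    logits:  (seq_len, vocab_size) raw decoder outputs, one row per time step
    targets: (seq_len,) ground-truth word indices (teacher forcing)
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick out the log-probability assigned to each ground-truth token
    return -log_probs[np.arange(len(targets)), targets].mean()

# sanity check: uniform logits over a 10-word vocabulary give loss ln(10)
loss = caption_nll(np.zeros((5, 10)), np.array([1, 2, 3, 4, 5]))
```

Each decoding step conditions on the attention context and the GloVe embedding of the previous ground-truth word; the loss rewards placing probability mass on the next reference word.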
  • Item
    A novel Multi-Layer Attention Framework for visual description prediction using bidirectional LSTM
    (Springer Science and Business Media Deutschland GmbH, 2022) Naik, D.; Jaidhar, C.D.
    The massive influx of text, images, and videos to the internet has recently increased the challenge of computer vision-based tasks in big data. Integrating visual data with natural language to generate video explanations has been a challenge for decades. However, recent experiments on image/video captioning that employ Long Short-Term Memory (LSTM) have piqued the interest of researchers studying its possible application in video captioning. The proposed video captioning architecture combines a bidirectional multilayer LSTM (BiLSTM) encoder and a unidirectional decoder. The architecture also considers temporal relations when creating superior global video representations. In contrast to the majority of prior work, the most relevant features of a video are selected and utilized specifically for captioning purposes. Existing methods utilize a single-layer attention mechanism for linking visual input with phrase meaning. This approach employs LSTMs and a multilayer attention mechanism to extract features from videos, construct links between multi-modal (words and visual content) representations, and generate sentences with rich semantic coherence. In addition, we evaluated the performance of the suggested system using a benchmark dataset for video captioning. The obtained results reveal superior performance relative to state-of-the-art works in METEOR and promising performance relative to the BLEU score. In terms of quantitative performance, the proposed approach outperforms most existing methodologies. © 2022, The Author(s).
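The bidirectional encoder above reads the frame sequence in both directions and concatenates the two hidden-state sequences, so each time step sees past and future context. A minimal NumPy sketch of that data flow, with a plain tanh recurrence standing in for the LSTM cell to keep it short (all names and dimensions are illustrative):

```python
import numpy as np

def birnn_encode(frames, Wf, Uf, Wb, Ub):
    """Bidirectional encoding: run a recurrence forward and backward
    over the frame features and concatenate the two state sequences."""
    T = frames.shape[0]
    hid_dim = Uf.shape[0]
    h_f, h_b = np.zeros(hid_dim), np.zeros(hid_dim)
    fwd, bwd = [], [None] * T
    for t in range(T):                       # forward pass over frames
        h_f = np.tanh(frames[t] @ Wf + h_f @ Uf)
        fwd.append(h_f)
    for t in reversed(range(T)):             # backward pass over frames
        h_b = np.tanh(frames[t] @ Wb + h_b @ Ub)
        bwd[t] = h_b
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)  # (T, 2*hid_dim)

# illustrative: 20 frames of 64-dim features, 32 hidden units per direction
rng = np.random.default_rng(2)
frames = rng.standard_normal((20, 64))
Wf, Wb = rng.standard_normal((64, 32)) * 0.1, rng.standard_normal((64, 32)) * 0.1
Uf, Ub = rng.standard_normal((32, 32)) * 0.1, rng.standard_normal((32, 32)) * 0.1
encoded = birnn_encode(frames, Wf, Uf, Wb, Ub)
```

The resulting per-step vectors of doubled width are what the attention layers in the decoder would score against when generating each word.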