Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
Search Results
5 results
Item
A Novel Approach for Video Captioning Based on Semantic Cross Embedding and Skip-Connection (Springer Science and Business Media Deutschland GmbH, 2021)
Radarapu, R.; Bandari, N.; Muthyam, S.; Naik, D.
Video Captioning is the task of describing the content of a video in simple natural language. The Encoder-Decoder architecture is the most widely used architecture for this task. Recent works improve performance by exploiting 3D Convolutional Neural Networks (CNNs) or Transformers, or by changing the structure of the basic Long Short-Term Memory (LSTM) units used in the Encoder-Decoder. In this paper, we propose the use of a sentence vector to improve the performance of the Encoder-Decoder model. This sentence vector acts as an intermediary between the video space and the text space; because it bridges the two vector spaces, we refer to it as a semantic cross embedding. The sentence vector is generated from the video and is used by the Decoder, along with previously generated words, to generate a suitable description. We also employ a skip-connection in the Encoder part of the model. Skip-connections are usually employed to tackle the vanishing-gradients problem in deep neural networks; however, our experiments show that a two-layer LSTM with a skip-connection performs better than a Bidirectional LSTM for our model. The use of a sentence vector also improves performance considerably. All our experiments are performed on the MSVD dataset. © 2021, Springer Nature Singapore Pte Ltd.

Item
Video to Text Generation Using Sentence Vector and Skip Connections (Springer, 2023)
Mule, H.; Naik, D.
Video data is growing rapidly, and robust algorithms are needed to interpret it; a textual alternative is more effective and saves time. We aim to produce captions for videos. The most popular architecture for this task is the encoder-decoder (E-D) model.
Recent attempts have focused on improving performance by including 3D-CNNs, transformers, or structural changes to the basic LSTM units used in the E-D model. In this work, sentence vectors are used to improve the E-D model's performance. A sentence vector is generated from the video file and used by the decoder, together with previously generated words, to generate an accurate description. A skip connection in the encoder part avoids the vanishing-gradients problem. All of our studies use the MSVD and CHARADES datasets. Four well-known metrics, BLEU@4, METEOR, ROUGE, and CIDEr, are used for performance evaluation. We compared the performance of BERT, ELMo, and GloVe word embeddings; in our experimental analysis, BERT outperformed ELMo and GloVe. For feature extraction, the pretrained CNNs NASNet-Large, VGG-16, Inception-v4, and ResNet-152 are used, and NASNet-Large outperformed the other models. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Item
Semantic context driven language descriptions of videos using deep neural network (Springer Science and Business Media Deutschland GmbH, 2022)
Naik, D.; Jaidhar, C.D.
The massive addition of text, image, and video data to the internet has made computer-vision tasks challenging in the big-data domain. Recent exploration of video data and progress in visual-information captioning has been an arduous task in computer vision. Visual captioning integrates visual information with natural-language descriptions. This paper proposes an encoder-decoder framework with a 2D-Convolutional Neural Network (CNN) model and a layered Long Short-Term Memory (LSTM) as the encoder, and an LSTM model integrated with an attention mechanism as the decoder, trained with a hybrid loss function. Visual feature vectors extracted from the video frames using the 2D-CNN model capture spatial features.
Specifically, the visual feature vectors are fed into the layered LSTM to capture temporal information. The attention mechanism enables the decoder to perceive and focus on relevant objects and to correlate the visual context with the language content, producing semantically correct captions. The visual features and GloVe word embeddings are input to the decoder to generate natural semantic descriptions for the videos. The performance of the proposed framework is evaluated on the video-captioning benchmark dataset Microsoft Video Description (MSVD) using well-known evaluation metrics. The experimental findings indicate that the proposed framework outperforms state-of-the-art techniques: it improves all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The improvement across all scores indicates a better grasp of the input context, which results in more accurate caption prediction. © 2022, The Author(s).

Item
A novel Multi-Layer Attention Framework for visual description prediction using bidirectional LSTM (Springer Science and Business Media Deutschland GmbH, 2022)
Naik, D.; Jaidhar, C.D.
The massive influx of text, images, and videos to the internet has increased the challenge of computer-vision tasks on big data. Integrating visual data with natural language to generate video explanations has been a challenge for decades. Recent experiments on image/video captioning that employ Long Short-Term Memory (LSTM) have piqued researchers' interest in its application to video captioning. The proposed video-captioning architecture combines a bidirectional multilayer LSTM (BiLSTM) encoder with a unidirectional decoder. The architecture also considers temporal relations when creating superior global video representations.
In contrast to the majority of prior work, the most relevant features of a video are selected and utilized specifically for captioning. Existing methods utilize a single-layer attention mechanism to link visual input with phrase meaning; our approach employs LSTMs and a multilayer attention mechanism to extract features from videos, construct links between multi-modal (word and visual) representations, and generate sentences with rich semantic coherence. We evaluated the performance of the proposed system using a benchmark video-captioning dataset. The results reveal superior performance relative to state-of-the-art works on METEOR and promising performance on the BLEU score. In terms of quantitative performance, the proposed approach outperforms most existing methodologies. © 2022, The Author(s).

Item
Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM (Springer, 2024)
Naik, D.; Jaidhar, C.D.
The principal objective of video/image captioning is to portray the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make video more accessible to deaf and hard-of-hearing individuals, to help people focus on and recall information more readily, and to allow viewing in sound-sensitive locations. The most frequently utilized design paradigm is the structurally improved encoder-decoder configuration. Recent developments emphasize creative structural modifications that maximize efficiency while demonstrating viability in real-world applications. Well-researched advances such as deep Convolutional Neural Networks (CNNs) and Sentence Transformers are trending in encoder-decoders.
This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector. The sentence context vector emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism selectively concentrates on the context of the important frames. A contextual hybrid embedding block is also presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN architectures and various word embeddings. It is assessed on two benchmark video-captioning datasets, MSVD and MSR-VTT, using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. In the experimental exploration, when the proposed model with NASNet-Large alone is viewed across all three embeddings, BERT performed better on the MSVD dataset than the other two embeddings. For feature extraction, Inception-v4 outperformed VGG-16, ResNet-152, and NASNet-Large. Considering word embeddings, BERT is far superior to ELMo and GloVe on the MSR-VTT dataset. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
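
Several of the items above describe a two-layer LSTM encoder with a skip connection that bypasses the vanishing-gradients problem. The following is a minimal illustrative sketch of that idea, not any paper's actual implementation: all sizes, initialisations, and the residual-addition placement are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """A bare-bones LSTM cell operating on one time step at a time."""
    def __init__(self, input_size, hidden_size):
        # One stacked weight matrix for the input, forget, cell, and output gates.
        self.W = rng.standard_normal((4 * hidden_size, input_size + hidden_size)) * 0.1
        self.b = np.zeros(4 * hidden_size)
        self.hidden_size = hidden_size

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        H = self.hidden_size
        i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

def encode(frames, cell1, cell2, hidden_size):
    """Run frame features through two LSTM layers with a skip connection."""
    h1 = c1 = h2 = c2 = np.zeros(hidden_size)
    for x in frames:
        h1, c1 = cell1.step(x, h1, c1)
        h2, c2 = cell2.step(h1, h2, c2)
        h2 = h2 + h1  # skip connection: add layer 1's output to layer 2's
    return h2  # final video representation

feat_dim = hidden = 8  # equal sizes so the residual addition is shape-valid
frames = [rng.standard_normal(feat_dim) for _ in range(5)]
cell1 = LSTMCell(feat_dim, hidden)
cell2 = LSTMCell(hidden, hidden)
video_vec = encode(frames, cell1, cell2, hidden)
print(video_vec.shape)  # (8,)
```

The skip addition gives gradients a direct path around the second LSTM layer, which is the property the abstracts credit for outperforming a bidirectional encoder in their setting.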
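
The attention-based decoders described above score each encoder frame feature against the decoder's current hidden state and take a softmax-weighted sum as the visual context. This is a hedged sketch of that soft-attention step; the dot-product scoring function and all names are illustrative assumptions, not the papers' exact design.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def attention_context(frame_feats, decoder_state):
    """frame_feats: (T, d) encoder outputs; decoder_state: (d,) current hidden state."""
    scores = frame_feats @ decoder_state   # (T,) alignment scores, one per frame
    weights = softmax(scores)              # attention distribution over frames
    context = weights @ frame_feats        # (d,) weighted sum of frame features
    return context, weights

rng = np.random.default_rng(1)
feats = rng.standard_normal((6, 4))   # 6 frames, 4-dim features
state = rng.standard_normal(4)
context, weights = attention_context(feats, state)
print(context.shape)  # (4,)
```

At each decoding step the context vector is recomputed, which is how the decoder "selectively concentrates on the context of the important frames" while emitting each word.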
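
All five items evaluate with BLEU among other metrics. To make that metric concrete, here is a simplified single-reference BLEU@4: standard modified n-gram precision with a brevity penalty, but omitting the smoothing and multi-reference handling found in full evaluation toolkits.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_prec += 0.25 * math.log(overlap / total)
    # Brevity penalty discourages overly short candidate captions.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)

score = bleu4("a man is playing a guitar", "a man is playing a guitar")
print(score)  # identical sentences score 1.0
```

BLEU@4 rewards exact n-gram overlap only; that limitation is why the papers above also report METEOR, ROUGE, and CIDEr, which credit synonyms, recall, and consensus respectively.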
