Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM

dc.contributor.authorNaik, D.
dc.contributor.authorJaidhar, C.D.
dc.date.accessioned2026-02-04T12:25:42Z
dc.date.issued2024
dc.description.abstractThe principal objective of video/image captioning is to describe the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make video more accessible to deaf and hard-of-hearing individuals, to help viewers focus on and recall information more readily, and to enable watching in sound-sensitive locations. The most frequently utilized design paradigm is the structurally improved encoder-decoder configuration. Recent developments emphasize various creative structural modifications that maximize efficiency while demonstrating viability in real-world applications. The use of well-known and well-researched advances such as deep Convolutional Neural Networks (CNNs) and Sentence Transformers is a growing trend in encoder-decoder designs. This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector. This sentence context vector emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism is utilized to selectively concentrate on the context of the important frames. In addition, a contextual hybrid embedding block is presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN backbones and various word embeddings. It is assessed on two benchmark video captioning datasets, MSVD and MSR-VTT, using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. Experimental exploration shows that, when the proposed model with NASNet-Large is evaluated across all three embeddings, the BERT results on the MSVD dataset surpassed those obtained with the other two embeddings. For feature extraction, Inception-v4 outperformed VGG-16, ResNet-152, and NASNet-Large.
Considering word embeddings, BERT is far superior to ELMo and GloVe on the MSR-VTT dataset. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
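The frame-level attention described in the abstract can be illustrated with a minimal sketch: a toy dot-product attention that scores per-frame CNN features against the decoder's hidden state, normalizes the scores with a softmax, and pools the frames into a single context vector. The function name, the shapes, and the identity score matrix `W` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def attention_context(frame_feats, hidden, W):
    """Toy dot-product attention over video frames.

    frame_feats: (T, d) array of per-frame CNN features.
    hidden:      (d,) decoder hidden state.
    W:           (d, d) score matrix (identity here, for illustration).
    Returns the (d,) context vector and the (T,) attention weights.
    """
    scores = frame_feats @ W @ hidden          # one relevance score per frame
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ frame_feats            # attention-weighted frame pooling
    return context, weights

# Illustrative usage with random "frame features".
rng = np.random.default_rng(0)
T, d = 8, 16
feats = rng.normal(size=(T, d))
h = rng.normal(size=d)
ctx, w = attention_context(feats, h, np.eye(d))
```

In the paper's setting, `ctx` would be fed to the short-connected LSTM decoder at each step so that word generation can selectively concentrate on the most relevant frames.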
dc.identifier.citationMultimedia Tools and Applications, 2024, 83, 4, pp. 11187-11213
dc.identifier.issn13807501
dc.identifier.urihttps://doi.org/10.1007/s11042-023-15978-7
dc.identifier.urihttps://idr.nitk.ac.in/handle/123456789/21504
dc.publisherSpringer
dc.subjectAudition
dc.subjectComputer vision
dc.subjectConvolution
dc.subjectConvolutional neural networks
dc.subjectDecoding
dc.subjectDeep neural networks
dc.subjectLarge dataset
dc.subjectLong short-term memory
dc.subjectNetwork architecture
dc.subjectSignal encoding
dc.subjectVector spaces
dc.subjectVectors
dc.subjectContext vector
dc.subjectConvolutional neural network
dc.subjectEmbeddings
dc.subjectEncoder-decoder
dc.subjectImage captioning
dc.subjectLSTM
dc.subjectMulti-head attention
dc.subjectVideo captioning
dc.subjectVideo image
dc.subjectVideo-clips
dc.titleVideo Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM
