Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM

dc.contributor.authorNaik, D.
dc.contributor.authorJaidhar, C.D.
dc.date.accessioned2026-02-04T12:25:42Z
dc.date.issued2024
dc.description.abstractThe principal objective of video/image captioning is to describe the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make video more accessible to deaf and hard-of-hearing individuals, to help viewers focus on and recall information more readily, and to enable watching in sound-sensitive locations. The most frequently utilized design paradigm is the structurally improved encoder-decoder configuration. Recent developments emphasize various creative structural modifications that maximize efficiency while demonstrating viability in real-world applications. The use of well-known and well-researched advances such as deep Convolutional Neural Networks (CNNs) and Sentence Transformers is a growing trend in encoder-decoder designs. This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector. This sentence context vector emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism is utilized to selectively concentrate on the context of the important frames. In addition, a contextual hybrid embedding block is presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN backbones and various word embeddings. It is assessed on two benchmark video captioning datasets, MSVD and MSR-VTT, using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. Experimental exploration shows that, when the proposed model with NASNet-Large is evaluated across all three embeddings, the BERT results on the MSVD dataset surpassed those obtained with the other two embeddings. For feature extraction, Inception-v4 outperformed VGG-16, ResNet-152, and NASNet-Large.
Considering word embeddings, BERT is far superior to ELMo and GloVe on the MSR-VTT dataset. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
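The frame-level attention described in the abstract can be illustrated with a minimal sketch: a toy dot-product attention that scores per-frame CNN features against the decoder's hidden state, normalizes the scores with a softmax, and pools the frames into a single context vector. The function name, the shapes, and the identity score matrix `W` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def attention_context(frame_feats, hidden, W):
    """Toy dot-product attention over video frames.

    frame_feats: (T, d) array of per-frame CNN features.
    hidden:      (d,) decoder hidden state.
    W:           (d, d) score matrix (identity here, for illustration).
    Returns the (d,) context vector and the (T,) attention weights.
    """
    scores = frame_feats @ W @ hidden          # one relevance score per frame
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ frame_feats            # attention-weighted frame pooling
    return context, weights

# Illustrative usage with random "frame features".
rng = np.random.default_rng(0)
T, d = 8, 16
feats = rng.normal(size=(T, d))
h = rng.normal(size=d)
ctx, w = attention_context(feats, h, np.eye(d))
```

In the paper's setting, `ctx` would be fed to the short-connected LSTM decoder at each step so that word generation can selectively concentrate on the most relevant frames.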
dc.identifier.citationMultimedia Tools and Applications, 2024, 83, 4, pp. 11187-11213
dc.identifier.issn13807501
dc.identifier.urihttps://doi.org/10.1007/s11042-023-15978-7
dc.identifier.urihttps://idr.nitk.ac.in/handle/123456789/21504
dc.publisherSpringer
dc.subjectAudition
dc.subjectComputer vision
dc.subjectConvolution
dc.subjectConvolutional neural networks
dc.subjectDecoding
dc.subjectDeep neural networks
dc.subjectLarge dataset
dc.subjectLong short-term memory
dc.subjectNetwork architecture
dc.subjectSignal encoding
dc.subjectVector spaces
dc.subjectVectors
dc.subjectContext vector
dc.subjectConvolutional neural network
dc.subjectEmbeddings
dc.subjectEncoder-decoder
dc.subjectImage captioning
dc.subjectLSTM
dc.subjectMulti-head attention
dc.subjectVideo captioning
dc.subjectVideo image
dc.subjectVideo-clips
dc.titleVideo Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM
