Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM
| dc.contributor.author | Naik, D. | |
| dc.contributor.author | Jaidhar, C.D. | |
| dc.date.accessioned | 2026-02-04T12:25:42Z | |
| dc.date.issued | 2024 | |
| dc.description.abstract | The principal objective of video/image captioning is to portray the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make video more accessible to deaf and hard-of-hearing individuals, to help viewers focus on and recall information more readily, and to allow viewing in sound-sensitive locations. The most frequently used design paradigm is the structurally improved encoder-decoder configuration. Recent developments emphasize various creative structural modifications to maximize efficiency while demonstrating their viability in real-world applications. Well-researched building blocks such as deep Convolutional Neural Networks (CNNs) and Sentence Transformers are increasingly adopted in these encoder-decoders. This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector. This sentence context vector emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism is utilized to selectively concentrate on the context of the important frames. In addition, a contextual hybrid embedding block is presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN architectures and various word embeddings. It is assessed on two benchmark video captioning datasets, MSVD and MSR-VTT, using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. In the experimental exploration, when the proposed model with NASNet-Large alone was evaluated across all three embeddings, the BERT results on the MSVD dataset were better than those obtained with the other two embeddings. For feature extraction, Inception-v4 outperformed VGG-16, ResNet-152, and NASNet-Large. Among the word embeddings, BERT was far superior to ELMo and GloVe on the MSR-VTT dataset. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature. | |
| dc.identifier.citation | Multimedia Tools and Applications, 2024, 83, 4, pp. 11187-11213 | |
| dc.identifier.issn | 1380-7501 | |
| dc.identifier.uri | https://doi.org/10.1007/s11042-023-15978-7 | |
| dc.identifier.uri | https://idr.nitk.ac.in/handle/123456789/21504 | |
| dc.publisher | Springer | |
| dc.subject | Audition | |
| dc.subject | Computer vision | |
| dc.subject | Convolution | |
| dc.subject | Convolutional neural networks | |
| dc.subject | Decoding | |
| dc.subject | Deep neural networks | |
| dc.subject | Large dataset | |
| dc.subject | Long short-term memory | |
| dc.subject | Network architecture | |
| dc.subject | Signal encoding | |
| dc.subject | Vector spaces | |
| dc.subject | Vectors | |
| dc.subject | Context vector | |
| dc.subject | Convolutional neural network | |
| dc.subject | Embeddings | |
| dc.subject | Encoder-decoder | |
| dc.subject | Image captioning | |
| dc.subject | LSTM | |
| dc.subject | Multi-head attention | |
| dc.subject | Video captioning | |
| dc.subject | Video image | |
| dc.subject | Video-clips | |
| dc.title | Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM |
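The abstract above describes an encoder-decoder captioning pipeline: CNN frame features, an attention mechanism over the important frames, a short-connected LSTM decoder, and a sentence context vector linking the video and text spaces. The following is a minimal, hypothetical PyTorch sketch of that general pattern, not the authors' implementation: all layer sizes, module names, and the way the sentence vector is injected are assumptions, and the paper's short connections and contextual hybrid embedding block are not reproduced here.

```python
# Illustrative sketch only (assumed layer sizes and names), not the authors' released code:
# CNN frame features -> multi-head attention -> LSTM decoder conditioned on a sentence vector.
import torch
import torch.nn as nn


class AttentiveLSTMDecoder(nn.Module):
    def __init__(self, feat_dim=1536, sent_dim=768, embed_dim=300,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)   # project CNN frame features
        self.sent_proj = nn.Linear(sent_dim, hidden_dim)     # project sentence context vector
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # decoder input = word embedding + attended visual context + sentence context
        self.lstm = nn.LSTMCell(embed_dim + 2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, sent_vec, captions):
        # frame_feats: (B, T, feat_dim) CNN features per sampled frame
        # sent_vec:    (B, sent_dim) sentence embedding (e.g. from BERT/ELMo/GloVe pooling)
        # captions:    (B, L) token ids of the ground-truth caption (teacher forcing)
        B, L = captions.shape
        frames = self.frame_proj(frame_feats)            # (B, T, H)
        sent = self.sent_proj(sent_vec)                  # (B, H)
        h = frames.mean(dim=1)                           # initialize hidden state from mean-pooled frames
        c = torch.zeros_like(h)
        logits = []
        for t in range(L):
            word = self.embed(captions[:, t])            # (B, E)
            # attend over frame features, using the current hidden state as the query
            ctx, _ = self.attn(h.unsqueeze(1), frames, frames)
            ctx = ctx.squeeze(1)                         # (B, H)
            h, c = self.lstm(torch.cat([word, ctx, sent], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, L, vocab_size)


if __name__ == "__main__":
    model = AttentiveLSTMDecoder()
    feats = torch.randn(2, 28, 1536)        # e.g. features from Inception-v4 / NASNet-Large
    sent = torch.randn(2, 768)              # e.g. a BERT sentence vector
    caps = torch.randint(0, 10000, (2, 12))
    print(model(feats, sent, caps).shape)   # torch.Size([2, 12, 10000])
```

In this sketch the sentence vector is simply concatenated into the decoder input at every step; the paper's contextual hybrid embedding block presumably couples the encoder and decoder spaces in a more elaborate way.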
