Semantic context driven language descriptions of videos using deep neural network

dc.contributor.author: Naik, D.
dc.contributor.author: Jaidhar, C.D.
dc.date.accessioned: 2026-02-04T12:27:30Z
dc.date.issued: 2022
dc.description.abstract: The massive influx of text, image, and video data to the internet has made computer vision tasks challenging in the big data domain, and generating language descriptions of video content remains an arduous problem in computer vision. Visual captioning requires integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework in which a 2D-Convolutional Neural Network (CNN) together with a layered Long Short-Term Memory (LSTM) network serves as the encoder, and an LSTM integrated with an attention mechanism serves as the decoder, trained with a hybrid loss function. Visual feature vectors extracted from the video frames by the 2D-CNN capture spatial features, and these vectors are fed into the layered LSTM to capture temporal information. The attention mechanism enables the decoder to focus on relevant objects and to correlate visual context with language content, producing semantically correct captions. The visual features and GloVe word embeddings are input to the decoder to generate natural semantic descriptions of the videos. The framework is evaluated on the Microsoft Video Description (MSVD) video captioning benchmark using well-known evaluation metrics. The experimental findings indicate that the proposed framework outperforms state-of-the-art techniques, improving all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The improvement across all scores indicates a better grasp of the input context, which results in more accurate caption prediction. © 2022, The Author(s).
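As a minimal sketch of the pipeline the abstract describes (per-frame 2D-CNN features -> layered LSTM encoder -> attention-equipped LSTM decoder over GloVe-initialized embeddings), the PyTorch code below renders the encoder-decoder design. All module names, layer sizes, the additive-attention formulation, and the plain cross-entropy loss here are illustrative assumptions; this record does not specify the paper's exact configuration or its hybrid loss function.

# Illustrative sketch of the described encoder-decoder video captioner.
# Layer sizes, attention variant, and loss are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Layered LSTM over per-frame 2D-CNN feature vectors (spatial -> temporal)."""
    def __init__(self, feat_dim=2048, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, frame_feats):               # frame_feats: (B, T, feat_dim)
        outputs, state = self.lstm(frame_feats)
        return outputs, state                     # (B, T, hidden), final (h, c)

class AttentionDecoder(nn.Module):
    """LSTM decoder with additive attention over the encoder's time steps."""
    def __init__(self, vocab_size, embed_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # initialize from GloVe in practice
        self.attn_enc = nn.Linear(hidden, hidden, bias=False)
        self.attn_dec = nn.Linear(hidden, hidden, bias=False)
        self.attn_v = nn.Linear(hidden, 1, bias=False)
        self.cell = nn.LSTMCell(embed_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, enc_outputs, state, tokens):         # teacher forcing
        h, c = state[0][-1], state[1][-1]                  # top layer of encoder state
        logits = []
        for t in range(tokens.size(1)):
            # attention weights over the T encoder time steps
            scores = self.attn_v(torch.tanh(
                self.attn_enc(enc_outputs) + self.attn_dec(h).unsqueeze(1))).squeeze(-1)
            context = (F.softmax(scores, dim=1).unsqueeze(-1) * enc_outputs).sum(1)
            h, c = self.cell(torch.cat([self.embed(tokens[:, t]), context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, L, vocab)

# Usage with dummy data: 8-frame clips of 2048-d CNN features, 10-token captions.
enc, dec = Encoder(), AttentionDecoder(vocab_size=5000)
feats = torch.randn(4, 8, 2048)
caps = torch.randint(0, 5000, (4, 10))
enc_out, state = enc(feats)
logits = dec(enc_out, state, caps)
loss = F.cross_entropy(logits.reshape(-1, 5000), caps.reshape(-1))  # stand-in for the hybrid loss

The attention step lets each decoding step weight the encoder's per-frame outputs rather than relying on a single fixed summary vector, which is the mechanism the abstract credits with correlating visual context and language content.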
dc.identifier.citation: Journal of Big Data, 2022, 9, 1, pp. -
dc.identifier.uri: https://doi.org/10.1186/s40537-022-00569-4
dc.identifier.uri: https://idr.nitk.ac.in/handle/123456789/22306
dc.publisher: Springer Science and Business Media Deutschland GmbH
dc.subject: Benchmarking
dc.subject: Computer vision
dc.subject: Convolution
dc.subject: Convolutional neural networks
dc.subject: Decoding
dc.subject: Deep neural networks
dc.subject: Multilayer neural networks
dc.subject: Neural network models
dc.subject: Petroleum reservoir evaluation
dc.subject: Semantic Segmentation
dc.subject: Semantic Web
dc.subject: Semantics
dc.subject: Signal encoding
dc.subject: Visual languages
dc.subject: Attention
dc.subject: Attention mechanisms
dc.subject: Convolutional neural network
dc.subject: Features vector
dc.subject: Language description
dc.subject: Neural network model
dc.subject: Semantic context
dc.subject: Video captioning
dc.subject: Visual feature
dc.subject: Visual information
dc.subject: Long short-term memory
dc.title: Semantic context driven language descriptions of videos using deep neural network
