Semantic context driven language descriptions of videos using deep neural network

dc.contributor.author: Naik, D.
dc.contributor.author: Jaidhar, C.D.
dc.date.accessioned: 2026-02-04T12:27:30Z
dc.date.issued: 2022
dc.description.abstract: The massive influx of text, image, and video data to the internet has made computer vision tasks challenging in the big data domain, and generating language descriptions of video content remains an arduous problem in computer vision. Visual captioning requires integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework in which a 2D-Convolutional Neural Network (CNN) together with a layered Long Short-Term Memory (LSTM) network serves as the encoder, and an LSTM integrated with an attention mechanism serves as the decoder, trained with a hybrid loss function. Visual feature vectors extracted from the video frames by the 2D-CNN capture spatial features, and these vectors are fed into the layered LSTM to capture temporal information. The attention mechanism enables the decoder to focus on relevant objects and to correlate visual context with language content, producing semantically correct captions. The visual features and GloVe word embeddings are input to the decoder to generate natural semantic descriptions of the videos. The framework is evaluated on the Microsoft Video Description (MSVD) video captioning benchmark using well-known evaluation metrics. The experimental findings indicate that the proposed framework outperforms state-of-the-art techniques, improving all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The improvement across all scores indicates a better grasp of the input context, which results in more accurate caption prediction. © 2022, The Author(s).
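As a minimal sketch of the pipeline the abstract describes (per-frame 2D-CNN features -> layered LSTM encoder -> attention-equipped LSTM decoder over GloVe-initialized embeddings), the PyTorch code below renders the encoder-decoder design. All module names, layer sizes, the additive-attention formulation, and the plain cross-entropy loss here are illustrative assumptions; this record does not specify the paper's exact configuration or its hybrid loss function.

# Illustrative sketch of the described encoder-decoder video captioner.
# Layer sizes, attention variant, and loss are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Layered LSTM over per-frame 2D-CNN feature vectors (spatial -> temporal)."""
    def __init__(self, feat_dim=2048, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, frame_feats):               # frame_feats: (B, T, feat_dim)
        outputs, state = self.lstm(frame_feats)
        return outputs, state                     # (B, T, hidden), final (h, c)

class AttentionDecoder(nn.Module):
    """LSTM decoder with additive attention over the encoder's time steps."""
    def __init__(self, vocab_size, embed_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # initialize from GloVe in practice
        self.attn_enc = nn.Linear(hidden, hidden, bias=False)
        self.attn_dec = nn.Linear(hidden, hidden, bias=False)
        self.attn_v = nn.Linear(hidden, 1, bias=False)
        self.cell = nn.LSTMCell(embed_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, enc_outputs, state, tokens):         # teacher forcing
        h, c = state[0][-1], state[1][-1]                  # top layer of encoder state
        logits = []
        for t in range(tokens.size(1)):
            # attention weights over the T encoder time steps
            scores = self.attn_v(torch.tanh(
                self.attn_enc(enc_outputs) + self.attn_dec(h).unsqueeze(1))).squeeze(-1)
            context = (F.softmax(scores, dim=1).unsqueeze(-1) * enc_outputs).sum(1)
            h, c = self.cell(torch.cat([self.embed(tokens[:, t]), context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, L, vocab)

# Usage with dummy data: 8-frame clips of 2048-d CNN features, 10-token captions.
enc, dec = Encoder(), AttentionDecoder(vocab_size=5000)
feats = torch.randn(4, 8, 2048)
caps = torch.randint(0, 5000, (4, 10))
enc_out, state = enc(feats)
logits = dec(enc_out, state, caps)
loss = F.cross_entropy(logits.reshape(-1, 5000), caps.reshape(-1))  # stand-in for the hybrid loss

The attention step lets each decoding step weight the encoder's per-frame outputs rather than relying on a single fixed summary vector, which is the mechanism the abstract credits with correlating visual context and language content.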
dc.identifier.citation: Journal of Big Data, 2022, 9, 1, pp. -
dc.identifier.uri: https://doi.org/10.1186/s40537-022-00569-4
dc.identifier.uri: https://idr.nitk.ac.in/handle/123456789/22306
dc.publisher: Springer Science and Business Media Deutschland GmbH
dc.subject: Benchmarking
dc.subject: Computer vision
dc.subject: Convolution
dc.subject: Convolutional neural networks
dc.subject: Decoding
dc.subject: Deep neural networks
dc.subject: Multilayer neural networks
dc.subject: Neural network models
dc.subject: Petroleum reservoir evaluation
dc.subject: Semantic Segmentation
dc.subject: Semantic Web
dc.subject: Semantics
dc.subject: Signal encoding
dc.subject: Visual languages
dc.subject: Attention
dc.subject: Attention mechanisms
dc.subject: Convolutional neural network
dc.subject: Features vector
dc.subject: Language description
dc.subject: Neural network model
dc.subject: Semantic context
dc.subject: Video captioning
dc.subject: Visual feature
dc.subject: Visual information
dc.subject: Long short-term memory
dc.title: Semantic context driven language descriptions of videos using deep neural network
