Generating Short Video Description using Deep-LSTM and Attention Mechanism
Date
2021
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Nowadays, an enormous amount of video data is produced, because most people own video-capturing devices such as mobile phones and cameras. A video comprises visual, textual, and auditory data. Our aim is to analyze the visual features of a video and generate a caption so that users can grasp its content at a glance. Many techniques capture the static content of individual frames, but for video captioning, dynamic information is more important than static information. In this work, we introduce an Encoder-Decoder architecture using Deep Long Short-Term Memory (Deep-LSTM) and Bahdanau attention. In the encoder, the VGG16 Convolutional Neural Network (CNN) and a Deep-LSTM extract information from the frames; in the decoder, a Deep-LSTM combined with the attention mechanism describes the action performed in the video. We evaluated our model on the MSVD dataset, where it shows significant improvement over other video captioning models. © 2021 IEEE.
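To make the described architecture concrete, the following is a minimal sketch in TensorFlow/Keras of an encoder-decoder with Bahdanau attention, assuming pre-extracted per-frame VGG16 features as input. All layer sizes, the feature dimension, the vocabulary size, and the frame count below are illustrative assumptions, not the configuration reported in the paper.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau) attention over per-frame encoder states."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder states
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder state
        self.V = tf.keras.layers.Dense(1)       # scalar score per frame

    def call(self, query, values):
        # query: (batch, hidden) decoder state; values: (batch, frames, hidden)
        query = tf.expand_dims(query, 1)
        scores = self.V(tf.nn.tanh(self.W1(values) + self.W2(query)))
        weights = tf.nn.softmax(scores, axis=1)            # attention over frames
        context = tf.reduce_sum(weights * values, axis=1)  # weighted frame summary
        return context, weights

class Encoder(tf.keras.Model):
    """Stacked ('deep') LSTMs over per-frame VGG16 features."""
    def __init__(self, hidden=512):
        super().__init__()
        self.lstm1 = tf.keras.layers.LSTM(hidden, return_sequences=True)
        self.lstm2 = tf.keras.layers.LSTM(hidden, return_sequences=True,
                                          return_state=True)

    def call(self, frame_feats):
        # frame_feats: (batch, frames, 4096), e.g. VGG16 fc-layer activations
        # extracted offline (an assumption about the feature pipeline)
        x = self.lstm1(frame_feats)
        outputs, h, c = self.lstm2(x)
        return outputs, h, c

class Decoder(tf.keras.Model):
    """One step of caption decoding with attention over encoder outputs."""
    def __init__(self, vocab_size, embed_dim=256, hidden=512):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(hidden)
        self.lstm = tf.keras.layers.LSTM(hidden, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, token, enc_outputs, h, c):
        context, weights = self.attention(h, enc_outputs)
        x = self.embed(token)                              # (batch, 1, embed_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        out, h, c = self.lstm(x, initial_state=[h, c])
        return self.fc(out), h, c, weights

# Usage sketch: two videos of 20 frames, decoding one step from a <start> token.
enc, dec = Encoder(), Decoder(vocab_size=10000)
feats = tf.random.normal([2, 20, 4096])
enc_out, h, c = enc(feats)
logits, h, c, _ = dec(tf.constant([[1], [1]]), enc_out, h, c)

The design point the abstract emphasizes is visible here: the attention weights are recomputed at every decoding step, so the caption generator can focus on different frames as the described action unfolds, rather than summarizing the video into a single static vector.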
Description
Keywords
Computer Vision, Machine Translation, Natural Language Processing, Recurrent Neural Network, Video Captioning
Citation
2021 6th International Conference for Convergence in Technology (I2CT), 2021.
