Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

Search Results

Now showing 1 - 2 of 2
  • Item
    Affective Feedback Synthesis Towards Multimodal Text and Image Data
    (Association for Computing Machinery, 2023) Kumar, P.; Bhatt, G.; Ingle, O.; Goyal, D.; Raman, B.
    In this article, we have defined a novel task of affective feedback synthesis: generating feedback for input text and corresponding images in a way similar to how humans respond to multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments paired with image-text input. We have also constructed a large-scale dataset consisting of images, text, Twitter user comments, and the number of likes each comment received, collected by crawling news articles through Twitter feeds. The proposed system extracts textual features using a transformer-based textual encoder, and visual features using a Faster Region-based Convolutional Neural Network (Faster R-CNN) model. The textual and visual features are concatenated into multimodal features that the decoder uses to synthesize the feedback. We have compared the results of the proposed system with baseline models using quantitative and qualitative measures. The synthesized feedback has been analyzed using automatic and human evaluation, and found to be semantically similar to the ground-truth comments and relevant to the given text-image input. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. (An illustrative sketch of this concatenate-then-decode design appears after this list.)
  • Item
    Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM
    (Springer, 2024) Naik, D.; Jaidhar, C.D.
    The principal objective of video captioning is to describe the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make video more accessible to deaf and hard-of-hearing individuals, to help people focus on and recall information more readily, and to allow viewing in sound-sensitive locations. The most frequently used design paradigm is the structurally improved encoder-decoder configuration, and recent developments emphasize creative structural modifications that maximize efficiency while demonstrating viability in real-world applications. The use of well-known and well-researched advances such as deep Convolutional Neural Networks (CNNs) and Sentence Transformers is trending in encoder-decoder designs. This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector, which emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism is utilized to selectively concentrate on the context of the important frames. A contextual hybrid embedding block is also presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN architectures and various word embeddings, and assessed on two benchmark video captioning datasets, MSVD and MSR-VTT, using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. In the experimental exploration, when the proposed model with NASNet-Large is compared across all three embeddings, BERT performed better on the MSVD dataset than the other two embeddings. Inception-v4 outperformed VGG-16, ResNet-152, and NASNet-Large for feature extraction. Among the word embeddings, BERT was far superior to ELMo and GloVe on the MSR-VTT dataset. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature. (An illustrative sketch of this attention-based encoder-decoder appears after this list.)
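
The first abstract above describes concatenating transformer-based text features with Faster R-CNN region features into a shared multimodal representation that a decoder conditions on. The following is a minimal sketch of that wiring in PyTorch, not the authors' code: the class name, all dimensions, and the choice of a Transformer decoder (the abstract only says "decoder") are illustrative assumptions.

    import torch
    import torch.nn as nn

    class MultimodalFeedbackSynthesizer(nn.Module):  # hypothetical name
        def __init__(self, vocab_size=30000, d_model=512, visual_dim=2048):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)
            # Transformer-based textual encoder, as in the abstract.
            enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.text_encoder = nn.TransformerEncoder(enc, num_layers=4)
            # Project Faster R-CNN-style region features into the shared space.
            self.visual_proj = nn.Linear(visual_dim, d_model)
            # Decoder attends over the concatenated multimodal features
            # (a Transformer decoder is an assumption, not confirmed by the paper).
            dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec, num_layers=4)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, text_ids, visual_feats, feedback_ids):
            # text_ids: (B, T) token ids; visual_feats: (B, R, visual_dim)
            # region features; feedback_ids: (B, F) comment tokens (teacher forcing).
            text_mem = self.text_encoder(self.token_emb(text_ids))
            vis_mem = self.visual_proj(visual_feats)
            memory = torch.cat([text_mem, vis_mem], dim=1)  # multimodal features
            tgt = self.token_emb(feedback_ids)
            T = tgt.size(1)
            causal = torch.triu(torch.full((T, T), float("-inf"),
                                           device=tgt.device), diagonal=1)
            # Returns (B, F, vocab_size) logits over feedback tokens.
            return self.out(self.decoder(tgt, memory, tgt_mask=causal))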
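
The second abstract's captioner (per-frame CNN features, attention over frames, a short-connected LSTM decoder, and a sentence context vector) can be sketched similarly. Again this is an assumption-laden illustration in PyTorch: the 4032-dimensional feature size matches NASNet-Large's output, but the exact form of the short connection and the hybrid embedding block are guesses, since only the abstract is available here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttnLSTMCaptioner(nn.Module):  # hypothetical name
        def __init__(self, vocab_size=10000, feat_dim=4032, hid=512, sent_dim=384):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, hid)
            self.feat_proj = nn.Linear(feat_dim, hid)  # CNN frame features -> hid
            self.sent_proj = nn.Linear(sent_dim, hid)  # sentence context vector
            self.attn = nn.Linear(hid * 2, 1)          # additive attention score
            self.cell = nn.LSTMCell(hid * 3, hid)
            # Short-connection guess: the output layer sees both the LSTM hidden
            # state and the attended visual context.
            self.out = nn.Linear(hid * 2, vocab_size)

        def forward(self, frame_feats, sent_vec, cap_ids):
            # frame_feats: (B, N, feat_dim) per-frame CNN features
            # sent_vec: (B, sent_dim) sentence-transformer context vector
            # cap_ids: (B, T) caption tokens (teacher forcing)
            B = frame_feats.size(0)
            V = self.feat_proj(frame_feats)            # (B, N, hid)
            s = self.sent_proj(sent_vec)               # (B, hid)
            h = frame_feats.new_zeros(B, self.cell.hidden_size)
            c = torch.zeros_like(h)
            logits = []
            for t in range(cap_ids.size(1)):
                # Attend over frames using the current hidden state.
                scores = self.attn(torch.cat([V, h.unsqueeze(1).expand_as(V)], -1))
                ctx = (F.softmax(scores, dim=1) * V).sum(dim=1)  # (B, hid)
                # The sentence context vector is blended into every step's input.
                x = torch.cat([self.word_emb(cap_ids[:, t]), ctx, s], dim=-1)
                h, c = self.cell(x, (h, c))
                logits.append(self.out(torch.cat([h, ctx], dim=-1)))
            return torch.stack(logits, dim=1)          # (B, T, vocab_size)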