An Ensemble of Vision-Language Transformer-Based Captioning Model With Rotatory Positional Embeddings
Date
2025
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Image captioning is an active and important research area focused on automatically generating textual descriptions of images. Traditional models, which mainly use an encoder-decoder framework built on Convolutional Neural Networks (CNNs), often fail to capture the complex spatial and sequential relationships in visual data, motivating more sophisticated solutions. The proposed work introduces an ensemble model that integrates CNN, Graph Convolutional Network (GCN), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures. With the addition of Rotary Positional Encoding (RoPE), the approach achieves a 97% increase in CIDEr scores on the Flickr30K dataset and a 28.6% improvement on the Flickr8K dataset. By incorporating GCN and BiLSTM layers, the model captures essential relationships within the data, and the combination of these architectures yields more accurate and contextually rich captions for automated image-to-text applications. The proposed ensemble model with RoPE achieved strong performance on the Flickr8k and Flickr30k datasets, with scores of 80.62 and 95.0 for BLEU-1, 72.01 and 90.51 for BLEU-2, 63.12 and 81.24 for BLEU-3, 48.32 and 68.8 for BLEU-4, 74.26 and 81.89 for METEOR, 80.24 and 84.29 for ROUGE-L, 118.94 and 155.77 for CIDEr, and 48.7 and 39.0 for SPICE, respectively. © 2013 IEEE.
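To make the Rotary Positional Encoding mentioned in the abstract concrete, the sketch below rotates each pair of embedding channels by a position-dependent angle before the attention dot product, so relative positions are encoded directly in query-key similarities. This is a minimal illustrative sketch in NumPy; the function name rotary_embedding and the tensor shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply RoPE to a tensor of shape (seq_len, dim), dim even.

    Each channel pair (2i, 2i+1) at position p is rotated by
    theta_i = p / base**(2i/dim), so the attention dot product
    between rotated queries and keys depends on relative position.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "embedding dimension must be even"
    # Per-pair inverse frequencies and per-position rotation angles.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)           # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate query/key projections before computing attention scores.
q = rotary_embedding(np.random.randn(16, 64))
k = rotary_embedding(np.random.randn(16, 64))
scores = q @ k.T / np.sqrt(64)
```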
Description
Keywords
Digital elevation model, Graph embeddings, Graph neural networks, Image coding, Image enhancement, Long short-term memory, Network coding, Network embeddings, Visual languages, Attention mechanisms, Bidirectional long short term memory model, Convolutional neural network, Convolutional neural network model, Embeddings, Graph convolution network, Image caption, Image caption generation, Memory modeling, Neural network model, Positional embedding, Rotary positional embedding, Short term memory, Transformer, Convolutional neural networks
Citation
IEEE Access, 2025, 13, pp. 59841-59865
