An Ensemble of Vision-Language Transformer-Based Captioning Model With Rotatory Positional Embeddings
Date
2025
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Image captioning is an active and important research area focused on automatically generating textual descriptions of images. Traditional models, which mainly use an encoder-decoder framework built on Convolutional Neural Networks (CNNs), often fail to capture the complex spatial and sequential relationships in visual data, motivating more sophisticated solutions. The proposed work introduces an ensemble model that integrates CNN, Graph Convolutional Network (GCN), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures. With the addition of Rotary Positional Encoding (RoPE), the approach achieves a 97% increase in CIDEr scores on the Flickr30K dataset and a 28.6% improvement on the Flickr8K dataset. By incorporating GCN and BiLSTM layers, the model captures essential relationships within the data, and the combination of these architectures yields more accurate and contextually rich captions for automated image-to-text applications. The proposed ensemble model with RoPE achieved strong performance on the Flickr8k and Flickr30k datasets, with scores of 80.62 and 95.0 for BLEU-1, 72.01 and 90.51 for BLEU-2, 63.12 and 81.24 for BLEU-3, 48.32 and 68.8 for BLEU-4, 74.26 and 81.89 for METEOR, 80.24 and 84.29 for ROUGE-L, 118.94 and 155.77 for CIDEr, and 48.7 and 39.0 for SPICE, respectively. © 2013 IEEE.
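To make the Rotary Positional Encoding mentioned in the abstract concrete, the sketch below rotates each pair of embedding channels by a position-dependent angle before the attention dot product, so relative positions are encoded directly in query-key similarities. This is a minimal illustrative sketch in NumPy; the function name rotary_embedding and the tensor shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply RoPE to a tensor of shape (seq_len, dim), dim even.

    Each channel pair (2i, 2i+1) at position p is rotated by
    theta_i = p / base**(2i/dim), so the attention dot product
    between rotated queries and keys depends on relative position.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "embedding dimension must be even"
    # Per-pair inverse frequencies and per-position rotation angles.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)           # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate query/key projections before computing attention scores.
q = rotary_embedding(np.random.randn(16, 64))
k = rotary_embedding(np.random.randn(16, 64))
scores = q @ k.T / np.sqrt(64)
```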
Description
Keywords
Digital elevation model, Graph embeddings, Graph neural networks, Image coding, Image enhancement, Long short-term memory, Network coding, Network embeddings, Visual languages, Attention mechanisms, Bidirectional long short term memory model, Convolutional neural network, Convolutional neural network model, Embeddings, Graph convolution network, Image caption, Image caption generation, Memory modeling, Neural network model, Positional embedding, Rotary positional embedding, Short term memory, Transformer, Convolutional neural networks
Citation
IEEE Access, 2025, 13, pp. 59841-59865
