XGL-T transformer model for intelligent image captioning

Multimedia Tools and Applications

Abstract

Image captioning extracts multiple semantic features from an image and integrates them into a sentence-level description. Generating accurate captions requires learning higher-order interactions among detected objects and the relationships between them, yet most existing systems consider only first-order interactions and ignore the higher-order ones. Extracting discriminative higher-order semantic visual features is particularly challenging in images densely populated with objects. In this paper, an efficient higher-order interaction learning framework is proposed for encoder-decoder based image captioning. A scaled version of the Gaussian Error Linear Unit (GELU) activation function, x-GELU, is introduced that mitigates vanishing gradients and enhances feature learning. To leverage higher-order interactions among multiple objects, an XGL Transformer (XGL-T) model is introduced that exploits both spatial and channel-wise attention by integrating four XGL attention modules in the image encoder and one in a Bilinear Long Short-Term Memory guided sentence decoder. The proposed model captures rich semantic concepts from objects, attributes, and their relationships. Extensive experiments are conducted on the publicly available MSCOCO Karpathy test split, and the best performance is observed as 81.5 BLEU@1, 67.1 BLEU@2, 51.6 BLEU@3, 39.9 BLEU@4, 134 CIDEr, 59.9 ROUGE-L, 29.8 METEOR, and 23.8 SPICE using the CIDEr-D score optimization strategy. These scores represent significant improvements over state-of-the-art results. An ablation study is also carried out to support the experimental observations.
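
The two components named above, the scaled x-GELU activation and the spatial plus channel-wise XGL attention, are defined in the body of the paper rather than in the abstract. As a rough illustration only, the PyTorch sketch below assumes x-GELU is a GELU with a single learnable scale, and substitutes a generic squeeze-and-excitation-style channel gate combined with softmax spatial pooling for the attention module; all class names and parameter choices are hypothetical and do not reproduce the authors' implementation.

```python
# Minimal, hypothetical sketch of the two ideas named in the abstract:
# (i) an "x-GELU" activation, assumed here to be GELU times a learnable scale, and
# (ii) a block combining spatial and channel-wise attention over region features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class XGELU(nn.Module):
    """Scaled GELU. The paper's exact scaling is not given in the abstract;
    a single learnable scalar alpha is assumed here for illustration."""

    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * F.gelu(x)


class SpatialChannelAttention(nn.Module):
    """Generic spatial + channel-wise attention over N region features of size D.
    This stands in for the XGL attention module only conceptually."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial_score = nn.Linear(dim, 1)   # one attention weight per region
        self.channel_gate = nn.Sequential(       # squeeze-and-excitation style gate
            nn.Linear(dim, dim // 4), XGELU(), nn.Linear(dim // 4, dim), nn.Sigmoid()
        )

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim)
        spatial = torch.softmax(self.spatial_score(regions), dim=1)  # (B, N, 1)
        pooled = (spatial * regions).sum(dim=1)                      # (B, D)
        channel = self.channel_gate(pooled)                          # (B, D)
        return pooled * channel                                      # attended feature


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)    # e.g. 36 detected regions, 512-d features
    block = SpatialChannelAttention(512)
    print(block(feats).shape)          # torch.Size([2, 512])
```

In the full model described in the abstract, four such attention modules sit in the image encoder and one in the Bilinear LSTM guided sentence decoder; the sketch only shows the attention computation over a single set of region features.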

Data availability

Data sharing is not applicable to this article, as no datasets were generated during the current study.


Author information

Corresponding author

Correspondence to Chhavi Dhiman.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, D., Dhiman, C. & Kumar, D. XGL-T transformer model for intelligent image captioning. Multimed Tools Appl 83, 4219–4240 (2024). https://doi.org/10.1007/s11042-023-15291-3
