XGL-T transformer model for intelligent image captioning

Multimedia Tools and Applications

Abstract

Image captioning extracts multiple semantic features from an image and integrates them into a sentence-level description. Generating accurate captions requires learning higher-order interactions among detected objects and the relationships between them, yet most existing systems consider only first-order interactions and ignore the higher-order ones. Extracting discriminative higher-order semantic visual features is particularly challenging in images densely populated with objects. In this paper, an efficient higher-order interaction learning framework is proposed for encoder-decoder based image captioning. A scaled version of the Gaussian Error Linear Unit (GELU) activation function, x-GELU, is introduced that mitigates vanishing gradients and enhances feature learning. To leverage higher-order interactions among multiple objects, an XGL Transformer (XGL-T) model is introduced that exploits both spatial and channel-wise attention by integrating four XGL attention modules in the image encoder and one in a Bilinear Long Short-Term Memory guided sentence decoder. The proposed model captures rich semantic concepts from objects, attributes, and their relationships. Extensive experiments are conducted on the publicly available MSCOCO Karpathy test split, and the best performance is observed as 81.5 BLEU@1, 67.1 BLEU@2, 51.6 BLEU@3, 39.9 BLEU@4, 134 CIDEr, 59.9 ROUGE-L, 29.8 METEOR, and 23.8 SPICE using the CIDEr-D score optimization strategy. These scores represent significant improvements over state-of-the-art results. An ablation study is also carried out to support the experimental observations.
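
The two components named above, the scaled x-GELU activation and the spatial plus channel-wise XGL attention, are defined in the body of the paper rather than in the abstract. As a rough illustration only, the PyTorch sketch below assumes x-GELU is a GELU with a single learnable scale, and substitutes a generic squeeze-and-excitation-style channel gate combined with softmax spatial pooling for the attention module; all class names and parameter choices are hypothetical and do not reproduce the authors' implementation.

```python
# Minimal, hypothetical sketch of the two ideas named in the abstract:
# (i) an "x-GELU" activation, assumed here to be GELU times a learnable scale, and
# (ii) a block combining spatial and channel-wise attention over region features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class XGELU(nn.Module):
    """Scaled GELU. The paper's exact scaling is not given in the abstract;
    a single learnable scalar alpha is assumed here for illustration."""

    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * F.gelu(x)


class SpatialChannelAttention(nn.Module):
    """Generic spatial + channel-wise attention over N region features of size D.
    This stands in for the XGL attention module only conceptually."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial_score = nn.Linear(dim, 1)   # one attention weight per region
        self.channel_gate = nn.Sequential(       # squeeze-and-excitation style gate
            nn.Linear(dim, dim // 4), XGELU(), nn.Linear(dim // 4, dim), nn.Sigmoid()
        )

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim)
        spatial = torch.softmax(self.spatial_score(regions), dim=1)  # (B, N, 1)
        pooled = (spatial * regions).sum(dim=1)                      # (B, D)
        channel = self.channel_gate(pooled)                          # (B, D)
        return pooled * channel                                      # attended feature


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)    # e.g. 36 detected regions, 512-d features
    block = SpatialChannelAttention(512)
    print(block(feats).shape)          # torch.Size([2, 512])
```

In the full model described in the abstract, four such attention modules sit in the image encoder and one in the Bilinear LSTM guided sentence decoder; the sketch only shows the attention computation over a single set of region features.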

Data availability

Data sharing is not applicable to this article, as no datasets were generated during the current study.


Author information

Corresponding author

Correspondence to Chhavi Dhiman.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, D., Dhiman, C. & Kumar, D. XGL-T transformer model for intelligent image captioning. Multimed Tools Appl 83, 4219–4240 (2024). https://doi.org/10.1007/s11042-023-15291-3
