Abstract
Image captioning extracts multiple semantic features from an image and integrates them into a sentence-level description. Generating accurate captions requires learning not only the detected objects but also the higher-order interactions and relationships among them. Most existing systems model only first-order interactions while ignoring higher-order ones, and extracting discriminative higher-order semantic visual features is especially challenging in images densely populated with objects. In this paper, an efficient higher-order interaction learning framework is proposed for encoder-decoder based image captioning. A scaled version of the Gaussian Error Linear Unit (GELU) activation function, x-GELU, is introduced that mitigates vanishing gradients and enhances feature learning. To leverage higher-order interactions among multiple objects, an efficient XGL Transformer (XGL-T) model is introduced that exploits both spatial and channel-wise attention by integrating four XGL attention modules in the image encoder and one in a Bilinear Long Short-Term Memory guided sentence decoder. The proposed model captures rich semantic concepts from objects, attributes, and their relationships. Extensive experiments are conducted on the publicly available MSCOCO Karpathy test split, and the best performance observed is 81.5 BLEU@1, 67.1 BLEU@2, 51.6 BLEU@3, 39.9 BLEU@4, 134.0 CIDEr, 59.9 ROUGE-L, 29.8 METEOR, and 23.8 SPICE using the CIDEr-D score optimization strategy. These scores represent significant improvements over state-of-the-art results. An ablation study is also carried out to support the experimental observations.
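The abstract describes x-GELU as a scaled variant of the GELU activation but does not give its closed form here. As a minimal sketch, the standard (exact) GELU is x·Φ(x), where Φ is the standard normal CDF; a hypothetical scale parameter (an assumption, not the paper's definition) illustrates what a scaled variant could look like:

```python
import math

def gelu(x: float) -> float:
    """Exact Gaussian Error Linear Unit: x * Phi(x),
    where Phi is the standard normal CDF, computed via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def x_gelu(x: float, scale: float = 1.0) -> float:
    """Hypothetical scaled GELU for illustration only;
    the paper's actual x-GELU formulation may differ."""
    return scale * gelu(x)

# GELU behaves like the identity for large positive inputs
# and decays smoothly to zero for large negative inputs.
```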
Data availability
Data sharing not applicable to this article as no datasets were generated during the current study.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Sharma, D., Dhiman, C. & Kumar, D. XGL-T transformer model for intelligent image captioning. Multimed Tools Appl 83, 4219–4240 (2024). https://doi.org/10.1007/s11042-023-15291-3