P-ISSN 0974-6846 | E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 47, Pages: 4525-4546

Review Article

A Review of Recent Advances in Visual Question Answering: Capsule Networks and Vision Transformers in Focus

Received Date: 18 October 2023, Accepted Date: 28 October 2023, Published Date: 31 December 2023

Abstract

Objectives: Multimodal deep learning, which integrates images, text, video, speech, and acoustic signals, has grown significantly. This article explores the untapped possibilities of multimodal deep learning in Visual Question Answering (VQA) and addresses a research gap in the development of effective techniques for comprehensive image feature extraction. Methods: The article provides a comprehensive overview of VQA and its associated challenges. It emphasizes the need for rich image representations in VQA, pinpoints the specific research gap in image feature extraction, and outlines the fundamental concepts of VQA, the challenges faced, and the main approaches and applications used for VQA tasks. A substantial portion of the review is devoted to recent advances in image feature extraction techniques. Findings: Most existing VQA research emphasizes accurately matching answers to given questions while overlooking the need for a comprehensive representation of the image. Such models rely primarily on question-content analysis, underemphasizing image understanding or sometimes neglecting image examination entirely. Multimodal systems also tend to neglect or overemphasize one modality, notably the visual one, which hinders genuine multimodal integration. This review finds that benchmarking of image feature extraction techniques remains limited, even though evaluating the quality of extracted image features is crucial for VQA tasks. Novelty: While many VQA studies concentrate primarily on the accuracy of answers to questions, this review emphasizes the importance of comprehensive image representation.
The paper explores recent advances in Capsule Networks (CapsNets) and Vision Transformers (ViTs) as alternatives to traditional Convolutional Neural Networks (CNNs) for the development of more effective image feature extraction techniques, which can help address the limitations of existing VQA models that focus primarily on question content analysis.
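To make the two alternatives concrete, the sketch below illustrates, in dependency-free Python, the defining operation of each: the capsule "squash" non-linearity from Sabour et al.'s dynamic-routing CapsNet, and the ViT-style splitting of an image into flattened patch tokens. This is a minimal illustration of the two ideas, not an implementation of any model surveyed in the review; function names and the toy inputs are illustrative only.

```python
import math

def squash(vec):
    """Capsule 'squash' non-linearity: rescales a vector's length into
    [0, 1) while preserving its direction, so the length can be read as
    the probability that the entity the capsule encodes is present."""
    sq_norm = sum(x * x for x in vec)
    if sq_norm == 0:
        return [0.0] * len(vec)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / norm
    return [scale * x for x in vec]

def patchify(image, patch_size):
    """ViT-style tokenisation: split a 2-D grid of pixel values into
    non-overlapping patch_size x patch_size patches and flatten each
    patch into one vector (one transformer 'token' per patch)."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch_size):
        for c in range(0, w, patch_size):
            patch = [image[r + i][c + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 "image" split into four 2x2 patch tokens.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
tokens = patchify(img, 2)   # 4 tokens, each a flat vector of 4 values
```

In a full ViT each patch token is linearly projected and processed by self-attention, so every image region can attend to every other; in a CapsNet, squashed capsule outputs are combined by iterative routing-by-agreement. Both mechanisms aim at the richer part-whole and global image representations that the review argues VQA models currently lack.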

Keywords: Feature Extraction, Visual Question Answering, Multimodal Deep Learning, Capsule Networks, Vision Transformer, Datasets


Copyright

© 2023 Prakash & Devananda. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by the Indian Society for Education and Environment (iSee).
