P-ISSN 0974-6846 | E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 47, Pages: 4525-4546

Review Article

A Review of Recent Advances in Visual Question Answering: Capsule Networks and Vision Transformers in Focus

Received Date: 18 October 2023, Accepted Date: 28 October 2023, Published Date: 31 December 2023

Abstract

Objectives: Multimodal deep learning, which integrates images, text, video, speech, and acoustic signals, has grown significantly. This article explores the untapped possibilities of multimodal deep learning in Visual Question Answering (VQA) and addresses a research gap in the development of effective techniques for comprehensive image feature extraction. Methods: The article provides a comprehensive overview of VQA and its associated challenges. It emphasizes the need for rich image representations in VQA, pinpoints the specific research gap in image feature extraction, and outlines the fundamental concepts of VQA, the challenges faced, and the main approaches and applications used for VQA tasks. A substantial portion of the review is devoted to recent advances in image feature extraction techniques. Findings: Most existing VQA research emphasizes accurately matching answers to given questions while overlooking the need for a comprehensive representation of the image. Such models rely primarily on question-content analysis, underemphasizing image understanding or sometimes neglecting image examination entirely. Multimodal systems also tend to neglect or overemphasize one modality, notably the visual one, which hinders genuine multimodal integration. This review finds that benchmarking of image feature extraction techniques remains limited, even though evaluating the quality of extracted image features is crucial for VQA tasks. Novelty: While many VQA studies concentrate primarily on the accuracy of answers to questions, this review emphasizes the importance of comprehensive image representation.
The paper explores recent advances in Capsule Networks (CapsNets) and Vision Transformers (ViTs) as alternatives to traditional Convolutional Neural Networks (CNNs) for the development of more effective image feature extraction techniques, which can help address the limitations of existing VQA models that focus primarily on question content analysis.
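To make the two alternatives concrete, the sketch below illustrates, in dependency-free Python, the defining operation of each: the capsule "squash" non-linearity from Sabour et al.'s dynamic-routing CapsNet, and the ViT-style splitting of an image into flattened patch tokens. This is a minimal illustration of the two ideas, not an implementation of any model surveyed in the review; function names and the toy inputs are illustrative only.

```python
import math

def squash(vec):
    """Capsule 'squash' non-linearity: rescales a vector's length into
    [0, 1) while preserving its direction, so the length can be read as
    the probability that the entity the capsule encodes is present."""
    sq_norm = sum(x * x for x in vec)
    if sq_norm == 0:
        return [0.0] * len(vec)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / norm
    return [scale * x for x in vec]

def patchify(image, patch_size):
    """ViT-style tokenisation: split a 2-D grid of pixel values into
    non-overlapping patch_size x patch_size patches and flatten each
    patch into one vector (one transformer 'token' per patch)."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch_size):
        for c in range(0, w, patch_size):
            patch = [image[r + i][c + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 "image" split into four 2x2 patch tokens.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
tokens = patchify(img, 2)   # 4 tokens, each a flat vector of 4 values
```

In a full ViT each patch token is linearly projected and processed by self-attention, so every image region can attend to every other; in a CapsNet, squashed capsule outputs are combined by iterative routing-by-agreement. Both mechanisms aim at the richer part-whole and global image representations that the review argues VQA models currently lack.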

Keywords: Feature Extraction, Visual Question Answering, Multimodal Deep Learning, Capsule Networks, Vision Transformer, Datasets


Copyright

© 2023 Prakash & Devananda. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by the Indian Society for Education and Environment (iSee).
