Multimodal Cross-guided Attention Networks for Visual Question Answering
- DOI
- 10.2991/cmsa-18.2018.80
- Keywords
- visual question answering; attention; cross-guided; gated activation
- Abstract
Visual Question Answering (VQA) is an attractive topic combining computer vision with natural language processing. It is more challenging than text-based question answering because of its multimodal nature. The VQA reasoning process requires both effective semantic embedding and fine-grained visual comprehension. Existing approaches predominantly infer answers from visual spatial information, while neglecting important semantic information in questions and the guidance information between images and questions. To remedy this, we imitate the human mechanism of cross-reasoning about visual and textual information and propose a multimodal cross-guided attention network (MCAN) for VQA. MCAN employs a cross-guided joint learning strategy with a gated activation learning method, simultaneously capturing both rich visual spatial information and significant semantic information. We evaluate the proposed model on two public datasets: the VQA dataset and the COCO-QA dataset. Extensive experiments show state-of-the-art performance on both datasets.
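The abstract only sketches the architecture, so the snippet below is a minimal, illustrative interpretation of cross-guided attention combined with a gated (tanh/sigmoid) activation: the question guides attention over image regions, the image guides attention over question words, and the two attended features are fused through gated units. All layer sizes, module names, and the fusion step are assumptions made for illustration and are not taken from the paper itself.

```python
# Hypothetical sketch of cross-guided attention with gated activation.
# Not the authors' MCAN implementation; dimensions and fusion are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTanh(nn.Module):
    """Gated activation unit: y = tanh(W x) * sigmoid(G x)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))

class CrossGuidedAttention(nn.Module):
    """Question-guided attention over image regions and image-guided
    attention over question words, fused by gated activation units."""
    def __init__(self, v_dim, q_dim, hid_dim):
        super().__init__()
        self.v_att = nn.Linear(v_dim + q_dim, 1)  # scores regions given the question
        self.q_att = nn.Linear(q_dim + v_dim, 1)  # scores words given the image
        self.v_gate = GatedTanh(v_dim, hid_dim)
        self.q_gate = GatedTanh(q_dim, hid_dim)

    def forward(self, v, q):
        # v: (B, R, v_dim) image region features; q: (B, T, q_dim) word features
        q_summary = q.mean(dim=1, keepdim=True)   # coarse question context
        v_summary = v.mean(dim=1, keepdim=True)   # coarse image context
        a_v = F.softmax(self.v_att(torch.cat(
            [v, q_summary.expand(-1, v.size(1), -1)], dim=-1)), dim=1)
        a_q = F.softmax(self.q_att(torch.cat(
            [q, v_summary.expand(-1, q.size(1), -1)], dim=-1)), dim=1)
        v_att = (a_v * v).sum(dim=1)              # attended visual feature
        q_att = (a_q * q).sum(dim=1)              # attended textual feature
        return self.v_gate(v_att) * self.q_gate(q_att)  # joint embedding

# Example usage with assumed feature sizes:
#   module = CrossGuidedAttention(v_dim=2048, q_dim=1024, hid_dim=512)
#   joint = module(image_region_feats, question_word_feats)
```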
- Copyright
- © 2018, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - CONF
AU  - Haibin Liu
AU  - Shengrong Gong
AU  - Yi Ji
AU  - Jianyu Yang
AU  - Tengfei Xing
AU  - Chunping Liu
PY  - 2018/04
DA  - 2018/04
TI  - Multimodal Cross-guided Attention Networks for Visual Question Answering
BT  - Proceedings of the 2018 International Conference on Computer Modeling, Simulation and Algorithm (CMSA 2018)
PB  - Atlantis Press
SP  - 347
EP  - 353
SN  - 1951-6851
UR  - https://doi.org/10.2991/cmsa-18.2018.80
DO  - 10.2991/cmsa-18.2018.80
ID  - Liu2018/04
ER  -