Abstract: Vision-and-language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Meanwhile, Visual Question Answering (VQA) is even more challenging since ...