MQA: Answering the Question via Robotic Manipulation

RSS 2021

All authors are from Tsinghua University; equal contribution.

Abstract

In this paper, we propose a novel task, Manipulation Question Answering (MQA), in which a robot performs manipulation actions that change the environment in order to answer a given question. To solve this problem, we propose a framework consisting of a QA module and a manipulation module. For the QA module, we adopt a method from the Visual Question Answering (VQA) task. For the manipulation module, a Deep Q-Network (DQN) model is designed to generate manipulation actions for the robot to interact with the environment. We consider the situation where the robot continuously manipulates objects inside a bin until the answer to the question is found. In addition, a novel dataset containing a variety of object models, scenarios, and corresponding question-answer pairs is established in a simulation environment. Extensive experiments have been conducted to validate the effectiveness of the proposed framework.


MQA Dataset

As our task is newly proposed, there is no suitable off-the-shelf dataset for experiments. Therefore, we establish our own dataset for the MQA task. The MQA dataset is built in the V-REP simulation environment and is composed of a variety of 3D object models, different scenes, and corresponding question-answer pairs for every scene.
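Since the release format is not specified on this page, the Python dict below is only a hypothetical sketch of what one scene record could look like. All field names and file paths are assumptions; the two question-answer pairs are taken from the examples shown later on this page.

# Hypothetical layout of one MQA dataset record; the actual field names
# and file formats in the released dataset may differ.
sample_record = {
    "scene_id": "scene_0001",                       # assumed identifier
    "object_models": ["keyboard", "keyboard", "keyboard", "key", "box"],
    "rgbd_initial": "scene_0001/initial_rgbd.png",  # V-REP rendering (assumed path)
    "qa_pairs": [
        {"question": "How many keyboards are there in the bin?", "answer": "3"},
        {"question": "Is there a key in the bin?", "answer": "yes"},
    ],
}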



Algorithm

The proposed MQA algorithm is composed of two parts: a manipulation module and a QA module. An overview of the workflow is shown in Fig. 6. When a new MQA task starts, the manipulation module is activated first. It takes the RGB-D images of the scene and the question as input and outputs manipulation actions. The agent explores the environment until the question can be answered; the manipulation module decides when to stop exploring. The QA module then gives an answer based on the initial scene, the final scene, and the question, as sketched below.
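The following is a minimal Python sketch of this explore-then-answer loop. The class names, the select_action/answer interfaces, and the stub behaviors are all hypothetical placeholders standing in for the DQN policy and the VQA-style network; they are not the authors' actual API.

import random


class DummyBinEnv:
    """Placeholder environment; observe() would return an RGB-D image in practice."""

    def observe(self):
        return [[0.0] * 4 for _ in range(4)]  # stand-in for an RGB-D array

    def step(self, action):
        # A real environment would execute the push/suck in V-REP here.
        return self.observe()


class ManipulationModule:
    """Stand-in for the DQN policy; select_action() is an assumed interface."""

    def select_action(self, rgbd, question):
        if random.random() < 0.2:
            return None  # the module decides the question is now answerable
        return {"type": random.choice(["push", "suck"]),
                "pixel": (random.randrange(224), random.randrange(224))}


class QAModule:
    """Stand-in for the VQA-style answering network."""

    def answer(self, initial_rgbd, final_rgbd, question):
        return "3"  # a real module would predict from both scenes and the question


def run_mqa_episode(env, manip, qa, question, max_steps=20):
    initial_rgbd = env.observe()
    rgbd = initial_rgbd
    for _ in range(max_steps):
        action = manip.select_action(rgbd, question)
        if action is None:  # manipulation module stops exploring
            break
        rgbd = env.step(action)  # execute the action and re-observe the bin
    return qa.answer(initial_rgbd, rgbd, question)


print(run_mqa_episode(DummyBinEnv(), ManipulationModule(), QAModule(),
                      "How many keyboards are there in the bin?"))

Note that the answer is produced from the initial and final scenes together with the question, which is why the loop keeps the first observation around rather than only the latest one.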



Simulation experiments

We test our QA system in new scenes. The action model first generates manipulation actions; when it concludes that no further action is needed, the QA module gives the answer.


Q: How many keyboards are there in the bin? A: 3

Q: Is there a key in the bin? A: yes

Real experiments

We used a UR5 robot, a Kinect camera, and real objects similar to those in the dataset to build the experimental scene, and transferred our model from simulation to the real world directly. This is possible because the output of the model is a path independent of dynamics, and the images in simulation are similar to the real images. The robot first suspects there may be a key under the box, so it sucks the box away and finds a key. It then suspects there may be a key under the milk carton, so it pushes the milk carton away and finds no key. After two actions, the robot concludes that no further action is needed and outputs the answer 1.
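As an illustration of the dynamics-independent path claim, the sketch below back-projects an image-space action into a Cartesian point using the depth map and standard pinhole-camera geometry. The intrinsic values are generic Kinect-like placeholders, not the calibrated parameters used in the experiments, and the hand-eye transform to the UR5 base frame is left as a comment.

def pixel_to_camera_point(u, v, depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project pixel (u, v) at depth depth_m (metres) into the camera frame.

    The intrinsics above are placeholder values; a real setup would use
    calibrated intrinsics and then apply the hand-eye transform to obtain
    a waypoint in the UR5 base frame.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Example: a suck action chosen at pixel (160, 120) where the depth map reads 0.62 m.
print(pixel_to_camera_point(160, 120, 0.62))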


BibTeX


      @inproceedings{deng2020mqa, 
        author    = {Deng, Yuhong and Guo, Di and Guo, Xiaofeng and Zhang, Naifu and Liu, Huaping and Sun, Fuchun}, 
        title     = {MQA: Answering the Question via Robotic Manipulation}, 
        booktitle = {Proceedings of Robotics: Science and Systems}, 
        year      = {2021},
      }