SimVQA: Exploring Simulated Environments
for Visual Question Answering

Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez
Rice University, University of Virginia, MIT-IBM Watson AI Lab
 CVPR 2022
Paper Dataset Code

We explore using synthetic computer-generated data to fully control the visual and language space, allowing us to provide more diverse scenarios for VQA. By exploiting 3D and physics simulation platforms, we provide a pipeline to generate synthetic data to expand and replace type-specific questions and answers without risking the exposure of sensitive or personal data that might be present in real images. We quantify the effect of synthetic data on real-world VQA benchmarks and the extent to which it produces results that generalize to real data.

[Figure: synthetic scenes rendered from different camera positions, with segmentation masks and image/question/answer samples.]

F-SWAP to Leverage the Synthetic Data
We also propose Feature Swapping (F-SWAP), a domain alignment method in which we randomly switch object-level features during training to make a VQA model more domain invariant. The motivation for feature swapping stems from the observation that all three datasets contain similar types of objects and configurations, even though the appearance of those objects may differ. Our goal with feature swapping is thus to randomly replace, during training, the object-level features of some objects with the features of an equivalent object from another domain.
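As a rough illustration of this idea, the sketch below swaps region-level features with features of the same object category drawn from a bank built from the other domain. Everything here (the function name feature_swap, the bank structure, and the swap_prob parameter) is an illustrative assumption, not the exact implementation from the paper.

```python
import random
import torch

def feature_swap(features, labels, bank, swap_prob=0.3):
    """Randomly replace object-level features with features of an
    equivalent object from another domain.

    features:  (num_objects, dim) tensor of region features for one image
    labels:    list of object-category ids, one per region
    bank:      dict mapping category id -> list of feature tensors
               collected from images of the other domain
    swap_prob: probability of swapping each object's feature
    """
    out = features.clone()
    for i, cat in enumerate(labels):
        candidates = bank.get(cat)
        if candidates and random.random() < swap_prob:
            # Substitute this object's feature with one from the same
            # category in the other domain, keeping the rest intact.
            out[i] = random.choice(candidates)
    return out
```

In such a setup, the bank would be populated (and optionally refreshed each epoch) with object features extracted from the other domain's images, so the VQA model sees mixed-domain feature sets for otherwise identical scenes.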
SimVQA Explorer*. Enter the items you want to search for by selecting any of the icons shown below. You can remove them by clicking the close icon or by pressing your delete key.

*Image samples from a subset of our TDW-VQA dataset.
**Search criteria for objects are combined with an inclusive OR, and results are returned in FIFO order.

The database may take a few seconds to load on your first search.