SimVQA: Exploring Simulated Environments
for Visual Question Answering

Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez
Rice University, University of Virginia, MIT-IBM Watson AI Lab
CVPR 2022
Paper · Dataset · Code

We explore using synthetic computer-generated data to fully control the visual and language space, allowing us to provide more diverse scenarios for VQA. By exploiting 3D and physics simulation platforms, we provide a pipeline that generates synthetic data to expand and replace type-specific questions and answers without risking the exposure of sensitive or personal data that might be present in real images. We quantify the effect of synthetic data on real-world VQA benchmarks and the extent to which it produces results that generalize to real data.

Figure: different camera positions, segmentation masks, and image/question/answer samples.

F-SWAP to Leverage the Synthetic Data
We also propose Feature Swapping (F-SWAP), a domain alignment method in which we randomly switch object-level features during training to make a VQA model more domain invariant. Feature swapping is motivated by the observation that all three datasets contain similar types of objects and configurations, even though the appearance of those objects may differ. Our goal with feature swapping is thus to randomly replace, during training, the object-level features of some objects with the features of an equivalent object from another domain.
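The swapping step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the cross-domain feature bank, and the swap probability are assumptions introduced here, and the feature vectors are plain Python lists standing in for detector features.

```python
import random


def f_swap(object_features, object_classes, cross_domain_bank,
           swap_prob=0.3, rng=None):
    """Sketch of F-SWAP-style feature swapping (names are hypothetical).

    For each detected object, with probability `swap_prob`, replace its
    feature vector with the features of an equivalent-class object drawn
    from another domain; objects with no cross-domain match are kept as-is.
    """
    rng = rng or random.Random()
    swapped = []
    for feat, cls in zip(object_features, object_classes):
        candidates = cross_domain_bank.get(cls)
        if candidates and rng.random() < swap_prob:
            # Swap in an equivalent object's features from the other domain.
            swapped.append(rng.choice(candidates))
        else:
            # No equivalent object available (or swap not sampled): keep original.
            swapped.append(feat)
    return swapped


# Toy usage: "cup" has a synthetic-domain counterpart, "table" does not.
bank = {"cup": [[9.0, 9.0]]}
feats = [[1.0, 2.0], [3.0, 4.0]]
classes = ["cup", "table"]
out = f_swap(feats, classes, bank, swap_prob=1.0, rng=random.Random(0))
```

In a real pipeline the swap would operate on region features from an object detector, keyed by predicted class, so the model sees the same scene configuration with appearance statistics from either domain.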
SimVQA Explorer*. Select any icon shown below to add an item to your search. Remove items by clicking the close icon or pressing the Delete key.

*Image samples from a subset of our TDW-VQA dataset.
**Search criteria for objects use an inclusive OR and are applied in FIFO order.

It might take a few seconds to load the database for your first search.