Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations, in particular whether these models can distinguish verbs or rely only on the nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs covering 447 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate the pretrained models in a zero-shot setting. We find that the pretrained models fail more often in situations that require verb understanding than in those that depend on other parts of speech. We also investigate which categories of verbs are particularly challenging for these models.
Grounding language to vision is a fundamental problem for many real-world AI systems, such as those that retrieve images or generate descriptions for the visually impaired. Success on these tasks requires models to relate different aspects of language, such as objects and verbs, to images. For example, to distinguish between the two images in the middle column below, models must differentiate between the verbs “catch” and “kick.” Verb understanding is particularly difficult as it requires recognising not only objects, but also how different objects in an image relate to each other. To help address this difficulty, we introduce the SVO-Probes dataset and use it to probe language and vision models for verb understanding.
In particular, we consider multimodal transformer models (e.g., Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020), which have shown success on a variety of language and vision tasks. However, despite their strong performance on benchmarks, it is not clear whether these models have fine-grained multimodal understanding. Prior work shows that language and vision models can succeed at benchmarks without multimodal understanding: for example, by answering questions about images based only on language priors (Agrawal et al., 2018) or by “hallucinating” objects that are not in the image when captioning images (Rohrbach et al., 2018). To anticipate model limitations, work like Shekhar et al. proposes specialised evaluations to probe models systematically for language understanding. However, prior probe sets are limited in the number of objects and verbs they cover. We developed SVO-Probes to better evaluate potential limitations in verb understanding in current models.
SVO-Probes includes 48,000 image-sentence pairs and tests understanding for more than 400 verbs. Each sentence can be broken into a <Subject, Verb, Object> triplet (or SVO triplet) and paired with positive and negative example images. The negative examples differ in only one way: the Subject, Verb, or Object is changed. The figure above shows negative examples in which the subject (left), verb (middle), or object (right) does not match the image. This task formulation makes it possible to isolate which parts of the sentence a model has the most trouble with. It also makes SVO-Probes more challenging than standard image retrieval tasks, where negative examples are often completely unrelated to the query sentence.
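The one-slot difference between a sentence's triplet and its negative image's triplet can be made concrete with a minimal sketch. The class name `SVO`, the function `changed_slot`, and the example subjects and objects are hypothetical choices for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SVO:
    """A <Subject, Verb, Object> triplet (hypothetical representation)."""
    subject: str
    verb: str
    obj: str


def changed_slot(positive: SVO, negative: SVO) -> Optional[str]:
    """Return the single slot in which two triplets differ.

    Returns None when the triplets are identical or differ in more
    than one slot, i.e. when the pair is not a valid one-slot negative.
    """
    pairs = (
        ("subject", positive.subject, negative.subject),
        ("verb", positive.verb, negative.verb),
        ("object", positive.obj, negative.obj),
    )
    diffs = [name for name, a, b in pairs if a != b]
    return diffs[0] if len(diffs) == 1 else None


# Illustrative triplets only; the verbs echo the "catch"/"kick" example.
sentence_triplet = SVO("girl", "kick", "ball")
negative_image_triplet = SVO("girl", "catch", "ball")
print(changed_slot(sentence_triplet, negative_image_triplet))
```

Labelling each negative with the slot that changed is what allows errors to be attributed to subjects, verbs, or objects separately.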
To create SVO-Probes, we query an image search with SVO triplets from a common training dataset, Conceptual Captions (Sharma et al., 2018). Because image search can be noisy, a preliminary annotation step filters the retrieved images to ensure we have a clean set of image-SVO pairs. Since transformers are trained on image-sentence pairs, not image-SVO pairs, we need image-sentence pairs to probe our model. To collect sentences which describe each image, annotators write a short sentence for each image that includes the SVO triplet. For example, given the SVO triplet <animal, lie, grass>, an annotator could write the sentence “An animal lays in the grass.” We then use the SVO annotations to pair each sentence with a negative image, and ask annotators to verify negatives in a final annotation step. See the figure below for details.
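One plausible way to use SVO annotations for pairing a sentence with candidate negative images is sketched below. This is an assumption about how such a step could work, not the procedure actually used; the function name `find_negatives`, the image identifiers, and all triplets other than <animal, lie, grass> are made up for illustration.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, verb, object)
SLOTS = ("subject", "verb", "object")


def find_negatives(query: Triplet, pool: Dict[str, Triplet]) -> List[Tuple[str, str]]:
    """Find images whose annotated triplet differs from `query` in exactly one slot.

    `pool` maps a (hypothetical) image id to its annotated SVO triplet.
    Returns (image_id, changed_slot) pairs as candidates for human verification.
    """
    negatives = []
    for image_id, triplet in pool.items():
        diffs = [slot for slot, q, t in zip(SLOTS, query, triplet) if q != t]
        if len(diffs) == 1:
            negatives.append((image_id, diffs[0]))
    return negatives


# Toy annotation pool; ids and triplets are invented for this example.
pool = {
    "img_a": ("animal", "lie", "grass"),   # same triplet: a positive, not a negative
    "img_b": ("animal", "run", "grass"),   # verb differs
    "img_c": ("dog", "run", "beach"),      # everything differs: too easy
}
print(find_negatives(("animal", "lie", "grass"), pool))
```

Candidates found this way would still need the final human verification step, since an image annotated with a different triplet may nonetheless match the sentence.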
We examine whether multimodal transformers can accurately classify examples as positive or negative. The bar chart below illustrates our results. Our dataset is challenging: our standard multimodal transformer model achieves 64.3% accuracy overall (chance is 50%). Whereas accuracy is 67.0% and 73.4% on subjects and objects respectively, performance falls to 60.8% on verbs. This result shows that verb recognition is indeed challenging for vision and language models.
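A minimal sketch of how accuracy might be broken down by subject, verb, and object follows. It assumes the model emits an image-sentence match score in [0, 1] that is thresholded at 0.5; the function name `accuracy_by_group`, the threshold, and the toy scores are assumptions for illustration and do not reproduce the reported numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each example: (group, is_match, score), where `group` names the slot being
# tested, `is_match` is the gold label, and `score` is the model's match score.
Example = Tuple[str, bool, float]


def accuracy_by_group(examples: Iterable[Example], threshold: float = 0.5) -> Dict[str, float]:
    """Compute binary classification accuracy separately for each group."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, is_match, score in examples:
        predicted_match = score >= threshold
        total[group] += 1
        correct[group] += int(predicted_match == is_match)
    return {group: correct[group] / total[group] for group in total}


# Made-up scores, purely to show the bookkeeping.
toy_examples = [
    ("verb", True, 0.9),     # correct
    ("verb", False, 0.7),    # wrong: negative image scored as a match
    ("object", True, 0.8),   # correct
    ("object", False, 0.2),  # correct
]
print(accuracy_by_group(toy_examples))
```

With balanced positive and negative examples, a model that always predicts the same label scores 50%, which is the chance level quoted above.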
We also explore which model architectures perform best on our dataset. Surprisingly, models with weaker image modeling perform better than the standard transformer model. One hypothesis is that our standard model (with stronger image modeling ability) overfits the training set. As both these models perform worse on other language and vision tasks, our targeted probe task illuminates model weaknesses that are not observed on other benchmarks.
Overall, we find that despite impressive performance on benchmarks, multimodal transformers still struggle with fine-grained understanding, especially fine-grained verb understanding. We hope SVO-Probes can help drive exploration of verb understanding in language and vision models and inspire more targeted probe datasets. Both our SVO-Probes benchmark and models are available on GitHub.