Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval


Yale Song (Microsoft Research) and Mohammad Soleymani (USC ICT)
CVPR 2019

[arXiv], [GitHub]

tl;dr
  • Most existing instance embedding methods are injective, mapping an instance to a single point in the embedding space. Unfortunately, such methods cannot effectively handle polysemous instances with multiple meanings; at best, they find an average representation of the different meanings.
  • We introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance. Each representation encodes global context and different local information, combined via residual learning.
  • For cross-modal retrieval, we couple two PIE-Nets and optimize them jointly in a multiple instance learning (MIL) framework. We call our model Polysemous Visual-Semantic Embedding (PVSE).
  • We demonstrate PVSE on image-text and video-text cross-retrieval scenarios. For video-text cross-retrieval, we introduce a new dataset called My Reaction When (MRW).
  • We show SOTA results on MS-COCO (5K test set), TGIF, and our MRW datasets.

Polysemous Visual-Semantic Embedding

  • Feature extractor: We use well-known encoders for the different modalities (e.g., CNN, RNN, GloVe). We only require that each encoder compute both local and global representations of an instance. Local representations capture, e.g., spatial regions in an image, temporal slices (frames) in a video, or words in a sentence.
  • PIE-Net: The local and global representations are passed to the PIE-Net, which computes K embeddings that focus on different local parts of an instance. The PIE-Net uses multi-head self-attention to extract K locally-guided representations and combines each with the global representation via residual learning (see the sketch after this list). Each modality gets its own PIE-Net; parameters are not shared.
  • Learning objective: We train the model in the multiple instance learning framework. The rationale is that when a pair of instances has a weak or ambiguous association, enforcing a one-to-one alignment is too restrictive. The MIL objective instead allows a loose alignment by requiring only that at least one pair among the K x K combinations of instance embeddings be aligned (a sketch of such a loss follows the PIE-Net sketch below).
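
To make the PIE-Net bullet concrete, below is a minimal PyTorch sketch of a PIE-Net-style block. It assumes local features have already been extracted (e.g., flattened CNN feature-map regions for an image, per-frame features for a video, or per-word RNN states for a sentence) together with a single global feature; the layer sizes, the tanh/softmax attention parameterization, and all names are illustrative rather than the exact architecture from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIENetSketch(nn.Module):
    """Sketch of a PIE-Net-style module: K diverse embeddings per instance."""

    def __init__(self, d_local, d_global, d_embed, num_embeds):
        super().__init__()
        self.fc_local = nn.Linear(d_local, d_embed)
        self.fc_global = nn.Linear(d_global, d_embed)
        # Multi-head self-attention over local features: one attention map
        # per output embedding (illustrative tanh + softmax parameterization).
        self.att_hidden = nn.Linear(d_local, d_embed)
        self.att_heads = nn.Linear(d_embed, num_embeds)
        self.norm = nn.LayerNorm(d_embed)

    def forward(self, local_feats, global_feat):
        # local_feats: (B, n_local, d_local), global_feat: (B, d_global)
        att = torch.softmax(
            self.att_heads(torch.tanh(self.att_hidden(local_feats))), dim=1
        )  # (B, n_local, K): K attention maps over the local features
        # K locally-guided representations: weighted sums of transformed local features
        local_guided = torch.bmm(att.transpose(1, 2), self.fc_local(local_feats))  # (B, K, d_embed)
        # Residual combination with the (broadcast) global feature, then L2-normalize
        fused = self.norm(self.fc_global(global_feat).unsqueeze(1) + local_guided)
        return F.normalize(fused, dim=-1)  # (B, K, d_embed)
```

As in the bullet above, one such module would be instantiated per modality, with no parameter sharing between the two.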
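And a sketch of an MIL-style matching objective over the two sets of K embeddings. The hardest-negative triplet ranking form below is one common choice and is meant only to illustrate the "at least one of the K x K pairs must align" idea; the full training objective in the paper also includes additional regularization terms that are omitted here.

```python
import torch

def mil_triplet_loss(img_embs, txt_embs, margin=0.1):
    """Hardest-negative triplet ranking over K x K embedding pairs (a sketch).

    img_embs, txt_embs: (B, K, d), L2-normalized; row i of each is a matching pair.
    """
    B = img_embs.size(0)
    # Similarity between every image and every sentence in the batch: take the
    # max over the K x K embedding combinations -- the multiple-instance step,
    # which only asks that at least one embedding pair aligns well.
    sims = torch.einsum('ikd,jld->ijkl', img_embs, txt_embs).flatten(2).max(dim=2).values  # (B, B)
    pos = sims.diag().view(B, 1)
    eye = torch.eye(B, dtype=torch.bool, device=sims.device)
    # Hinge costs against the hardest negative in both retrieval directions.
    cost_i2t = (margin + sims - pos).clamp(min=0).masked_fill(eye, 0)
    cost_t2i = (margin + sims - pos.t()).clamp(min=0).masked_fill(eye, 0)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()
```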

My Reaction When (MRW) Dataset

The My Reaction When (MRW) dataset contains 50,107 video-sentence pairs crawled from social media, where videos display physical or emotional reactions to the situations described in the sentences. Posts of this kind are common on the subreddit /r/reactiongifs; some representative examples are shown below.

  • (a) Physical reaction: "MRW a witty comment I wanted to make was already said"
  • (b) Emotional reaction: "MFW I see a cute girl on Facebook change her status to single"
  • (c) Animal reaction: "MFW I cant remember if I've locked my front door"
  • (d) Lexical reaction (caption): "MRW a family member askes me why his computer isn't working"

We split the data into train (44,107 pairs), validation (1,000 pairs), and test (5,000 pairs) sets. The dataset and scripts to prepare the data will become available on our GitHub page soon (pending approval). In the meantime, you can download the dataset here (metadata only).


Qualitative Results: Image-to-Sentence Retrieval on MS-COCO

Below, for each query image we show three visual attention maps and their top-ranked text retrieval results, along with their ranks and cosine similarity scores (green: correct, red: incorrect). Words in each sentence are color-coded by textual attention intensity, using the color map shown at the top.
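
A note on how ranks and similarity scores like the ones in these figures can be computed: since every query and every gallery item has K embeddings, a single matching score must be reduced from the K x K cosine similarities. A minimal sketch follows, taking the max over the combinations to mirror the MIL objective above; the function name and tensor shapes are illustrative.

```python
import torch

def retrieval_scores(query_embs, gallery_embs):
    """Rank gallery items for one query by cross-modal similarity.

    query_embs:   (K, d)    L2-normalized embeddings of the query instance
    gallery_embs: (M, K, d) L2-normalized embeddings of M gallery instances
    Returns the scores (M,) and the gallery indices sorted best-first.
    """
    # Cosine similarity of every (query embedding, gallery embedding) pair,
    # reduced by a max over the K x K combinations per gallery item.
    sims = torch.einsum('kd,mld->mkl', query_embs, gallery_embs)  # (M, K, K)
    scores = sims.flatten(1).max(dim=1).values                    # (M,)
    return scores, scores.argsort(descending=True)
```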


Qualitative Results: Video-to-Sentence Retrieval on TGIF

For each query video we show three visual attention maps and their top-ranked text retrieval results, along with their ranks and cosine similarity scores (green: correct, red: incorrect). Words in each sentence are color-coded by textual attention intensity.


Qualitative Results: Sentence-to-Video (GIF) Retrieval on MRW

For each query sentence we show the top five retrieved videos and their cosine similarity scores. Quiz: we encourage readers to find the best-matching video in each set of results; see our paper (page 11) for the answers.

(a) MRW I accidentally close the Reddit tab when I am 20 pages deep
Rank 1 (0.76) Rank 2 (0.74) Rank 3 (0.72) Rank 4 (0.72) Rank 5 (0.70)
(b) MRW there is food in the house and cannot eat it
Rank 1 (0.87) Rank 2 (0.86) Rank 3 (0.84) Rank 4 (0.83) Rank 5 (0.82)
(c) My reaction when I hear a song on the radio that I absolutely hate
Rank 1 (0.76) Rank 2 (0.74) Rank 3 (0.72) Rank 4 (0.72) Rank 5 (0.70)
(d) HIFW I am drunk and singing at a Karaoke ba
Rank 1 (0.78) Rank 2 (0.75) Rank 3 (0.74) Rank 4 (0.73) Rank 5 (0.73)
(e) MFW I post my first original content to imgur and it gets the shit down voted out of it
Rank 1 (0.94) Rank 2 (0.87) Rank 3 (0.86) Rank 4 (0.85) Rank 5 (0.77)
(f) MRW the car in front of me will not go when it is their turn
Rank 1 (0.84) Rank 2 (0.83) Rank 3 (0.80) Rank 4 (0.77) Rank 5 (0.75)
(g) MRW I get drunk and challenge my SO to a dance off
Rank 1 (0.91) Rank 2 (0.90) Rank 3 (0.89) Rank 4 (0.88) Rank 5 (0.88)

Experimental Results

(1) Image-Sentence Cross-Retrieval Results on MS-COCO
(2) Video-Sentence Cross-Retrieval Results on TGIF
(3) Video-Sentence Cross-Retrieval Results on MRW

Note: Since our CVPR 2019 camera-ready, we have improved the performance on TGIF and MRW by modifying the data augmentation logic for video data. The new results are reflected in our arXiv version.