Summary by ngthanhtinqn
This paper designs a generalized multimodal architecture that can be applied across a wide range of vision-and-language tasks.
Concretely, the model is pre-trained on four main tasks (Masked Language Modeling (MLM), Image-Text Matching (ITM), Word-Region Alignment (WRA), and Masked Region Modeling (MRM)) and evaluated on various downstream tasks (VQA, VCR, NLVR²).
https://i.imgur.com/IG7suDj.png
As shown in Fig. 1, UNITER first encodes image regions (visual features and bounding-box features) and textual words (tokens and positions) into a common embedding space with an Image Embedder and a Text Embedder.
Then, a Transformer module is applied to learn generalizable contextualized embeddings for each region and each word.
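To make this concrete, here is a minimal sketch of the two-embedder design feeding a shared Transformer encoder. The feature dimensions, module names, and exact fusion recipe are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class ImageEmbedder(nn.Module):
    """Projects region visual features and bounding-box features into the joint space."""
    def __init__(self, visual_dim=2048, box_dim=7, hidden=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.box_proj = nn.Linear(box_dim, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, visual_feats, box_feats):
        # Sum the two projections, then normalize (one common recipe).
        return self.norm(self.visual_proj(visual_feats) + self.box_proj(box_feats))

class TextEmbedder(nn.Module):
    """Sums token and position embeddings, BERT-style."""
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.tok(token_ids) + self.pos(positions))

class UNITERLike(nn.Module):
    """Concatenates region and word embeddings and runs a shared Transformer encoder."""
    def __init__(self, hidden=768, layers=12, heads=12):
        super().__init__()
        self.img_emb = ImageEmbedder(hidden=hidden)
        self.txt_emb = TextEmbedder(hidden=hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, visual_feats, box_feats, token_ids):
        regions = self.img_emb(visual_feats, box_feats)   # (B, R, H)
        words = self.txt_emb(token_ids)                   # (B, T, H)
        joint = torch.cat([regions, words], dim=1)        # one sequence over both modalities
        return self.encoder(joint)                        # contextualized embedding per region/word
```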
The contributions are two-fold:
(1) Masked language/region modeling is conditioned on a full observation of the other modality (image or text), rather than applying joint random masking to both modalities (a toy sketch of this conditional masking appears after the WRA discussion below).
(2) A novel WRA pre-training task is introduced, using Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions.
Intuitively, OT-based learning aims to optimize distribution matching by minimizing the cost of transporting one distribution to another. In this context, the goal is to minimize the cost of transporting the embeddings from image regions to words in a sentence (and vice versa), thus optimizing toward better cross-modal alignment.
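As a rough illustration of this idea: the paper approximates the transport plan with the IPOT algorithm, whereas the sketch below uses a generic entropic Sinkhorn iteration over a cosine-distance cost matrix, so the function name, uniform marginals, and hyperparameters are assumptions for illustration only:

```python
import torch

def wra_ot_cost(word_emb, region_emb, eps=0.1, n_iters=50):
    """Approximate the Word-Region Alignment cost as an optimal-transport distance.

    word_emb:   (T, H) contextualized word embeddings
    region_emb: (R, H) contextualized region embeddings
    """
    # Cost matrix: cosine distance between every word and every region.
    w = torch.nn.functional.normalize(word_emb, dim=-1)
    r = torch.nn.functional.normalize(region_emb, dim=-1)
    cost = 1.0 - w @ r.t()                              # (T, R)

    # Uniform marginals over words and regions.
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)

    # Entropic-regularized Sinkhorn iterations to estimate the transport plan.
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)            # approximate transport plan (T, R)

    # The WRA loss is the total transport cost <plan, cost>.
    return (plan * cost).sum()
```

During pre-training this scalar is added to the overall loss, so matched image-text pairs are pushed toward cheap transport between their word and region embeddings, i.e. tighter fine-grained alignment.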
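And to make contribution (1) concrete, here is a toy sketch of conditional masking, where each training instance masks tokens in only one modality and leaves the other fully observed. The helper name, masking probability, and the 50/50 choice between modalities are illustrative assumptions, not the paper's exact task-sampling scheme:

```python
import random
import torch

def conditional_mask(token_ids, region_feats, mask_token_id=103, p=0.15):
    """Mask either words or regions for one training instance, never both.

    token_ids:    (T,) word ids;  region_feats: (R, H) region features.
    Returns masked inputs plus boolean masks marking the positions to reconstruct.
    """
    word_mask = torch.zeros(token_ids.size(0), dtype=torch.bool)
    region_mask = torch.zeros(region_feats.size(0), dtype=torch.bool)

    if random.random() < 0.5:
        # Masked Language Modeling: mask some words, keep all regions visible.
        word_mask = torch.rand(token_ids.size(0)) < p
        token_ids = token_ids.masked_fill(word_mask, mask_token_id)
    else:
        # Masked Region Modeling: zero out some region features, keep all words visible.
        region_mask = torch.rand(region_feats.size(0)) < p
        region_feats = region_feats.masked_fill(region_mask.unsqueeze(-1), 0.0)

    return token_ids, region_feats, word_mask, region_mask
```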
