author: | TongruiLi |
score: | 4 / 10
TODO: Summarize the paper:
- What is the core idea?
This paper presents ConVIRT, an unsupervised way to learn visual representations of medical images with no expert input required beyond the text paired with each image. It essentially trains an image backbone that can be transferred to different downstream tasks.
- How is it realized (technically)?
The paper assumes that the inputs are given as pairs \((x_v, x_u)\) of an image and its text. The goal is to learn an encoder that maps the image to a latent-space representation. They propose the following (a code sketch follows this list):
- use stochastic transformations to turn \(x_v\), \(x_u\) into \(\tilde{x}_v\), \(\tilde{x}_u\) (image augmentation for the image; for the text, a sentence sampled at random from the report)
- use encoders \(f_v\), \(f_u\) to map \(\tilde{x}_v\), \(\tilde{x}_u\) to latent representations \(h_v\), \(h_u\). In the paper, they used a ResNet-50 image encoder and a BERT text encoder.
- use projection functions \(g_v\), \(g_u\) to project \(h_v\), \(h_u\) into the same-dimensional space for contrastive learning. A two-layer MLP is used for this.
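Putting the pipeline together, here is a minimal sketch of the two-tower setup, not the authors' code: it assumes torchvision's ResNet-50 as \(f_v\), a generic Hugging Face BERT as \(f_u\) (the paper uses a clinical-domain BERT), mean-pooling over token states, a 512-dimensional projection space, and an illustrative set of image augmentations.

```python
# Minimal sketch of a ConVIRT-style two-tower model (assumptions noted above).
import torch.nn as nn
from torchvision import models, transforms
from transformers import AutoModel


class ConVIRTSketch(nn.Module):
    def __init__(self, proj_dim: int = 512):
        super().__init__()
        # Image encoder f_v: ResNet-50 with the classification head removed (2048-d features).
        self.f_v = models.resnet50()
        self.f_v.fc = nn.Identity()
        # Text encoder f_u: generic BERT here; the paper uses a clinical-domain BERT.
        self.f_u = AutoModel.from_pretrained("bert-base-uncased")
        # Projection heads g_v, g_u: two-layer MLPs into a shared proj_dim space.
        self.g_v = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, proj_dim))
        self.g_u = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, proj_dim))

    def forward(self, images, input_ids, attention_mask):
        # h_v, h_u: encoder outputs for the (already transformed) image and text.
        h_v = self.f_v(images)                                            # (N, 2048)
        token_states = self.f_u(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        # Mean-pool token states into one vector per report (a simplification).
        mask = attention_mask.unsqueeze(-1).float()
        h_u = (token_states * mask).sum(1) / mask.sum(1)                  # (N, 768)
        # v, u: projected vectors used in the contrastive loss.
        return self.g_v(h_v), self.g_u(h_u)


# Stochastic image transformation: an illustrative augmentation pipeline,
# not the paper's exact list of augmentations.
image_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),
    transforms.ToTensor(),
])
```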
They then combine two contrastive losses. The image-to-text contrastive loss for the \(i\)-th pair is

\[
\ell_i^{(v \to u)} = -\log \frac{\exp(\langle v_i, u_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle v_i, u_k \rangle / \tau)},
\]

where \(\langle v, u \rangle\) represents the cosine similarity and \(\tau\) represents the temperature. The text-to-image contrastive loss is the symmetric counterpart:

\[
\ell_i^{(u \to v)} = -\log \frac{\exp(\langle u_i, v_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle u_i, v_k \rangle / \tau)}.
\]

The final loss is a linear combination of the two, weighted by \(\lambda\) and averaged over the batch:

\[
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \lambda\, \ell_i^{(v \to u)} + (1 - \lambda)\, \ell_i^{(u \to v)} \right).
\]
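As a rough sketch, assuming \(v\) and \(u\) are the projected image and text vectors for a minibatch of \(N\) pairs, the bidirectional loss reduces to two cross-entropies over a cosine-similarity matrix; the default values of \(\tau\) and \(\lambda\) below are illustrative, not taken from the paper.

```python
# Sketch of the bidirectional contrastive loss described above.
import torch
import torch.nn.functional as F


def convirt_loss(v: torch.Tensor, u: torch.Tensor, tau: float = 0.1, lam: float = 0.75):
    """v, u: (N, d) projected image/text vectors; tau: temperature; lam: loss weight."""
    # Cosine similarity matrix <v_i, u_k> for all pairs in the batch, scaled by tau.
    v = F.normalize(v, dim=-1)
    u = F.normalize(u, dim=-1)
    logits = v @ u.t() / tau                           # (N, N)
    targets = torch.arange(v.size(0), device=v.device)
    # Image-to-text loss: the matching report is the positive for each image.
    loss_v2u = F.cross_entropy(logits, targets)
    # Text-to-image loss: the matching image is the positive for each report.
    loss_u2v = F.cross_entropy(logits.t(), targets)
    # Weighted combination, averaged over the batch (cross_entropy already averages).
    return lam * loss_v2u + (1 - lam) * loss_u2v


# Example usage with random vectors:
if __name__ == "__main__":
    v, u = torch.randn(8, 512), torch.randn(8, 512)
    print(convirt_loss(v, u).item())
```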
- How well does the paper perform?
In both image and text retrieval, the paper beats all previous baselines. However, those baselines are at best adaptations of existing state-of-the-art methods, so the comparison may carry some inductive bias.
- What interesting variants are explored?
The authors also include some hyperparameter analysis, which is interesting to see.
TL;DR
- Contrastive learning over joint vision and text modalities can bring state-of-the-art results
- Text information can bring additional critical knowledge to visual representations
- Even across domains, utilizing pretrained models can make a difference.