author: | biofizzatreya |
score: | 5 / 10 |
TODO: Summarize the paper:
- What is the core idea? The paper addresses the lack of labelled medical image datasets by training a pair of neural networks on images and their corresponding text. The networks learn, in an unsupervised fashion, to recover the correct paired text for a given medical image, and vice versa.
- How is it realized (technically)? A set of paired inputs is denoted by \((x_v, x_u)\), where \(x_v\) is an image and \(x_u\) is a piece of text. Both inputs are passed through random transformations and then through an encoder: a ResNet-50 for images and BERT for text. Each encoder output is further transformed by a single-layer network. Two losses are then computed so that true pairs receive the lowest loss values despite the noise introduced by the random transformations: an image-to-text loss and a text-to-image loss, which are combined into the final loss.
- How well does the paper perform? The method beats previous benchmarks. However, these results rely strongly on the nature of the transformations.
- What interesting variants are explored? ConVIRT performs better for zero-shot image retrieval tasks.
The paper explores some of the hyperparameters involved in training. One of the examples shows that the loss function parameters strongly influence learning.
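
The two directional losses and their combination described under "How is it realized" can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the standard InfoNCE-style form \(\ell^{(v \to u)}_i = -\log \frac{\exp(\langle v_i, u_i \rangle / \tau)}{\sum_k \exp(\langle v_i, u_k \rangle / \tau)}\) on cosine similarities, with the text-to-image loss defined symmetrically and the final loss taken as a weighted average. The temperature `tau`, the weight `lam`, their default values, and all function names are assumptions introduced here for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def directional_loss(queries, keys, tau):
    """Mean contrastive loss where queries[i] should match keys[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def bidirectional_contrastive_loss(image_vecs, text_vecs, tau=0.1, lam=0.75):
    """Weighted sum of image-to-text and text-to-image losses (assumed form)."""
    v = [normalize(x) for x in image_vecs]
    u = [normalize(x) for x in text_vecs]
    image_to_text = directional_loss(v, u, tau)
    text_to_image = directional_loss(u, v, tau)
    return lam * image_to_text + (1.0 - lam) * text_to_image


# Toy projected embeddings: row i of each list is meant to be a true pair.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = bidirectional_contrastive_loss(images, matched_texts)
loss_shuffled = bidirectional_contrastive_loss(images, shuffled_texts)
```

On this toy batch the correctly paired inputs give a lower loss than the shuffled ones, which is the behaviour the objective is meant to reward. Changing `tau` or `lam` changes how sharply mismatches are penalised and how the two directions are balanced, which is consistent with the sensitivity to loss parameters noted above.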
TL;DR
- Unsupervised text-image matching
- Developed for medical images
- Relies strongly on loss hyperparameters