Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Liu, Lin, Cao, Hu, Wei, Zhang, Lin, Guo; 2021 - Summary
author: ywen666
score: 9 / 10

Abstract

Motivation

To reduce complexity, Swin Transformer constructs a hierarchical representation by starting from small-sized patches (outlined in gray) and gradually merging neighboring patches in deeper Transformer layers.

The linear computational complexity is achieved by computing self-attention locally within non-overlapping windows that partition an image (outlined in red). The number of patches in each window is fixed, and thus the complexity becomes linear in image size.

A key design element of Swin Transformer is its shift of the window partition between consecutive self-attention layers. The shifted windows bridge the windows of the preceding layer, providing connections among them that significantly enhance modeling power.

Overall Architecture

Swin Transformer first splits an input RGB image into non-overlapping patches by a patch splitting module. Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension (denoted as C).
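Below is a minimal PyTorch sketch of this patch splitting plus linear embedding, assuming the paper's patch size of \(4 \times 4\) (so each raw patch feature has dimension \(4 \times 4 \times 3 = 48\)). The `PatchEmbed` name and the strided-convolution trick are illustrative choices, not necessarily the authors' exact code:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project each to dim C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A conv with kernel = stride = 4 is equivalent to cutting 4x4 patches
        # and applying a shared linear layer to the 48 raw pixel values.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C) patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # -> (1, 3136, 96)
```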

Two Swin Transformer blocks are applied on these patch tokens. The Transformer blocks maintain the number of tokens (\(\frac{H}{4} \times \frac{W}{4}\)), and together with the linear embedding are referred to as “Stage 1”.

The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (2× downsampling of resolution), and the output dimension is set to 2C. Swin Transformer blocks are applied afterwards for feature transformation, with the resolution kept at \(\frac{H}{8} \times \frac{W}{8}\). This first block of patch merging and feature transformation is denoted as “Stage 2”.
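A minimal sketch of such a patch merging layer, following the description above (gather each \(2 \times 2\) neighborhood, concatenate to \(4C\) channels, project to \(2C\)); the `PatchMerging` class name and the normalization placement are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches and project 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # The four patches of every 2x2 neighborhood.
        x0 = x[:, 0::2, 0::2, :]               # (B, H/2, W/2, C)
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)      # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)                     # (B, H/2 * W/2, 4C)
        return self.reduction(self.norm(x))          # (B, H/2 * W/2, 2C)
```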

Stage 2 is followed by stages 3 and 4, which have the same architecture except for the number of Swin Transformer blocks and the output feature-map resolution (\(\frac{H}{16} \times \frac{W}{16}\) and \(\frac{H}{32} \times \frac{W}{32}\) respectively, with the channel dimension doubled at each stage); the window size itself stays fixed.
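For reference, the Swin-T variant in the paper uses \(C = 96\) with \(\{2, 2, 6, 2\}\) blocks per stage; a short summary of the four stages (illustrative layout, not code from the paper):

```python
# Swin-T configuration (C = 96); resolution halves and width doubles per stage.
swin_t_stages = [
    # (output resolution, channel dim, number of Swin Transformer blocks)
    ("H/4  x W/4",  96,  2),   # Stage 1
    ("H/8  x W/8",  192, 2),   # Stage 2
    ("H/16 x W/16", 384, 6),   # Stage 3
    ("H/32 x W/32", 768, 2),   # Stage 4
]
```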

Shifted Window based Self-Attention

Self-attention in non-overlapped windows

Global self-attention computation is generally unaffordable for large \(hw\), while window-based self-attention is scalable.
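Concretely, the paper gives the complexities of global multi-head self-attention (MSA) and window-based self-attention (W-MSA) on a feature map of \(h \times w\) patches with channel dimension \(C\) and window size \(M\):

\[
\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C,
\]
\[
\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC.
\]

The former is quadratic in the number of patches \(hw\), while the latter is linear when \(M\) is fixed (the paper uses \(M = 7\) by default).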

Shifted window partitioning in successive blocks

To introduce cross-window connections while maintaining the efficient computation of non-overlapping windows, the authors propose a shifted window partitioning approach which alternates between two partitioning configurations in consecutive Swin Transformer blocks.
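With W-MSA and SW-MSA denoting window-based self-attention under the regular and shifted partitioning, two consecutive Swin Transformer blocks are computed as in the paper:

\[
\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \qquad
z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},
\]
\[
\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, \qquad
z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.
\]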

As illustrated in the above figure, the first module uses a regular window partitioning strategy which starts from the top-left pixel, and the \(8 \times 8\) feature map is evenly partitioned into \(2 \times 2\) windows of size \(4 \times 4\) (\(M = 4\)).

Then, the next module adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by \((\lfloor \frac{M}{2} \rfloor, \lfloor \frac{M}{2} \rfloor)\) pixels from the regularly partitioned windows.
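A minimal sketch of the regular window partitioning and its inverse (hypothetical helper names; the window size \(M\) is assumed to divide the feature-map side):

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (num_windows*B, M, M, C) non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(1, 8, 8, 96)      # the 8x8 example above
wins = window_partition(x, 4)      # (4, 4, 4, 96): 2x2 windows of size 4x4
```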

Efficient batch computation for shifted configuration

To reduce the latency, the authors propose a more efficient batch computation approach that cyclically shifts the feature map toward the top-left, as illustrated in the figure below. After this shift, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window.
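A minimal sketch of the cyclic shift and of one way to build such a mask, reusing `window_partition` from the sketch above; the region-labeling idea mirrors the description in the paper, while the exact function names and the use of `-inf` are illustrative assumptions:

```python
import torch

def cyclic_shift(x, M):
    """Shift the (B, H, W, C) feature map by -(M//2) in both spatial dims."""
    # Undo after attention with shifts of +(M//2).
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))

def attention_mask(H, W, M):
    """Mask restricting attention to patches from the same original sub-window."""
    s = M // 2
    img = torch.zeros(1, H, W, 1)
    cnt = 0
    # Label each region produced by the shifted partitioning with a distinct id.
    for h in (slice(0, -M), slice(-M, -s), slice(-s, None)):
        for w in (slice(0, -M), slice(-M, -s), slice(-s, None)):
            img[:, h, w, :] = cnt
            cnt += 1
    win = window_partition(img, M).view(-1, M * M)      # (nW, M*M) region ids
    mask = win.unsqueeze(1) - win.unsqueeze(2)          # (nW, M*M, M*M)
    return mask.masked_fill(mask != 0, float('-inf'))   # -inf blocks attention
```

The resulting \((nW, M^2, M^2)\) mask is added to the attention logits before the softmax inside each window of the shifted configuration.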

Experiments

The paper conducts a thorough empirical evaluation on ImageNet-1K image classification, COCO object detection, and ADE20K semantic segmentation. The results show that Swin Transformer outperforms ResNet and ViT when used as the backbone model.

TL;DR

Swin Transformer: a hierarchical Transformer whose representation is computed with shifted windows. The hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. It achieves state-of-the-art results on various vision benchmarks when used as the backbone model.