author: zhaoyue-zephyrus
score: 10 / 10

Long Short-Term Memories
- Short-term memory is a FIFO queue of \(m_S\) slots, \(f_{T}, \cdots, f_{T-m_S+1}\).
- When a frame becomes older than \(m_S\) steps, it is popped from the short-term queue and pushed into the long-term memory.
- Long-term memory is a FIFO queue of \(m_L\) slots, \(f_{T-m_S}, \cdots, f_{T-m_S-m_L+1}\).
- \(m_S = 32, m_L = 2048\) at 4 FPS, i.e., 8 seconds of short-term and 512 seconds of long-term memory (see the sketch below).
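
A minimal sketch of the two FIFO queues described above. The class name `ShortLongMemory`, the use of Python `deque`s, and the placeholder frame features are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

M_S, M_L = 32, 2048  # short-term and long-term memory lengths (4 FPS)


class ShortLongMemory:
    """Two chained FIFO queues: newest frame at index 0 of each deque."""

    def __init__(self, m_s=M_S, m_l=M_L):
        self.short = deque(maxlen=m_s)  # f_T, ..., f_{T-m_S+1}
        self.long = deque(maxlen=m_l)   # f_{T-m_S}, ..., f_{T-m_S-m_L+1}

    def push(self, feature):
        """Insert the newest frame feature f_T; the frame that becomes older
        than m_S steps moves from the short-term to the long-term queue."""
        if len(self.short) == self.short.maxlen:
            evicted = self.short.pop()      # f_{T-m_S}: oldest short-term frame
            self.long.appendleft(evicted)   # becomes the newest long-term slot;
                                            # a full deque drops its oldest entry
        self.short.appendleft(feature)


mem = ShortLongMemory()
for t in range(3000):                       # stream of per-frame features (placeholders)
    mem.push(f"f_{t}")
print(len(mem.short), len(mem.long))        # 32 2048 once warmed up
print(mem.short[0], mem.short[-1])          # f_2999 (f_T), f_2968 (f_{T-m_S+1})
print(mem.long[0], mem.long[-1])            # f_2967 (f_{T-m_S}), f_920 (f_{T-m_S-m_L+1})
```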

LSTR Encoder
- Pure self-attention is computationally prohibitive for encoding the long-term memory, since its cost grows quadratically with the number of memory tokens (\(m_L = 2048\)).
- Transformer decoder units (sketched in code after the equations):
- Inputs: (1) learnable output tokens \(\lambda\in\mathbb{R}^{n\times C}\) and (2) input tokens \(\theta\in\mathbb{R}^{m\times C}\)
- \[\lambda' = \mathrm{SelfAttention}(\lambda) = \mathrm{Softmax}(\frac{\lambda\lambda^T}{\sqrt{C}})\lambda\]
- \[\mathrm{CrossAttention}(\sigma(\lambda'), \theta) = \mathrm{Softmax}(\frac{\sigma(\lambda')\theta^T}{\sqrt{C}})\theta\]
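
A bare-bones PyTorch sketch of the two equations above: single head, no input/output projections, feed-forward layers, residuals, or LayerNorm, and \(\sigma\) left as an identity placeholder. The class name and the shapes in the usage example are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn


class DecoderUnitSketch(nn.Module):
    """Self-attention over the learnable output tokens, then cross-attention
    into the input tokens, exactly as in the two equations above."""

    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels
        self.sigma = nn.Identity()  # placeholder for the intermediate transform sigma(.)

    def forward(self, lam: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # lam: (n, C) learnable output tokens; theta: (m, C) input tokens
        scale = self.channels ** 0.5
        # lambda' = Softmax(lambda lambda^T / sqrt(C)) lambda
        lam_prime = F.softmax(lam @ lam.T / scale, dim=-1) @ lam
        # CrossAttention(sigma(lambda'), theta) = Softmax(sigma(lambda') theta^T / sqrt(C)) theta
        q = self.sigma(lam_prime)
        return F.softmax(q @ theta.T / scale, dim=-1) @ theta


if __name__ == "__main__":
    n, m, C = 16, 2048, 1024                  # e.g. 16 queries over the long-term memory
    unit = DecoderUnitSketch(C)
    lam = nn.Parameter(torch.randn(n, C))     # learnable output tokens
    theta = torch.randn(m, C)                 # long-term memory features
    print(unit(lam, theta).shape)             # torch.Size([16, 1024])
```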
- Two-stage memory compression: the long-term memory is compressed by applying these decoder units in two successive stages (see the sketch below).
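
One plausible wiring of the two-stage compression, assuming stage one compresses the \(m_L\) long-term tokens with learnable queries and a single cross-attention, and stage two refines the compressed tokens with a short self-attention stack. The token count `n0`, the depths, and the use of `nn.MultiheadAttention` as a stand-in are all assumptions, not the paper's exact architecture.

```python
import torch
from torch import nn


class TwoStageCompressionSketch(nn.Module):
    """Illustrative two-stage compression of the long-term memory.

    Stage 1: n0 learnable tokens cross-attend once over all m_L memory tokens.
    Stage 2: a short stack of self-attention blocks refines the n0 tokens.
    (Depths and widths are assumptions made for the sketch.)"""

    def __init__(self, channels=1024, n0=16, num_heads=8, stage2_depth=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n0, channels))  # learnable output tokens
        self.stage1 = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.stage2 = nn.ModuleList(
            nn.MultiheadAttention(channels, num_heads, batch_first=True)
            for _ in range(stage2_depth)
        )

    def forward(self, long_memory: torch.Tensor) -> torch.Tensor:
        # long_memory: (B, m_L, C) -> compressed tokens: (B, n0, C)
        q = self.queries.unsqueeze(0).expand(long_memory.size(0), -1, -1)
        z, _ = self.stage1(q, long_memory, long_memory)   # m_L tokens -> n0 tokens
        for attn in self.stage2:                          # refine the n0 tokens
            z = z + attn(z, z, z)[0]                      # self-attention + residual
        return z


if __name__ == "__main__":
    enc = TwoStageCompressionSketch()
    long_mem = torch.randn(1, 2048, 1024)
    print(enc(long_mem).shape)   # torch.Size([1, 16, 1024])
```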

LSTR Decoder
- Use short-term memory as queries to retrieve useful information from the encoder.
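
A hedged sketch of this retrieval step: the \(m_S\) short-term tokens act as queries and cross-attend to the encoder's compressed output, with plain `nn.MultiheadAttention` as a stand-in. The dimensions and the omission of any prediction head are assumptions.

```python
import torch
from torch import nn

# Assumed dimensions; only the roles (queries vs. keys/values) come from the notes.
C, HEADS = 1024, 8
m_S, n_enc = 32, 16                      # short-term tokens, compressed long-term tokens

cross_attn = nn.MultiheadAttention(C, HEADS, batch_first=True)

short_memory = torch.randn(1, m_S, C)    # queries: the short-term memory frames
encoded_long = torch.randn(1, n_enc, C)  # keys/values: output of the LSTR encoder

# Each short-term frame retrieves relevant context from the compressed long-term memory.
retrieved, _ = cross_attn(short_memory, encoded_long, encoded_long)
print(retrieved.shape)                   # torch.Size([1, 32, 1024]): one token per frame
```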

Online inference with LSTR
- The queries of the first Transformer decoder unit are the fixed learnable output tokens, so they are identical at every time step.
- In the first-stage cross-attention, keep the positional embedding matrix fixed and update the feature matrix in a FIFO manner (see the sketch below).
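
A minimal sketch of that FIFO update: the positional embeddings stay bound to the memory slots while the feature matrix rolls by one slot per step, so only the newest row changes between steps. Buffer names and shapes are assumptions.

```python
import torch

# Illustrative streaming buffer for the long-term memory (names/shapes assumed).
m_L, C = 2048, 1024
pos_embed = torch.randn(m_L, C)   # fixed positional embeddings, one per slot
features = torch.zeros(m_L, C)    # rolling feature matrix, newest frame in row 0


def step(new_feature: torch.Tensor) -> torch.Tensor:
    """Advance one time step: shift the features by one slot (FIFO) and
    return the token matrix fed to the first-stage cross-attention."""
    global features
    features = torch.roll(features, shifts=1, dims=0)  # every frame ages by one slot
    features[0] = new_feature   # the frame just evicted from short-term memory enters slot 0
    # The positional embeddings are NOT shifted: slot k always receives pos_embed[k],
    # so only one row of the attention input changes per step.
    return features + pos_embed


tokens = step(torch.randn(C))
print(tokens.shape)               # torch.Size([2048, 1024])
```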

Experimental results
- Long-term memory is beneficial for lengths of up to 1024 seconds.

TL;DR
- A long- and short-term memory Transformer based on cross-attention.
- Enables a trainable long-term memory and shows that it improves performance.
- Efficient implementation for online inference.