author: zhaoyue-zephyrus
score: 8 / 10
- Specialize a low-cost student model through online distillation to a specific target distribution in video streams.
- Architectural design: Just-In-Time Network (JITNet)
  - Separable filters
  - Small number of channels
  - Skip connections from each encoder block to the corresponding decoder block
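The separable filters and narrow channels are what make JITNet cheap to run and fast to update. A minimal sketch of the parameter savings from a depthwise-separable 3×3 convolution versus a standard one (the channel counts below are illustrative, not the paper's exact architecture):

```python
def conv_params(c_in, c_out, k=3):
    """Parameters in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k=3):
    """Depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv mixing channels
    return depthwise + pointwise

# Illustrative (hypothetical) channel counts for one encoder block.
std = conv_params(64, 64)            # 36864 parameters
sep = separable_conv_params(64, 64)  # 4672 parameters
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For 64-to-64 channels this is roughly an 8× reduction per layer, which compounds with the small channel counts to give the overall runtime speedup.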
- Training paradigm: Just-In-Time Model Distillation
  - (1) Pre-train JITNet on COCO; (2) re-train online on the live video stream as new frames arrive, using a Mask R-CNN teacher
  - Adaptive re-training:
    - Periodic distillation: run the teacher network every \(\delta\) frames, with \(\delta\) set adaptively from recent student accuracy (8–64 frames)
    - Rapid specialization: update all layers with a high learning rate and momentum; terminate when accuracy reaches a threshold (0.9 by default) or the number of update iterations hits an upper limit (8 by default)
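The adaptive re-training loop above can be sketched as follows. The model functions and the fixed stand-in accuracy are placeholders, not the paper's implementation, but the stride doubling/halving mirrors the periodic-distillation and rapid-specialization steps described:

```python
ACC_THRESHOLD = 0.9          # target accuracy (paper default)
MAX_UPDATES = 8              # max gradient steps per teacher frame
DELTA_MIN, DELTA_MAX = 8, 64 # teacher stride bounds, in frames

# Hypothetical stubs standing in for the real student (JITNet)
# and teacher (Mask R-CNN) models.
def teacher_predict(frame): return "teacher_mask"
def student_predict(frame): return "student_mask"
def student_update(frame, label): pass  # one high-lr SGD step
def student_accuracy(pred, label):
    return 0.95  # stand-in; real code compares masks (e.g., IoU)

def run_stream(frames):
    """Online distillation loop; returns the number of teacher calls."""
    delta, next_teacher, teacher_calls = DELTA_MIN, 0, 0
    for i, frame in enumerate(frames):
        pred = student_predict(frame)  # cheap student runs every frame
        if i < next_teacher:
            continue  # between teacher frames: use student output as-is
        label = teacher_predict(frame)  # run the expensive teacher
        teacher_calls += 1
        if student_accuracy(pred, label) >= ACC_THRESHOLD:
            delta = min(delta * 2, DELTA_MAX)  # doing well: widen stride
        else:
            # Rapid specialization: update until accurate or budget spent.
            for _ in range(MAX_UPDATES):
                student_update(frame, label)
                acc = student_accuracy(student_predict(frame), label)
                if acc >= ACC_THRESHOLD:
                    break
            delta = max(delta // 2, DELTA_MIN)  # check again sooner
        next_teacher = i + delta
    return teacher_calls
```

With the optimistic stub accuracy, `run_stream(list(range(64)))` invokes the teacher only 3 times as the stride grows from 8 to 64; a struggling student would instead trigger updates and a shrinking stride.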
- Experimental results:
  - Long Video Streams (LVS) dataset: 30 HD videos (720p or higher), each 30 min long
  - Pseudo-labels generated by a high-performance Detectron Mask R-CNN
  - Quantitative results:
    - JITNet 0.9 maintains 82.5 mean IoU with a 7.5× runtime speedup
    - JITNet is also more accurate than the offline oracle, flow-based interpolation methods, and OSVOS (one-shot video object segmentation)
  - Room for improvement remains, especially for small objects (e.g., traffic-camera or aerial views)
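Mean IoU, the accuracy metric quoted above, averages per-class intersection-over-union between the student's and teacher's label maps. A minimal NumPy sketch (skipping classes absent from both masks is an assumption about the exact evaluation protocol):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both masks: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0, 1], [0, 1, 1]])
target = np.array([[0, 1, 1], [0, 1, 1]])
# class 0: inter 2, union 3 -> 2/3; class 1: inter 3, union 4 -> 3/4
print(mean_iou(pred, target, 2))  # ~0.708
```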
TL;DR
- An online distillation approach for fast adaptation and efficient inference.
- Maintains high accuracy with a significant speedup.