author: chengchunhsu
score: 7 / 10
What is the core idea?
Problem:
- labeled data for most learning tasks is scarce, while unlabeled text is abundant
- pre-training good representations on unlabeled text can provide a significant performance boost
Challenge:
- how to design training objectives?
- how to transfer to the target task?
Solution: introduce the Generative Pre-Training (GPT) model, a semi-supervised strategy for improving performance across multiple NLP tasks.
- generative pre-training (unsupervised)
- discriminative fine-tuning (supervised)
How is it realized (technically)?
Model architecture: Transformer
- better at capturing long-term dependencies in text than recurrent models such as LSTMs
Step #1: Unsupervised pre-training
- Only use the decoder of the Transformer
- Objective: maximize the language modeling likelihood on unlabeled text (see the sketch below)
- Predict the next token given the preceding context window
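A minimal PyTorch sketch of this step, assuming a toy decoder-style Transformer trained with the next-token (language modeling) objective; `TinyDecoderLM`, `lm_loss`, and the hyperparameters are illustrative stand-ins, not the paper's 12-layer configuration.

```python
# Toy decoder-style LM for the pre-training objective L1(U) = Σ_i log P(u_i | u_{i-k}, ..., u_{i-1}).
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, n_heads=4, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Decoder-style (causal) self-attention, built from encoder layers plus a causal mask.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        t = tokens.size(1)
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Mask out future positions so position i only attends to positions <= i.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=tokens.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h), h  # per-position next-token logits, final hidden states

def lm_loss(model, tokens):
    # Maximize log P(u_i | u_<i), i.e. minimize next-token cross-entropy.
    logits, _ = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Usage on random token ids as a stand-in for an unlabeled corpus.
model = TinyDecoderLM()
batch = torch.randint(0, 1000, (8, 32))
lm_loss(model, batch).backward()
```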
Step #2: Supervised fine-tuning
- Start from the pre-trained model
- Add a linear output layer as the last layer
- Objective: maximize the likelihood of the correct label (see the sketch below)
- Keeping the auxiliary language modeling objective from step #1 improves generalization and speeds up convergence
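Continuing the sketch above (reusing `TinyDecoderLM`, `model`, and `batch`), a hedged illustration of fine-tuning: a new linear classifier on the final hidden state, with the supervised loss combined with the auxiliary LM loss as L3 = L2 + λ·L1 (λ = 0.5 in the paper). `ClassifierHead` and `finetune_loss` are hypothetical names, not the paper's code.

```python
# Fine-tuning sketch: pre-trained decoder + new linear layer + auxiliary LM objective.
class ClassifierHead(nn.Module):
    def __init__(self, pretrained_lm, d_model=128, num_classes=2):
        super().__init__()
        self.lm = pretrained_lm                             # pre-trained decoder from step #1
        self.classifier = nn.Linear(d_model, num_classes)   # new linear layer, trained from scratch

    def forward(self, tokens):
        lm_logits, hidden = self.lm(tokens)
        # Classify from the last token's hidden state (stand-in for the paper's extract token).
        return self.classifier(hidden[:, -1, :]), lm_logits

def finetune_loss(model, tokens, labels, lm_weight=0.5):
    class_logits, lm_logits = model(tokens)
    supervised = nn.functional.cross_entropy(class_logits, labels)   # L2: label likelihood
    auxiliary = nn.functional.cross_entropy(                         # L1: LM loss on the same text
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    return supervised + lm_weight * auxiliary                        # L3 = L2 + λ·L1

clf = ClassifierHead(model)
labels = torch.randint(0, 2, (8,))
finetune_loss(clf, batch, labels).backward()
```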
Problem: some tasks have structured inputs rather than a single contiguous text sequence
- Textual entailment (sentence pairs)
- Similarity (sentence pairs)
- Question Answering and Commonsense Reasoning (question/answer)
Solution: traversal-style input transformation, i.e., concatenate the structured inputs into one token sequence with delimiter tokens (see the sketch below)
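A rough sketch of these traversal-style input transformations; the special-token strings and helper names below are illustrative placeholders, not the paper's actual vocabulary.

```python
# Serialize structured inputs into single token sequences so the same pre-trained
# model can handle every task.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    # Textual entailment: premise and hypothesis joined by a delimiter token.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # Similarity has no inherent ordering, so both orderings are encoded and their
    # final representations are combined before the classifier.
    return entailment_input(text_a, text_b), entailment_input(text_b, text_a)

def multiple_choice_inputs(context, question, answers):
    # QA / commonsense reasoning: one sequence per candidate answer; a softmax over
    # the per-sequence scores picks the answer.
    return [[START] + context + question + [DELIM] + ans + [EXTRACT] for ans in answers]

print(entailment_input(["it", "rains"], ["the", "ground", "is", "wet"]))
```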
How well does the paper perform?
- Benchmark
  - SOTA in 9 out of the 12 datasets
- Analysis
  - Impact of number of layers transferred
  - Zero-shot behaviors
  - Ablation studies
Impact of number of layers transferred
Transferring more layers provides further benefits (sketched below)
→ each layer in the pre-trained model contains functionality useful for solving the target task.
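A rough sketch of what this experiment varies, reusing the toy `TinyDecoderLM`/`ClassifierHead` from the sketches above: fine-tune with only the first k pre-trained Transformer blocks and compare accuracy as k grows. `transfer_first_k_layers` is a hypothetical helper, not the paper's code.

```python
import copy

def transfer_first_k_layers(pretrained_lm, k):
    # Keep the embeddings and only the first k Transformer blocks of the pre-trained LM;
    # the rest of the fine-tuning setup stays the same.
    truncated = copy.deepcopy(pretrained_lm)
    truncated.blocks.layers = truncated.blocks.layers[:k]
    return truncated

partial = ClassifierHead(transfer_first_k_layers(model, k=1))
```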
Zero-shot Behaviors
Even without fine-tuning, the generative model's performance on downstream tasks improves steadily with the number of pre-training updates.
→ pre-training supports the learning of a wide range of task-relevant functionality
Ablation studies
- Larger datasets benefit from the auxiliary objective; smaller ones do not
- Transformer > LSTM (by 5.6 average score)
- Removing pre-training results in a 14.8% average drop
What interesting variants are explored?
- Language Models are Unsupervised Multitask Learners (GPT-2), OpenAI technical report, 2019
TL;DR
- Apply one pre-trained model to many tasks
- Method: unsupervised pre-training of a Transformer decoder + supervised fine-tuning
- Language modeling serves as an effective pre-training objective that helps the model generalize well