I. Introduction


II. Background


III. Framework


III-I. Unsupervised Pre-training


III-II. Supervised Fine-tuning


III-III. Task-specific Input Transformations


IV. Experiments


IV-I. Setup


Datasets


Model Specification


IV-II. Supervised Fine-tuning


Hyper-parameters


  • LR:
    • LR decay: linear schedule with warmup over 0.2% of training (see the sketch after this list)
  • λ (weight of the auxiliary LM objective): 0.5
  • batch-size: 32
  • dropout: 0.1
  • epochs: 3
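
Below is a minimal sketch, assuming PyTorch, of how these fine-tuning settings could be wired together (linear LR decay with warmup over 0.2% of updates, 3 epochs). The base learning rate of 6.25e-5 and the choice of Adam are assumptions taken from the original GPT paper rather than from these notes; `model` and `steps_per_epoch` are hypothetical placeholders.

# Sketch of the fine-tuning optimizer/schedule described above (not the paper's code).
import torch

EPOCHS = 3
BATCH_SIZE = 32
CLASSIFIER_DROPOUT = 0.1   # dropout applied to the classifier head
BASE_LR = 6.25e-5          # assumption: value reported in the GPT-1 paper, not in these notes
LM_LAMBDA = 0.5            # weight of the auxiliary LM objective (λ above)
WARMUP_FRAC = 0.002        # warmup over 0.2% of the total fine-tuning updates


def linear_warmup_decay(total_steps):
    """Return an LR multiplier: linear warmup, then linear decay to zero."""
    warmup_steps = max(1, int(WARMUP_FRAC * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return lr_lambda


def make_optimizer_and_scheduler(model, steps_per_epoch):
    # model and steps_per_epoch are hypothetical placeholders for the task-specific setup.
    total_steps = EPOCHS * steps_per_epoch
    optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=linear_warmup_decay(total_steps))
    return optimizer, scheduler
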

Loss (Objective)
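
For reference, the combined fine-tuning objective that λ = 0.5 above weights, written in the paper's notation (a sketch from memory, not a verbatim transcription):

L_1(\mathcal{C}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)   % auxiliary language-modeling loss
L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)               % supervised loss on labeled data
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})             % combined objective, \lambda = 0.5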


V. Analysis


V-I. Impact of Number of Layers Transferred


V-II. Zero-shot Behavior


V-III. Ablation Study


VI. Conclusion


Contribution