Method
SimCLR v2
- deeper but less wide networks:
  - a 152-layer ResNet with 3× wider channels
  - selective kernels (SK), a channel-wise attention mechanism (a sketch follows this list)
  - obtains a 29% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples
- deeper (non-linear) projection head:
  - instead of throwing away g(·) entirely after pretraining as in SimCLR, fine-tune from a middle layer of it
  - uses a 3-layer projection head and fine-tunes from the 1st layer of the projection head (see the fine-tuning sketch below)
  - yields a 14% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples
- memory mechanism (from MoCo):
  - designates a memory network (with a moving average of weights for stabilization) whose outputs are buffered as negative examples (see the sketch after this list)
  - yields an improvement of ∼1% for linear evaluation as well as for fine-tuning on 1% of labeled examples
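To make the channel-wise attention in selective kernels concrete, here is a minimal PyTorch sketch of an SK-style block: two convolution branches with different receptive fields are fused by a per-channel soft attention over the branches. The class name, branch count, and reduction ratio are illustrative assumptions, not the exact SK-ResNet / SimCLR v2 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SKConv(nn.Module):
    """Sketch of a selective-kernel block: two conv branches with different
    receptive fields, fused by channel-wise soft attention over the branches.
    Hyperparameters are illustrative, not the SimCLR v2 settings."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 32)
        # Branch 1: 3x3 conv; branch 2: dilated 3x3 conv (~5x5 receptive field).
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Squeeze the fused feature, then score each branch per channel.
        self.squeeze = nn.Sequential(nn.Linear(channels, mid), nn.ReLU(inplace=True))
        self.branch_scores = nn.ModuleList([nn.Linear(mid, channels) for _ in range(2)])

    def forward(self, x):
        u1, u2 = self.branch1(x), self.branch2(x)
        u = u1 + u2                               # fuse the branches
        s = u.mean(dim=(2, 3))                    # global average pool -> (B, C)
        z = self.squeeze(s)                       # compact descriptor
        scores = torch.stack([fc(z) for fc in self.branch_scores], dim=1)  # (B, 2, C)
        attn = F.softmax(scores, dim=1)           # channel-wise attention across branches
        a1 = attn[:, 0].unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        a2 = attn[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a1 * u1 + a2 * u2
```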
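Below is a minimal PyTorch sketch of the MoCo-style memory mechanism: an exponential-moving-average ("memory") copy of the encoder produces keys that are pushed into a FIFO buffer and reused as negatives. It assumes `encoder` maps inputs to `dim`-dimensional embeddings and that the queue size is a multiple of the batch size; names such as `MomentumQueue` are illustrative, not from the papers.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MomentumQueue(nn.Module):
    """Sketch of a MoCo-style memory mechanism: a momentum encoder whose
    weights are a moving average of the online encoder, plus a FIFO queue
    whose entries are buffered and reused as negative examples."""

    def __init__(self, encoder: nn.Module, dim: int = 128,
                 queue_size: int = 65536, momentum: float = 0.999):
        super().__init__()
        self.m = momentum
        self.encoder_q = encoder                   # online encoder (trained by backprop)
        self.encoder_k = copy.deepcopy(encoder)    # memory encoder (EMA of weights)
        for p in self.encoder_k.parameters():
            p.requires_grad = False
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Moving average of weights stabilizes the memory network.
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.mul_(self.m).add_(pq.detach(), alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys: torch.Tensor):
        n = keys.shape[0]
        ptr = int(self.ptr)
        self.queue[ptr:ptr + n] = keys             # assumes queue_size % batch_size == 0
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, x_q: torch.Tensor, x_k: torch.Tensor, temperature: float = 0.1):
        q = F.normalize(self.encoder_q(x_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(x_k), dim=1)
        pos = (q * k).sum(dim=1, keepdim=True)     # positive logits (B, 1)
        neg = q @ self.queue.t()                   # negatives from the buffer (B, K)
        logits = torch.cat([pos, neg], dim=1) / temperature
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)     # InfoNCE-style contrastive loss
        self._enqueue(k)
        return loss
```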
Fine-tuning
- incorporate part of the MLP projection head into the base encoder during fine-tuning, i.e., fine-tune from the 1st layer of the projection head rather than from its input (sketched below)
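A minimal PyTorch sketch of this, assuming a 3-layer projection head and a 2048-dimensional encoder output: the first layer of g(·) is kept as part of the base and a task classifier is attached on top. The function names and layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn


def projection_head(feat_dim: int = 2048, proj_dim: int = 128) -> nn.Sequential:
    """3-layer non-linear projection head g(.) used during pretraining."""
    return nn.Sequential(
        nn.Linear(feat_dim, feat_dim), nn.BatchNorm1d(feat_dim), nn.ReLU(inplace=True),  # layer 1
        nn.Linear(feat_dim, feat_dim), nn.BatchNorm1d(feat_dim), nn.ReLU(inplace=True),  # layer 2
        nn.Linear(feat_dim, proj_dim),                                                   # layer 3
    )


def build_finetune_model(encoder: nn.Module, head: nn.Sequential,
                         num_classes: int, feat_dim: int = 2048) -> nn.Module:
    """Fine-tune from the 1st layer of the projection head: keep the encoder
    plus layer 1 of g(.) as the new base, then attach a task classifier."""
    first_proj_layer = head[:3]   # Linear + BN + ReLU of layer 1
    return nn.Sequential(encoder, first_proj_layer, nn.Linear(feat_dim, num_classes))
```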
Self-training / knowledge distillation via unlabeled examples
teacher-student distillation:
use the fine-tuned network as a teacher to impute labels for training a student network on the unlabeled data
loss: L^distill = -Σ_{x_i∈D} [ Σ_y P^T(y|x_i; τ) log P^S(y|x_i; τ) ], where P(y|x_i; τ) = exp(f^task(x_i)[y]/τ) / Σ_{y'} exp(f^task(x_i)[y']/τ)
τ: scalar temperature parameter that controls how soft (balanced) the imputed label distribution is
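A minimal PyTorch sketch of this distillation loss on a batch of unlabeled examples; the function name and the default temperature are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      tau: float = 1.0) -> torch.Tensor:
    """Cross-entropy between the teacher's and the student's
    temperature-scaled label distributions on unlabeled examples."""
    teacher_probs = F.softmax(teacher_logits.detach() / tau, dim=1)   # P^T(y|x; tau)
    student_log_probs = F.log_softmax(student_logits / tau, dim=1)    # log P^S(y|x; tau)
    # L_distill = - sum_y P^T(y|x) log P^S(y|x), averaged over the batch
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()
```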