motivation
- Generative vs. discriminative approaches: pixel-level generation is computationally expensive and may not be necessary for representation learning
contribution
- SSL (self-supervised learning) relies more heavily on data augmentation than SL (supervised learning)
- Inserting a nonlinear transformation (MLP) between the representation layer and the final loss improves performance
- Normalized embeddings together with an appropriate temperature parameter (in the softmax) benefit representation learning
- Larger batch sizes, longer training, and deeper & wider networks all benefit SSL
method
The Contrastive Learning Framework
As shown in the figure above, the framework consists of four main components:
- A stochastic data augmentation module
Two different random augmentations of the same image produce two correlated views, which form a positive pair
The paper uses: random cropping, random color distortions, and random Gaussian blur
- A neural network base encoder f(·)
The paper uses ResNet-50
- A small neural network projection head g(·)
Maps the encoder's representations to the latent space where the contrastive loss is applied
The paper uses a two-layer MLP
- A contrastive loss function
- sim denotes cosine similarity (see the loss formula below)
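For reference, the NT-Xent loss for a positive pair $(i, j)$ among the $2N$ augmented examples in a batch is

$$
\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
$$

where $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ and $\tau$ is the temperature; the total loss averages $\ell_{i,j}$ over all positive pairs in the batch.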
The overall training procedure is as follows:
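As a rough illustration of that procedure, here is a minimal PyTorch-style sketch of one training step; the names `augment`, `encoder`, `projection_head`, and `nt_xent_loss` are illustrative, and hyperparameters such as the temperature are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of N positive pairs (2N augmented views)."""
    z = torch.cat([z1, z2], dim=0)                        # (2N, d)
    z = F.normalize(z, dim=1)                             # l2-normalize embeddings
    sim = z @ z.t() / temperature                         # cosine similarities / tau
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                 # exclude self-similarity
    # the positive for sample i is its other view: i + n (or i - n)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def training_step(batch, augment, encoder, projection_head, optimizer):
    x1, x2 = augment(batch), augment(batch)               # two random views per image
    h1, h2 = encoder(x1), encoder(x2)                     # representations h = f(x)
    z1, z2 = projection_head(h1), projection_head(h2)     # projections z = g(h)
    loss = nt_xent_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```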
Training with Large Batch Size
- In the experiments, the batch size is varied from 256 to 8192
- Standard optimizers (SGD / Momentum) are unstable at large batch sizes, so the LARS optimizer is used instead
- 128 TPU v3 cores are used for training
- Global BN
- With per-device BatchNorm, each device computes its own mean and variance, which leaks information the model can exploit (the two views of a positive pair sit on the same device); BN statistics are therefore aggregated across all devices (global BN), as sketched below
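The paper trains on TPUs with global BN; in a PyTorch multi-GPU (DDP) setup, a comparable effect can be obtained with `SyncBatchNorm`. A minimal sketch:

```python
import torch
import torchvision

# Replace every BatchNorm layer so that mean/variance are computed over the
# global batch across all devices instead of per device (requires an
# initialized distributed process group at training time).
model = torchvision.models.resnet50()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```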
Evaluation Protocol
- Dataset: ImageNet ILSVRC-2012
- Evaluation: a linear classifier is trained on top of the frozen base network (the linear evaluation protocol; a sketch follows below)
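A minimal sketch of the linear evaluation protocol, assuming a pretrained `encoder` that maps images to feature vectors; names and hyperparameters here are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, feature_dim, num_classes, train_loader, epochs=90):
    """Train a linear classifier on top of the frozen base network."""
    encoder.eval()                                    # freeze the base network
    for p in encoder.parameters():
        p.requires_grad = False

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                features = encoder(images)            # frozen representations h
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```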
Data Augmentation for Contrastive Representation Learning
Composition of data augmentation operations is crucial for learning good representations
The figure above shows the common data augmentation operations. In the experiments the authors compare them in pairs and find that composing two augmentations is usually more effective than applying any single augmentation alone (the diagonal of the comparison matrix).
Contrastive learning needs stronger data augmentation than supervised learning
The authors compare different strengths of color distortion: stronger color distortion improves SSL but hurts SL, presumably because heavy color distortion widens the distribution gap between the training and validation data.
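A sketch of the color-distortion pipeline with a strength parameter `s`, following the description in the paper's appendix; the exact kernel size and probabilities here should be treated as approximations.

```python
from torchvision import transforms

def color_distortion(s=1.0):
    """Color distortion with strength s (stronger s helps SSL, per the paper)."""
    color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return transforms.Compose([
        transforms.RandomApply([color_jitter], p=0.8),
        transforms.RandomGrayscale(p=0.2),
    ])

# Full augmentation pipeline applied to each view (values illustrative):
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    color_distortion(s=1.0),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```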
Architectures for Encoder and Head
Unsupervised contrastive learning benefits (more) from bigger models
As the model grows larger (more parameters), the performance gain for SSL is larger than for SL.
A nonlinear projection head improves the representation quality of the layer before it
A nonlinear projection head typically outperforms a linear one by about 3%.
the hidden layer before the projection head is a better representation than the layer after
Therefore h is used as the representation: since z = g(h) is trained to be invariant to data transformations, g may discard information such as color or orientation that is useful for downstream classification.
The figure below shows that g(h) is insensitive to transformations such as rotation, corruption, and Sobel filtering.
Loss Functions and Batch Size
Normalized cross entropy loss with adjustable temperature works better than alternatives
- NT-Xent (Normalized Temperature-scaled Cross Entropy) achieves the best results
- With l2-normalized embeddings, an appropriate temperature parameter helps the model learn from hard negative examples (see the illustration below)
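A small numerical illustration (the similarity values are made up) of why the temperature matters: with l2-normalized embeddings, dividing the cosine similarities by a smaller τ sharpens the softmax, so the hardest negatives receive most of the weight in the loss gradient.

```python
import torch
import torch.nn.functional as F

# Cosine similarities between an anchor and its negatives (illustrative values).
neg_sims = torch.tensor([0.9, 0.5, 0.1, -0.3])    # 0.9 is a "hard" negative

for tau in (1.0, 0.5, 0.1):
    weights = F.softmax(neg_sims / tau, dim=0)     # relative weight of each negative
    print(tau, weights.tolist())
# As tau shrinks, almost all weight concentrates on the hardest negative (0.9),
# so the loss focuses on pushing the anchor away from it.
```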
Contrastive learning benefits (more) from larger batch sizes and longer training
Comparison with State-of-the-art
Linear evaluation
Semi-supervised learning
Sample 1% or 10% of the labeled ILSVRC-12 training data in a class-balanced way and fine-tune on it (a sampling sketch follows)
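A minimal sketch of class-balanced subsampling; the helper below is illustrative, not the paper's code.

```python
import numpy as np

def class_balanced_subset(labels, fraction, seed=0):
    """Return indices covering `fraction` of the data with equal coverage per class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    indices = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(round(fraction * len(idx))))
        indices.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(indices)

# e.g. labeled_idx = class_balanced_subset(train_labels, fraction=0.01)
```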
Transfer learning
Train with SSL on a mixed dataset, then fine-tune on specific target datasets.