# The Core Idea of Self-Supervised Learning

Unsupervised Pre-train, Supervised Fine-tune.

## Two Mainstream Approaches

• Generative methods
• Contrastive methods

## Practical Applications

• BERT series: NLP
• ViT series: CV
• data2vec series: multimodal
• SimCLR series: contrastive learning
• MoCo series

# Contrastive Representation Learning

• Construct pairs of similar and dissimilar instances
• Learn a representation model that maps similar instances close together in the projection space and dissimilar instances far apart

## Contrastive Learning Objectives

### Contrastive Loss

$\mathcal{L}_\text{cont}(\mathbf{x}_i, \mathbf{x}_j, \theta) = \mathbb{1}[y_i=y_j] \| f_\theta(\mathbf{x}_i) - f_\theta(\mathbf{x}_j) \|^2_2 + \mathbb{1}[y_i\neq y_j]\max(0, \epsilon - \|f_\theta(\mathbf{x}_i) - f_\theta(\mathbf{x}_j)\|_2)^2$
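The loss above can be sketched in a few lines of PyTorch; `contrastive_loss` and its arguments are illustrative names, not from any library:

```python
import torch

def contrastive_loss(z_i, z_j, same_label, eps=1.0):
    """Pairwise contrastive loss: pull same-label embeddings together,
    push different-label embeddings at least eps apart."""
    d = torch.norm(z_i - z_j, p=2, dim=-1)                               # Euclidean distance
    pos = same_label.float() * d.pow(2)                                  # attract similar pairs
    neg = (1 - same_label.float()) * torch.clamp(eps - d, min=0).pow(2)  # repel dissimilar pairs
    return (pos + neg).mean()
```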

### Triplet Loss

Triplet loss minimizes the distance between the anchor and the positive sample while maximizing the distance between the anchor and the negative sample:

$\mathcal{L}_\text{triplet}(\mathbf{x}, \mathbf{x}^+, \mathbf{x}^-) = \sum_{\mathbf{x} \in \mathcal{X}} \max\big( 0, \|f(\mathbf{x}) - f(\mathbf{x}^+)\|^2_2 - \|f(\mathbf{x}) - f(\mathbf{x}^-)\|^2_2 + \epsilon \big)$
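A minimal sketch of the squared-distance triplet loss above (PyTorch's built-in `TripletMarginLoss` uses non-squared distances by default, so this is written by hand; the function name is illustrative):

```python
import torch

def triplet_loss(anchor, positive, negative, eps=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + eps), averaged over the batch."""
    d_pos = (anchor - positive).pow(2).sum(dim=-1)  # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=-1)  # squared distance to negative
    return torch.clamp(d_pos - d_neg + eps, min=0).mean()
```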

### N-pair Loss

Multi-Class N-pair loss generalizes triplet loss to include comparison with multiple negative samples.

Given an $(N + 1)$-tuplet of training samples $\{ \mathbf{x}, \mathbf{x}^+, \mathbf{x}^-_1, \dots, \mathbf{x}^-_{N-1} \}$, consisting of one positive and $N-1$ negative samples, the N-pair loss is defined as:

$$\begin{aligned} \mathcal{L}_\text{N-pair}(\mathbf{x}, \mathbf{x}^+, \{\mathbf{x}^-_i\}^{N-1}_{i=1}) &= \log\big(1 + \sum_{i=1}^{N-1} \exp(f(\mathbf{x})^\top f(\mathbf{x}^-_i) - f(\mathbf{x})^\top f(\mathbf{x}^+))\big) \\ &= -\log\frac{\exp(f(\mathbf{x})^\top f(\mathbf{x}^+))}{\exp(f(\mathbf{x})^\top f(\mathbf{x}^+)) + \sum_{i=1}^{N-1} \exp(f(\mathbf{x})^\top f(\mathbf{x}^-_i))} \end{aligned}$$
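The second form is a softmax cross-entropy with the positive at index 0, which is also the easiest way to implement it; a sketch with illustrative names, for a single anchor:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(f_x, f_pos, f_negs):
    """f_x, f_pos: [d] anchor and positive embeddings; f_negs: [N-1, d] negatives.
    Negative log-softmax of the positive score over all positive + negative scores."""
    pos_score = f_x @ f_pos                           # scalar similarity to the positive
    neg_scores = f_negs @ f_x                         # [N-1] similarities to the negatives
    logits = torch.cat([pos_score.view(1), neg_scores])
    return -F.log_softmax(logits, dim=0)[0]           # positive sits at index 0
```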

### NCE

$$\begin{aligned} \mathcal{L}_\text{NCE} &= - \frac{1}{N} \sum_{i=1}^N \big[ \log \sigma (\ell_\theta(\mathbf{x}_i)) + \log (1 - \sigma (\ell_\theta(\tilde{\mathbf{x}}_i))) \big] \\ \text{ where }\sigma(\ell) &= \frac{1}{1 + \exp(-\ell)} = \frac{p_\theta}{p_\theta + q} \end{aligned}$$
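NCE trains a logistic classifier to tell data samples apart from noise samples. A minimal sketch of the loss above, assuming we are given equal numbers of precomputed logits $\ell_\theta$ for data and noise samples (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def nce_loss(logit_data, logit_noise):
    """Binary NCE: real samples get label 1, noise samples label 0,
    scored by the logit l_theta(x) = log p_theta(x) - log q(x)."""
    data_term = F.logsigmoid(logit_data)    # log sigma(l(x_i))
    noise_term = F.logsigmoid(-logit_noise) # log(1 - sigma(l(x~_i)))
    return -(data_term + noise_term).mean()
```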

### InfoNCE

$\mathcal{L}_q = - \log\dfrac{\exp(q \cdot k_+ / \tau)}{\sum_i \exp(q \cdot k_i / \tau)}$
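Here $q$ is a query embedding, $k_+$ its positive key, and the sum runs over one positive and $K$ negative keys; this is exactly a cross-entropy whose "class" is the positive key. A sketch for a single query, with illustrative names:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """q: [d] query; k_pos: [d] positive key; k_negs: [K, d] negative keys."""
    # Positive logit first, then negatives; denominator includes all of them.
    logits = torch.cat([(q @ k_pos).view(1), k_negs @ q]) / tau
    labels = torch.zeros(1, dtype=torch.long)   # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), labels)
```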

## Vision

### SimCLR

The loss is the average of the per-pair losses over all positive pairs in a batch:

$L = \frac{1}{2N}\sum_{k=1}^{N}[l(2k-1,2k)+l(2k,2k-1)]$
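A compact sketch of this batch loss (the NT-Xent loss), assuming rows $2k$ and $2k+1$ of `z` are the two augmented views of sample $k$; the diagonal is masked out with a large negative constant, as in the SimCSE code below:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """z: [2N, d]; rows 2k and 2k+1 are two augmented views of the same sample.
    Computes the average of l(2k-1, 2k) + l(2k, 2k-1) over the batch."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                      # cosine similarity / temperature
    sim = sim - torch.eye(z.shape[0], device=z.device) * 1e9   # mask self-similarity
    idx = torch.arange(z.shape[0], device=z.device)
    target = idx + 1 - idx % 2 * 2                             # index of the partner view
    return F.cross_entropy(sim, target)
```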

## NLP

### SimCSE

Unsupervised SimCSE

Supervised SimCSE

SimCSE2 makes two improvements:

1. Negative-sample quality: originally both views are dropout-perturbed embeddings of the same sentence, so the two views always share the same sentence length, which biases the model toward using length as a shortcut.
2. Overly large batch sizes cause a performance drop; the cause remains an open question.
A PyTorch implementation of the two SimCSE losses:

```python
import torch
import torch.nn.functional as F


def unsup_loss(y_pred, lamda=0.05, device="cpu"):
    # y_pred: [2N, d]; rows 2k and 2k+1 are two dropout views of the same sentence.
    idxs = torch.arange(0, y_pred.shape[0], device=device)
    # The label of row 2k is 2k+1, and the label of row 2k+1 is 2k.
    y_true = idxs + 1 - idxs % 2 * 2
    similarities = F.cosine_similarity(y_pred.unsqueeze(1), y_pred.unsqueeze(0), dim=2)
    # Mask out self-similarity on the diagonal.
    similarities = similarities - torch.eye(y_pred.shape[0], device=device) * 1e12
    # Temperature scaling.
    similarities = similarities / lamda
    loss = F.cross_entropy(similarities, y_true)
    return loss


def sup_loss(y_pred, lamda=0.05, device="cpu"):
    # y_pred: [3N, d]; rows come in triples (anchor, positive, hard negative).
    row = torch.arange(0, y_pred.shape[0], 3, device=device)  # anchor rows
    col = torch.arange(y_pred.shape[0], device=device)
    col = torch.where(col % 3 != 0)[0]                        # positive/negative columns
    y_true = torch.arange(0, len(col), 2, device=device)      # index of each anchor's positive
    similarities = F.cosine_similarity(y_pred.unsqueeze(1), y_pred.unsqueeze(0), dim=2)
    # Keep only anchor rows and positive/negative columns.
    similarities = torch.index_select(similarities, 0, row)
    similarities = torch.index_select(similarities, 1, col)
    # Temperature scaling.
    similarities = similarities / lamda
    loss = F.cross_entropy(similarities, y_true)
    return loss
```