Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning

1 Introduction

1.1 Instance Discrimination

Instance discrimination defines a rule for splitting samples into positives and negatives.

Given a dataset of \(N\) images, randomly pick one image \(x_1\). Applying different Data Transformations to it yields a positive pair; the remaining images \(x_2, x_3, ..., x_N\) are the negative samples. All of them are passed through an Encoder (also called a Feature Extractor) to obtain the corresponding features (a minimal code sketch follows the list below).

  • \(x_i\): the \(i\)-th image
  • \(T_j\): a Data Transformation (Data Augmentation)
  • \(x_1^{(1)}\): the anchor (reference point)
  • \(x_1^{(2)}\): the positive sample for the anchor
  • \(x_2, x_3, ..., x_N\): the negative samples for the anchor
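
A minimal sketch of how a positive pair and negatives can be built under instance discrimination, assuming a hypothetical `augment` pipeline and a generic `encoder` (not the exact MoCo data pipeline):

```python
import torch
import torchvision.transforms as T

# Hypothetical augmentation pipeline; MoCo's actual augmentations differ in detail
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_instance_discrimination_batch(images, encoder):
    """images: list of PIL images [x_1, ..., x_N]; encoder: any feature extractor (nn.Module)."""
    x1 = images[0]
    anchor   = encoder(augment(x1).unsqueeze(0))    # x_1^(1): first augmented view (anchor)
    positive = encoder(augment(x1).unsqueeze(0))    # x_1^(2): second view of the same image
    negatives = torch.cat([encoder(augment(x).unsqueeze(0)) for x in images[1:]], dim=0)  # x_2..x_N
    return anchor, positive, negatives
```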

1.2 Momentum

Mathematically, momentum can be understood as an exponential moving average (EMA).

\(m\) is the momentum coefficient; its purpose is to keep \(Y_t\) from depending entirely on the current input \(X_t\).

$Y_t = mY_{t-1} + (1-m)X_t$
  • \(Y_t\): the output at the current step
  • \(Y_{t-1}\): the output at the previous step
  • \(X_t\): the input at the current step

The larger \(m\) is, the less \(Y_t\) depends on the current input \(X_t\).
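
A tiny numeric sketch of the EMA update, with arbitrary made-up inputs:

```python
m = 0.999                    # momentum coefficient
y = 0.0                      # Y_0
for x in [1.0, 2.0, 3.0]:    # X_1, X_2, X_3
    y = m * y + (1 - m) * x  # Y_t = m * Y_{t-1} + (1 - m) * X_t
    print(round(y, 6))       # with m close to 1, y moves only slightly toward each new x
```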

1.3 Momentum Contrast (MoCo)

2 Related Work

The relationship between Supervised Learning and Unsupervised Learning/Self-supervised Learning.

2.1 Loss Function

2.2 Pretext Tasks

3 Method

3.1 InfoNCE Loss

CrossEntropyLoss

With the \(softmax\) formula, the model's predicted probability for the true class is:

$\hat{y}_+ = softmax(z_+) = \dfrac{\exp(z_+)}{\sum_{i=1}^K \exp(z_i)}$

In Supervised Learning, the ground truth is a one-hot vector (for instance [0,1,0,0], where K=4), so the cross-entropy loss becomes (a small numeric sketch follows below):

$\begin{align*} CrossEntropyLoss & = -\sum_{i=1}^K y_i\log(\hat{y}_i) \\ & = -\log \hat{y}_+\\ & = -\log softmax(z_+)\\ & = -\log\dfrac{\exp(z_+)}{\sum_{i=1}^K \exp(z_i)} \end{align*}$
  • \(K\) is num_labels (the number of classes in the dataset)
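
A small sketch of the same computation with \(K=4\) and the one-hot target [0,1,0,0]; the logits are made-up values for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 3.0, 0.5, -1.0]])   # z_1..z_4 (made-up values), true class index 1
target = torch.tensor([1])                        # one-hot [0, 1, 0, 0] written as a class index

loss_manual  = -torch.log_softmax(logits, dim=1)[0, 1]  # -log softmax(z_+)
loss_builtin = F.cross_entropy(logits, target)          # same value via the built-in loss
print(loss_manual.item(), loss_builtin.item())
```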

InfoNCE Loss

Why can't CrossEntropyLoss be used directly as the loss function for Contrastive Learning? Because in Contrastive Learning \(K\) is very large (for example, under instance discrimination the ImageNet training set has about 1.28 million images, i.e., 1.28 million classes); a softmax over that many classes is infeasible, and the exponential operation over such a high-dimensional output makes the computation very expensive.

\(\quad\) Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query \(q\) and a set of encoded samples {\(k_0, k_1, k_2, ...\)} that are the keys of a dictionary. Assume that there is a single key (denoted as \(k_+\)) in the dictionary that \(q\) matches.
\(\quad\)A contrastive loss is a function whose value is low when \(q\) is similar to its positive key \(k_+\) and dissimilar to all other keys (considered negative keys for \(q\)). With similarity measured by dot product, one form of contrastive loss function, called InfoNCE, is (a minimal implementation sketch follows the symbol list below):

$\mathcal{L}_q = -\log \dfrac{\exp(q\cdot k_+/\tau)}{\sum_{i=0}^K \exp(q\cdot k_i/\tau)}$
  • \(q\): the anchor's feature, \(feature_{anchor}\)
  • \(k_i\ (i=0,...,K)\): one \(feature_{positive}\) plus \(K\) \(feature_{negative}\)
  • \(K\): the number of negative samples after negative sampling
  • \(\tau\): the temperature, a hyper-parameter
  • The sum is over one positive and \(K\) negative samples. Intuitively, this loss is the log loss of a (\(K+1\))-way softmax-based classifier that tries to classify \(q\) as \(k_+\).
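
A minimal sketch of InfoNCE as a \((K+1)\)-way softmax classification, along the lines of the pseudocode in the MoCo paper; the tensor shapes (`q`: N×C, `k_pos`: N×C, `k_neg`: C×K) and `tau=0.07` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """q: (N, C) queries; k_pos: (N, C) positive keys; k_neg: (C, K) negative keys."""
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)  # (N, 1): q · k_+
    l_neg = torch.einsum("nc,ck->nk", q, k_neg)               # (N, K): q · each negative key
    logits = torch.cat([l_pos, l_neg], dim=1) / tau           # (N, K+1) logits
    labels = torch.zeros(q.size(0), dtype=torch.long)         # the positive key is class 0
    return F.cross_entropy(logits, labels)                    # (K+1)-way softmax log loss
```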

In general, the query representation is \(q = f_q(x^{query})\) where \(f_q\) is an encoder network and \(x^{query}\) is a query sample (likewise, \(k = f_k(x^{key})\)). Their instantiations depend on the specific pretext task. The input \(x^{query}\) and \(x^{key}\) can be images, patches, or context consisting of a set of patches. The networks \(f_q\) and \(f_k\) can be identical, partially shared, or different.
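
A sketch of one way \(f_q\) and \(f_k\) can be instantiated in the MoCo setting: \(f_k\) starts as a copy of \(f_q\) and is updated by the momentum rule from Section 1.2 rather than by back-propagation. The ResNet-50 backbone, 128-d output, and m=0.999 are assumptions for illustration, not fixed by this note:

```python
import copy
import torch
import torchvision.models as models

f_q = models.resnet50(num_classes=128)   # query encoder, updated by back-propagation
f_k = copy.deepcopy(f_q)                 # key encoder, initialized as a copy of f_q
for p in f_k.parameters():
    p.requires_grad = False              # the key encoder receives no gradients

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q, i.e. the EMA from Section 1.2
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```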
