Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning

1 Introduction

1.1 Instance Discrimination

Instance discrimination defines a rule for splitting samples into positives and negatives.

Given a dataset of \(N\) images, randomly pick one image \(x_1\). Applying different Data Transformations to it yields a positive pair; the remaining images \(x_2, x_3, ..., x_N\) are the negative samples. All of them are passed through an Encoder (also called a Feature Extractor) to obtain the corresponding features (a minimal code sketch follows the list below).

  • \(x_i\): the \(i\)-th image
  • \(T_j\): a Data Transformation (Data Augmentation)
  • \(x_1^{(1)}\): the anchor (reference point)
  • \(x_1^{(2)}\): the positive sample for the anchor
  • \(x_2, x_3, ..., x_N\): the negative samples for the anchor
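
A minimal sketch of how a positive pair and negatives can be built under instance discrimination, assuming a hypothetical `augment` pipeline and a generic `encoder` (not the exact MoCo data pipeline):

```python
import torch
import torchvision.transforms as T

# Hypothetical augmentation pipeline; MoCo's actual augmentations differ in detail
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_instance_discrimination_batch(images, encoder):
    """images: list of PIL images [x_1, ..., x_N]; encoder: any feature extractor (nn.Module)."""
    x1 = images[0]
    anchor   = encoder(augment(x1).unsqueeze(0))    # x_1^(1): first augmented view (anchor)
    positive = encoder(augment(x1).unsqueeze(0))    # x_1^(2): second view of the same image
    negatives = torch.cat([encoder(augment(x).unsqueeze(0)) for x in images[1:]], dim=0)  # x_2..x_N
    return anchor, positive, negatives
```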

1.2 Momentum

Mathematically, momentum can be understood as an exponential moving average (EMA).

\(m\) is the momentum coefficient; its purpose is to keep \(Y_t\) from depending entirely on the current input \(X_t\).

$Y_t = mY_{t-1} + (1-m)X_t$
  • \(Y_t\): the output at the current step
  • \(Y_{t-1}\): the output at the previous step
  • \(X_t\): the input at the current step

The larger \(m\) is, the less \(Y_t\) depends on the current input \(X_t\).
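
A tiny numeric sketch of the EMA update, with arbitrary made-up inputs:

```python
m = 0.999                    # momentum coefficient
y = 0.0                      # Y_0
for x in [1.0, 2.0, 3.0]:    # X_1, X_2, X_3
    y = m * y + (1 - m) * x  # Y_t = m * Y_{t-1} + (1 - m) * X_t
    print(round(y, 6))       # with m close to 1, y moves only slightly toward each new x
```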

1.3 Momentum Contrast (MoCo)

2 Related Work

The relationship between Supervised Learning and Unsupervised Learning/Self-supervised Learning.

2.1 Loss Function

2.2 Pretext Tasks

3 Method

3.1 InfoNCE Loss

CrossEntropyLoss

With the \(softmax\) formula, the model's predicted probability for the true class is:

$\hat{y}_+ = softmax(z_+) = \dfrac{\exp(z_+)}{\sum_{i=1}^K \exp(z_i)}$

In Supervised Learning, the ground truth is a one-hot vector (for instance [0,1,0,0], where K=4), so the cross-entropy loss becomes (a small numeric sketch follows below):

$\begin{align*} CrossEntropyLoss & = -\sum_{i=1}^K y_i\log(\hat{y}_i) \\ & = -\log \hat{y}_+\\ & = -\log softmax(z_+)\\ & = -\log\dfrac{\exp(z_+)}{\sum_{i=1}^K \exp(z_i)} \end{align*}$
  • \(K\) is num_labels (the number of classes in the dataset)
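
A small sketch of the same computation with \(K=4\) and the one-hot target [0,1,0,0]; the logits are made-up values for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 3.0, 0.5, -1.0]])   # z_1..z_4 (made-up values), true class index 1
target = torch.tensor([1])                        # one-hot [0, 1, 0, 0] written as a class index

loss_manual  = -torch.log_softmax(logits, dim=1)[0, 1]  # -log softmax(z_+)
loss_builtin = F.cross_entropy(logits, target)          # same value via the built-in loss
print(loss_manual.item(), loss_builtin.item())
```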

InfoNCE Loss

Why can't CrossEntropyLoss be used directly as the loss function for Contrastive Learning? Because in Contrastive Learning \(K\) is very large (for example, under instance discrimination the ImageNet training set has about 1.28 million images, i.e., 1.28 million classes); a softmax over that many classes is infeasible, and the exponential operation over such a high-dimensional output makes the computation very expensive.

\(\quad\) Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query \(q\) and a set of encoded samples {\(k_0, k_1, k_2, ...\)} that are the keys of a dictionary. Assume that there is a single key (denoted as \(k_+\)) in the dictionary that \(q\) matches.
\(\quad\)A contrastive loss is a function whose value is low when \(q\) is similar to its positive key \(k_+\) and dissimilar to all other keys (considered negative keys for \(q\)). With similarity measured by dot product, one form of contrastive loss function, called InfoNCE, is (a minimal implementation sketch follows the symbol list below):

$\mathcal{L}_q = -\log \dfrac{\exp(q\cdot k_+/\tau)}{\sum_{i=0}^K \exp(q\cdot k_i/\tau)}$
  • \(q\): the anchor's feature, \(feature_{anchor}\)
  • \(k_i\ (i=0,...,K)\): one \(feature_{positive}\) plus \(K\) \(feature_{negative}\)
  • \(K\): the number of negative samples after negative sampling
  • \(\tau\): the temperature, a hyper-parameter
  • The sum is over one positive and \(K\) negative samples. Intuitively, this loss is the log loss of a (\(K+1\))-way softmax-based classifier that tries to classify \(q\) as \(k_+\).
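
A minimal sketch of InfoNCE as a \((K+1)\)-way softmax classification, along the lines of the pseudocode in the MoCo paper; the tensor shapes (`q`: N×C, `k_pos`: N×C, `k_neg`: C×K) and `tau=0.07` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """q: (N, C) queries; k_pos: (N, C) positive keys; k_neg: (C, K) negative keys."""
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)  # (N, 1): q · k_+
    l_neg = torch.einsum("nc,ck->nk", q, k_neg)               # (N, K): q · each negative key
    logits = torch.cat([l_pos, l_neg], dim=1) / tau           # (N, K+1) logits
    labels = torch.zeros(q.size(0), dtype=torch.long)         # the positive key is class 0
    return F.cross_entropy(logits, labels)                    # (K+1)-way softmax log loss
```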

In general, the query representation is \(q = f_q(x^{query})\) where \(f_q\) is an encoder network and \(x^{query}\) is a query sample (likewise, \(k = f_k(x^{key})\)). Their instantiations depend on the specific pretext task. The input \(x^{query}\) and \(x^{key}\) can be images, patches, or context consisting of a set of patches. The networks \(f_q\) and \(f_k\) can be identical, partially shared, or different.
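
A sketch of one way \(f_q\) and \(f_k\) can be instantiated in the MoCo setting: \(f_k\) starts as a copy of \(f_q\) and is updated by the momentum rule from Section 1.2 rather than by back-propagation. The ResNet-50 backbone, 128-d output, and m=0.999 are assumptions for illustration, not fixed by this note:

```python
import copy
import torch
import torchvision.models as models

f_q = models.resnet50(num_classes=128)   # query encoder, updated by back-propagation
f_k = copy.deepcopy(f_q)                 # key encoder, initialized as a copy of f_q
for p in f_k.parameters():
    p.requires_grad = False              # the key encoder receives no gradients

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q, i.e. the EMA from Section 1.2
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```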
