Transfer Learning: "Energy-based Out-of-distribution Detection"

Paper Information

Title: Energy-based Out-of-distribution Detection
Authors: Weitang Liu, XiaoYun Wang, John D. Owens, Yixuan Li
Venue: NeurIPS 2020
Paper: download
Code: download
Citations:

1 Preliminaries

  The essence of an energy-based model (EBM) is to build a function $E(\mathbf{x}): \mathbb{R}^{D} \rightarrow \mathbb{R}$ that maps each point $\mathbf{x}$ of the input space to a single scalar called the energy.

  A set of energy values can be turned into a probability density via the Gibbs distribution; for a classifier this takes the form of the conditional density $p(y \mid \mathbf{x})$:

    $p(y \mid \mathbf{x})=\frac{e^{-E(\mathbf{x}, y) / T}}{\int_{y^{\prime}} e^{-E\left(\mathbf{x}, y^{\prime}\right) / T}}=\frac{e^{-E(\mathbf{x}, y) / T}}{e^{-E(\mathbf{x}) / T}}\quad\quad\quad(1)$

  The Helmholtz free energy $E(\mathbf{x})$ of a data point $\mathbf{x} \in \mathbb{R}^{D}$ is then:

    $E(\mathbf{x})=-T \cdot \log \int_{y^{\prime}} e^{-E\left(\mathbf{x}, y^{\prime}\right) / T} \quad\quad\quad(2)$

  Recall the categorical distribution produced by the $\text{softmax}$ function:

    $p(y \mid \mathbf{x})=\frac{e^{f_{y}(\mathbf{x}) / T}}{\sum\limits _{i}^{K} e^{f_{i}(\mathbf{x}) / T}} \quad\quad\quad(3)$

  Comparing $\text{Eq.1}$ and $\text{Eq.3}$ gives:

    $E(\mathbf{x}, y)=-f_{y}(\mathbf{x})$

  Likewise, the free energy function $E(\mathbf{x} ; f)$ of $\mathbf{x} \in \mathbb{R}^{D}$ is:

    $E(\mathbf{x} ; f)=-T \cdot \log \sum\limits _{i}^{K} e^{f_{i}(\mathbf{x}) / T}   \quad\quad\quad(4)$
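
  In code, Eq.4 is just a negative temperature-scaled logsumexp over the classifier's logits. A minimal PyTorch sketch (the logits below are made-up values):

import torch

def free_energy(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # E(x; f) = -T * log sum_i exp(f_i(x) / T), computed per sample (Eq.4)
    return -T * torch.logsumexp(logits / T, dim=1)

# hypothetical logits for two samples with K = 3 classes
logits = torch.tensor([[5.0, 1.0, 0.5],
                       [0.2, 0.1, 0.3]])
print(free_energy(logits))  # the first (more confident) sample gets a lower energy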

2 Energy-based Out-of-distribution Detection

  OOD detection is essentially a binary discrimination problem; the energy function of the discriminative model induces a density over the inputs:

    $p(\mathbf{x})=\frac{e^{-E(\mathbf{x} ; f) / T}}{\int_{\mathbf{x}} e^{-E(\mathbf{x} ; f) / T}}\quad\quad\quad(5)$

  Note: the higher the free energy, the lower the probability density.

  Taking the logarithm of the expression above:

    $\log p(\mathbf{x})=-E(\mathbf{x} ; f) / T-\underbrace{\log Z}_{\text {constant for all } \mathbf{x}} \quad\quad\quad(6)$

  Note: low energy means high likelihood (ID), high energy means low likelihood (OOD).

  OOD detection is then performed by thresholding the negative free energy at $\tau$:

    $g(\mathbf{x} ; \tau, f)=\left\{\begin{array}{ll}0 & \text { if }-E(\mathbf{x} ; f) \leq \tau \\1 & \text { if }-E(\mathbf{x} ; f)>\tau\end{array}\right.  \quad\quad(7)$
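
  Eq.7 is then a single comparison of the score $-E(\mathbf{x} ; f)$ against the threshold $\tau$. A minimal sketch ($\tau$ is a placeholder; in practice it is chosen so that a high fraction, e.g. 95%, of ID data is retained):

import torch

def ood_detector(logits: torch.Tensor, tau: float, T: float = 1.0) -> torch.Tensor:
    # g(x; tau, f) of Eq.7: 1 = in-distribution, 0 = OOD
    score = T * torch.logsumexp(logits / T, dim=1)  # -E(x; f)
    return (score > tau).long()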

  Recall the negative log-likelihood (NLL) loss:

    $\mathcal{L}_{\mathrm{nll}}=\mathbb{E}_{(\mathbf{x}, y) \sim P^{\text {in }}}\left(-\log \frac{e^{f_{y}(\mathbf{x}) / T}}{\sum_{j=1}^{K} e^{f_{j}(\mathbf{x}) / T}}\right)\quad\quad(8)$

  Since $E(\mathbf{x}, y)=-f_{y}(\mathbf{x})$, the NLL loss can be rewritten as:

    $\mathcal{L}_{\text {nll }}=\mathbb{E}_{(\mathbf{x}, y) \sim P^{\text {in }}}\left(\frac{1}{T} \cdot E(\mathbf{x}, y)+\log \sum_{j=1}^{K} e^{-E(\mathbf{x}, j) / T}\right)  \quad\quad(9)$

  The first term pushes down the energy of the ground-truth label $y$, while the second term pulls up the energies of the other labels, as the gradient expression shows:

    $\begin{aligned}\frac{\partial \mathcal{L}_{\mathrm{nll}}(\mathbf{x}, y ; \theta)}{\partial \theta} &=\frac{1}{T} \frac{\partial E(\mathbf{x}, y)}{\partial \theta}-\frac{1}{T} \sum_{j=1}^{K} \frac{\partial E(\mathbf{x}, j)}{\partial \theta} \frac{e^{-E(\mathbf{x}, j) / T}}{\sum_{k=1}^{K} e^{-E(\mathbf{x}, k) / T}} \\&=\frac{1}{T}\Big(\underbrace{\frac{\partial E(\mathbf{x}, y)}{\partial \theta}\big(1-p(Y=y \mid \mathbf{x})\big)}_{\downarrow \text { energy push down for } y}-\underbrace{\sum_{j \neq y} \frac{\partial E(\mathbf{x}, j)}{\partial \theta}\, p(Y=j \mid \mathbf{x})}_{\uparrow \text { energy pull up for other labels }}\Big)\end{aligned}\quad\quad(10)$
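
  As a quick sanity check of Eq.9 with $T=1$: since $E(\mathbf{x}, y)=-f_{y}(\mathbf{x})$, the per-sample NLL equals $-f_{y}(\mathbf{x})+\log \sum_{j} e^{f_{j}(\mathbf{x})}$, which is exactly what F.cross_entropy computes. A minimal sketch with made-up logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # hypothetical logits f(x), K = 3
y = torch.tensor([0])                       # ground-truth label

nll_energy_form = -logits[0, 0] + torch.logsumexp(logits, dim=1)  # Eq.9 with T = 1
nll_cross_entropy = F.cross_entropy(logits, y)
print(torch.allclose(nll_energy_form, nll_cross_entropy))  # True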

3 Energy Score vs. Softmax Score

  First, derive the mathematical connection between the $\text{energy score}$ and the $\text{softmax confidence score}$:

    $\begin{aligned} \underset{y}{\text{max}} p(y \mid \mathbf{x}) & =\max _{y} \frac{e^{f_{y}(\mathbf{x})}}{\sum_{i} e^{f_{i}(\mathbf{x})}}=\frac{e^{f^{\max }(\mathbf{x})}}{\sum_{i} e^{f_{i}(\mathbf{x})}} \\& =\frac{1}{\sum_{i} e^{f_{i}(\mathbf{x})-f^{\max }(\mathbf{x})}}\end{aligned} \quad\quad(11)$

    $\begin{aligned}\Longrightarrow \log \max _{y} p(y \mid \mathbf{x}) & =E\left(\mathbf{x} ; f(\mathbf{x})-f^{\max }(\mathbf{x})\right) \\& =\underbrace{E(\mathbf{x} ; f)}_{\downarrow \text { for in-dist } \mathbf{x}}+\underbrace{f^{\max }(\mathbf{x})}_{\uparrow \text { for in-dist } \mathbf{x}}\end{aligned}\quad\quad(12)$

  Substituting $\text{Eq.6}$ with $T=1$:

    $\log \max _{y} p(y \mid \mathbf{x})=-\log p(\mathbf{x})+\underbrace{f^{\max }(\mathbf{x})-\log Z}_{\text {Not constant. Larger for in-dist } \mathbf{x}}\quad\quad(13)$

  The last two terms $f^{\max }(\mathbf{x})-\log Z$ are not constant across samples. For an ID sample the negative log-likelihood $-\log p(\mathbf{x})$ is expected to be smaller, yet its classification confidence $f^{\max }(\mathbf{x})$ is expected to be larger, so the two shifts pull the score in opposite directions and the softmax confidence is no longer well aligned with the density $p(\mathbf{x})$. This partly explains the weakness of softmax-confidence-based OOD detection.
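
  A small numerical illustration of this mismatch (the logit vectors are made up): the softmax score is invariant to shifting all logits by a constant, while the energy score is not, so two inputs with identical softmax confidence can have very different energy scores.

import torch

def softmax_score(f: torch.Tensor) -> torch.Tensor:
    # max_y p(y | x)
    return torch.softmax(f, dim=0).max()

def energy_score(f: torch.Tensor) -> torch.Tensor:
    # -E(x; f) with T = 1; higher means "more in-distribution"
    return torch.logsumexp(f, dim=0)

f_a = torch.tensor([4.0, 1.0, 0.0])  # hypothetical logits of one input
f_b = f_a - 6.0                      # the same logits shifted down by a constant
print(softmax_score(f_a), softmax_score(f_b))  # identical softmax confidence
print(energy_score(f_a), energy_score(f_b))    # energy scores differ by the shift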

4 Energy-bounded Learning

   On the same pre-trained model, the energy score already beats the softmax score, and targeted fine-tuning can widen the gap further. The authors therefore propose an energy-bounded objective for fine-tuning the network:

    $\underset{\theta}{\text{min}}\; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_{\text {in }}^{\text {train }}}\left[-\log F_{y}(\mathbf{x})\right]+\lambda \cdot L_{\text {energy }}\quad\quad(14)$

  The first term is the standard cross-entropy loss on the ID training data; the second term is an energy-based regularizer:

    $\begin{aligned}L_{\text {energy }}=\; &\mathbb{E}_{\left(\mathbf{x}_{\text {in }}, y\right) \sim \mathcal{D}_{\text {in }}^{\text {train }}}\left(\max \left(0, E\left(\mathbf{x}_{\text {in }}\right)-m_{\text {in }}\right)\right)^{2} \\&+\mathbb{E}_{\mathbf{x}_{\text {out }} \sim \mathcal{D}_{\text {out }}^{\text {train }}}\left(\max \left(0, m_{\text {out }}-E\left(\mathbf{x}_{\text {out }}\right)\right)\right)^{2}\end{aligned}\quad\quad(15)$

  It penalizes ID samples whose energy is above $m_{\text {in }}$ and OOD samples whose energy is below $m_{\text {out }}$, pushing the energy distributions of normal and anomalous data further apart.
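
  Eq.15 maps directly to a squared hinge on the free energies of the ID batch and the auxiliary OOD batch; a minimal sketch (the margins m_in = -25 and m_out = -7 are the defaults used in the released code shown below):

import torch
import torch.nn.functional as F

def energy_regularizer(logits_in, logits_out, m_in=-25.0, m_out=-7.0, T=1.0):
    # squared-hinge regularizer of Eq.15 on E(x) = -T * logsumexp(f(x) / T)
    e_in = -T * torch.logsumexp(logits_in / T, dim=1)
    e_out = -T * torch.logsumexp(logits_out / T, dim=1)
    return (F.relu(e_in - m_in) ** 2).mean() + (F.relu(m_out - e_out) ** 2).mean()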

  Overall framework:

  (framework figure from the paper, not reproduced here)

  Code:

# imports and globals (net, optimizer, scheduler, args, state, train_loader_in,
# train_loader_out, test_loader) are assumed to be defined elsewhere in the script
import numpy as np
import torch
import torch.nn.functional as F


def train():
    net.train()  # enter train mode
    loss_avg = 0.0

    # start at a random point of the outlier dataset; this induces more randomness
    # without obliterating locality
    train_loader_out.dataset.offset = np.random.randint(len(train_loader_out.dataset))
    for in_set, out_set in zip(train_loader_in, train_loader_out):
        # concatenate the ID batch and the auxiliary OOD batch; only ID samples have labels
        data = torch.cat((in_set[0], out_set[0]), 0)
        target = in_set[1]
        data, target = data.cuda(), target.cuda()

        # forward
        x = net(data)

        # backward
        scheduler.step()  # per-iteration learning-rate schedule
        optimizer.zero_grad()

        # standard cross-entropy on the ID part of the batch (first term of Eq.14)
        loss = F.cross_entropy(x[:len(in_set[0])], target)
        if args.score == 'energy':
            # free energies E(x; f) = -logsumexp(f(x)) of the ID and OOD samples
            Ec_in = -torch.logsumexp(x[:len(in_set[0])], dim=1)
            Ec_out = -torch.logsumexp(x[len(in_set[0]):], dim=1)
            # '--m_in',  type=float, default=-25., help='margin for in-distribution; above this value will be penalized'
            # '--m_out', type=float, default=-7.,  help='margin for out-distribution; below this value will be penalized'
            # squared-hinge energy regularizer of Eq.15, weighted by lambda = 0.1
            loss += 0.1 * (torch.pow(F.relu(Ec_in - args.m_in), 2).mean() +
                           torch.pow(F.relu(args.m_out - Ec_out), 2).mean())
        elif args.score == 'OE':
            # Outlier Exposure baseline: cross-entropy from the softmax distribution
            # on OOD samples to the uniform distribution
            loss += 0.5 * -(x[len(in_set[0]):].mean(1) - torch.logsumexp(x[len(in_set[0]):], dim=1)).mean()
        loss.backward()
        optimizer.step()

        # exponential moving average of the training loss
        loss_avg = loss_avg * 0.8 + float(loss) * 0.2
    state['train_loss'] = loss_avg


# test function (classification accuracy and loss on the ID test set)
def test():
    net.eval()
    loss_avg = 0.0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.cuda(), target.cuda()
            # forward
            output = net(data)
            loss = F.cross_entropy(output, target)
            # accuracy
            pred = output.data.max(1)[1]
            correct += pred.eq(target.data).sum().item()
            # test loss average
            loss_avg += float(loss.data)

    state['test_loss'] = loss_avg / len(test_loader)
    state['test_accuracy'] = correct / len(test_loader.dataset)
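
  Note that test() above only tracks classification accuracy on the ID test set. For the OOD evaluation itself, the energy score -E(x; f) is computed from the logits of ID and OOD test samples and compared, e.g. via AUROC. A minimal sketch, assuming a hypothetical OOD test loader named ood_loader and scikit-learn being available:

import numpy as np
import torch
from sklearn.metrics import roc_auc_score

def collect_energy_scores(net, loader, T=1.0):
    # return -E(x; f) = T * logsumexp(f(x) / T) for every sample in the loader
    net.eval()
    scores = []
    with torch.no_grad():
        for data, _ in loader:
            logits = net(data.cuda())
            scores.append(T * torch.logsumexp(logits / T, dim=1).cpu())
    return torch.cat(scores).numpy()

# higher score = more in-distribution; label ID as 1 and OOD as 0
s_in = collect_energy_scores(net, test_loader)   # ID test set
s_out = collect_energy_scores(net, ood_loader)   # hypothetical OOD test loader
labels = np.concatenate([np.ones_like(s_in), np.zeros_like(s_out)])
print('AUROC:', roc_auc_score(labels, np.concatenate([s_in, s_out])))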

 
