[Papers] Semantic Segmentation Papers(1)

Tags: Paper


A summary of several semantic segmentation papers I have read: FCN, DeconvNet, SegNet, and U-Net. I will summarize the DeepLab papers in a later post.

FCN

Abstract

The paper proposes an end-to-end FCN that takes an image of arbitrary size as input and outputs a label map of the same size. The skip architecture in FCN combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

Introduction

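A classification network with supervised pre-training is used for pixel-wise prediction.
The core difficulty in semantic segmentation is the inherent tension between semantic information and location information.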

FCN

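FCN is the pioneering work in applying deep learning to segmentation. It laid the foundation for later networks such as U-Net, ENet, and SegNet, especially through the idea of using deconvolution to upsample the coarse map.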
gmkiB.png

Adapting classifiers for dense prediction

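A fully connected layer can be viewed as a special case of convolution whose kernel covers the entire feature map. Once the final fully connected layers are removed (recast as convolutions), the network outputs a label map, and adding a spatial loss makes end-to-end dense learning possible.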
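The equivalence described above can be checked on a toy case. This is only a minimal sketch in plain Python: the 2x2 weights, the inputs, and the function names are hypothetical and are not taken from the FCN paper.

```python
def fully_connected(feature_map, weights, bias):
    """Flatten an H x W feature map and apply a single fully connected unit."""
    flat_x = [v for row in feature_map for v in row]
    flat_w = [w for row in weights for w in row]
    return sum(x * w for x, w in zip(flat_x, flat_w)) + bias


def conv2d_valid(image, kernel, bias):
    """'Valid' 2D cross-correlation with stride 1 (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


weights = [[1, 0], [0, -1]]                  # hypothetical "FC" weights, reshaped to 2x2
small = [[3, 1], [2, 5]]                     # input size the classifier expects
large = [[3, 1, 4], [2, 5, 9], [2, 6, 5]]    # a larger, arbitrary-size input

fc_out = fully_connected(small, weights, bias=1)
conv_small = conv2d_valid(small, weights, bias=1)   # one output: same value as the FC unit
conv_large = conv2d_valid(large, weights, bias=1)   # a 2x2 map: one score per location
```

On the classifier-sized input the convolution gives exactly the fully connected output; on the larger input the same weights slide over the image and yield a coarse score map, which is what makes arbitrary input sizes possible.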

Shift-and-stitch is filter rarefaction

rarefaction: thinning out, i.e. making something sparser

a trous algorithm

FCN also mentions atrous (dilated) convolution, which DeepLab later uses.
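For readers unfamiliar with atrous convolution, here is a minimal 1D sketch. The function name and the example signal are illustrative assumptions; neither FCN nor DeepLab is implemented this way.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation whose kernel taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1      # effective receptive field
    return [
        sum(k * signal[i + t * dilation] for t, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = dilated_conv1d(signal, kernel, dilation=1)    # taps at i, i+1, i+2
atrous = dilated_conv1d(signal, kernel, dilation=2)   # taps at i, i+2, i+4
```

With dilation 2 the same three weights cover five input samples, so the receptive field grows without extra parameters and without pooling.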

Upsampling is backwards strided convolution

In a sense, upsampling with factor \(f\) is convolution with a fractional input stride of \(1/f\). So long as \(f\) is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of \(f\).
This passage is a bit hard to follow.

Explanations of deconvolution:

  1. https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers
  2. https://github.com/vdumoulin/conv_arithmetic
    stride two and padding:
    a.jpg

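It is called transposed convolution because it is commonly used in the backward pass, where backpropagation can be computed by multiplying by the transpose of the weight matrix. The filter in the figure clearly moves with stride 1, so why is it described as stride 2? The stride of 2 is relative to the original input (before zero insertion): because a zero has been inserted between neighbouring pixels, a stride of 1 on the zero-filled map corresponds to a stride of 2 on the original.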

Shown above is a transposed convolution. 'stride two' means stride in the corresponding original convolution is two. This is precisely why you have 1 (=2-1, 2 being the original stride) layer of zeros in between rows and columns. Transposed convolution is generally used in backward pass. It is called transposed because of the analogy with fully connected layer where you multiply with the transpose of the weight matrix during a backward pass.
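A small 1D sketch may make the two views concrete: scattering each input value through the kernel (the transpose of a strided convolution) gives the same result as inserting zeros and then running a stride-1 convolution. The function names and the example numbers are hypothetical.

```python
def transposed_conv1d_scatter(x, kernel, stride):
    """Transpose of a strided convolution: each input adds a scaled kernel copy."""
    out = [0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for t, k in enumerate(kernel):
            out[i * stride + t] += value * k
    return out


def transposed_conv1d_zero_insert(x, kernel, stride):
    """Same result via zero insertion followed by a stride-1 convolution."""
    spaced = []
    for i, value in enumerate(x):
        spaced.append(value)
        if i < len(x) - 1:
            spaced.extend([0] * (stride - 1))    # stride-1 zeros between inputs
    pad = [0] * (len(kernel) - 1)
    padded = pad + spaced + pad                  # "full" padding
    flipped = kernel[::-1]                       # true convolution flips the kernel
    return [
        sum(flipped[t] * padded[i + t] for t in range(len(kernel)))
        for i in range(len(padded) - len(kernel) + 1)
    ]


x = [1, 2, 3]
kernel = [1, 10, 100]
by_scatter = transposed_conv1d_scatter(x, kernel, stride=2)
by_zero_insert = transposed_conv1d_zero_insert(x, kernel, stride=2)
```

Both functions return a length-7 output from a length-3 input, which is the upsampling by a factor of roughly 2 that "stride two" refers to.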

Patchwise training is loss sampling

Segmentation Architecture

The authors' fully convolutional network mainly consists of in-network upsampling and a pixelwise loss, together with the skip architecture.
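The skip architecture can be sketched as "upsample the coarse score map, then add it to the finer one". In the sketch below nearest-neighbour upsampling stands in for the in-network (learned) upsampling, which is a simplifying assumption; the function names and score values are hypothetical.

```python
def upsample_nearest(score, factor):
    """Nearest-neighbour upsampling; a stand-in for learned upsampling."""
    out = []
    for row in score:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_skip(coarse, fine, factor=2):
    """Upsample the deep, coarse scores and add the shallow, fine scores."""
    up = upsample_nearest(coarse, factor)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[10, 20]]                      # deep layer: 1 x 2 score map
fine = [[1, 2, 3, 4], [5, 6, 7, 8]]      # shallow layer: 2 x 4 score map
fused = fuse_skip(coarse, fine)
```

The fused map keeps the coarse layer's class evidence while the fine layer contributes per-pixel detail.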

Learning Deconvolution Network for Semantic Segmentation

Abstract

deep deconvolution + proposal-wise prediction

The deconvolution network is composed of deconvolution layers and upsampling (unpooling) layers.

1. Introduction

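Most existing CNN-based semantic segmentation networks apply a deconvolution based on bilinear interpolation to the label map produced by the preceding classification network (16*16 in FCN). However, the input to this deconvolution is a feature map that has already passed through convolution and pooling and has lost many structured details, so deconvolution alone often does not give good results.
Some methods use FCN + Conditional Random Field to address this problem.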

FCN:
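Because its receptive field has a fixed size, FCN cannot classify objects that are too small, and it predicts multiple classes for objects that are too large (size here is relative to the receptive field).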
FCN+CRF

3. System Architecture

gHLJK.png
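The encoder is the VGG classification network. The decoder is a deconvolution network that applies unpooling and deconvolution to the feature map produced by the classification network. The network outputs a probability map giving, for each pixel, the probability of belonging to each class, from which the per-pixel class label is obtained. It is worth noting up front that DeconvNet does not remove the fully connected layers of VGG. Those layers hold a large number of parameters, so the trained model takes up a lot of space. For an application-level product it is better to use SegNet (covered below), which removes the fully connected layers and therefore trains faster and uses much less memory.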

Unpooling和Deconvolution

Unpooling

What is pooling?

Pooling in convolution network is designed to filter noisy
activations in a lower layer by abstracting activations in a
receptive field with a single representative value.

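Although pooling makes activations more robust, it also discards the spatial information within the receptive field. This structural information can matter a great deal for segmentation, which requires dense prediction.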

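How is unpooling implemented?
Record the location of each maximum activation during pooling, then use the recorded locations to put the activations back in place when unpooling.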
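The pooling and unpooling pair can be sketched in plain Python as follows. This is a minimal illustration assuming non-overlapping 2x2 windows and input sizes divisible by the window size; the function names are hypothetical.

```python
def max_pool_with_indices(fmap, size=2):
    """Max-pool non-overlapping windows and remember where each maximum was."""
    pooled, indices = [], []
    for i in range(0, len(fmap), size):
        pooled_row, index_row = [], []
        for j in range(0, len(fmap[0]), size):
            window = [(fmap[a][b], (a, b))
                      for a in range(i, i + size)
                      for b in range(j, j + size)]
            value, location = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(location)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, height, width):
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for value_row, index_row in zip(pooled, indices):
        for value, (a, b) in zip(value_row, index_row):
            out[a][b] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 0, 3, 4]]
pooled, indices = max_pool_with_indices(fmap)
restored = unpool(pooled, indices, height=4, width=4)
```

The restored map has the original resolution but is sparse: only one entry per window is non-zero, which is why a deconvolution (or convolution) layer follows.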

deconvolution

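The output of unpooling is sparse. Deconvolution turns it into an enlarged, dense activation map. The border pixels of the enlarged map are then cropped so that the feature map has the same size as the map that came out of the unpooling layer.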

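Unpooling and deconvolution play different roles in the network: unpooling is example-specific, while deconvolution is class-specific. Example-specific means that, for whatever object is present, unpooling reconstructs the object's structure from the location information recorded during pooling. But every pixel has to be classified, so recovering the object structure is not enough: the map still contains noise and information from non-target classes. Deconvolution amplifies the target-class information and suppresses the non-target information. Combining the two, the deconvolution network on the decoder side can output a fairly accurate segmentation map.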

In these two respects the decoders of DeconvNet and SegNet are very similar: upsampling produces a sparse activation map, and deconv/conv layers then turn it into a dense activation map.

gHx9u.png

The activation map visualizations below also show that the encoder progressively abstracts features (detail to coarse), while the decoder goes from coarse to detail:
gH2b9.png

instance wise segmentation vs. image level segmentation

I did not really follow this part.

Training

  1. Batch Normalization
  2. Two-stage Training
  3. ensemble with FCN
    Detailed network structure:
    gTZhh.png

Inference

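At test time, before an image is fed to the network, the authors use edge-box to generate candidate proposals, which allows objects to be detected at different scales. For each test image 2000 candidates are generated, and the top 50 by objectness score are fed to the network. The instance-wise segmentation mentioned earlier is probably related to this step; I feel the authors do not describe it in much detail.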

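Overall, the idea behind DeconvNet is fairly novel (I do not know whether SegNet borrowed from it), but the network is clearly too deep and hard to train, it keeps the fully connected layers, and it needs edge-box to generate candidate proposals, so it is not an end-to-end network. For practical use I would still recommend SegNet.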

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Abstract

The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample.

1. Introduction

Reusing the encoder's max-pooling indices in the decoder:

  1. improves boundary delineation
  2. reduces the number of parameters enabling end-to-end training

Architecture

gHIQa.png

No fully connected layers (parameters reduced from 134M to 14.7M)

encoder

conv + batchnorm + ReLU + max pooling(2*2)

To retain the spatial information that max pooling discards, SegNet stores the max-pooling indices.

decoder

Upsample the feature maps using the max-pooling indices (giving sparse feature maps) + trainable filter banks + batch norm

Several decoder variants are compared.

Training

  • median frequency balancing
  • natural frequency balancing
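Median frequency balancing weights each class by the median class frequency divided by that class's frequency, so rare classes get larger loss weights. The sketch below uses one common definition of frequency (pixels of the class divided by the total pixels of the images that contain it); that definition, the function name, and the toy label map are assumptions, and every class is assumed to appear at least once.

```python
from statistics import median


def median_frequency_weights(label_maps, num_classes):
    """Per-class loss weights: median(freq) / freq(c)."""
    pixel_count = [0] * num_classes    # pixels labelled c
    image_pixels = [0] * num_classes   # pixels of images in which c appears
    for labels in label_maps:
        flat = [c for row in labels for c in row]
        for c in set(flat):
            pixel_count[c] += flat.count(c)
            image_pixels[c] += len(flat)
    freq = [pixel_count[c] / image_pixels[c] for c in range(num_classes)]
    med = median(freq)
    return [med / f for f in freq]


# One toy 2 x 4 label map: class 0 is common, class 2 is rare.
labels = [[0, 0, 0, 0],
          [1, 1, 2, 0]]
weights = median_frequency_weights([labels], num_classes=3)
```

Here the common class 0 is down-weighted and the rare class 2 is up-weighted, while the median class keeps a weight of 1.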

analysis

BF: boundary F1 measure

SegNet and DeconvNet are similar in that both store the max-pooling indices in the encoder and use them in the decoder to upsample the feature map. The resulting feature map is still sparse, so it is followed by convolutional (SegNet) or deconvolutional (DeconvNet) layers to produce a better feature map. The difference is that SegNet has no fully connected layers, which makes it a lighter framework.

U-Net

Abstract

  • use data augmentation to train the model
  • contracting path to capture context
  • symmetric expanding path enables precise localization

Introduction

  • High resolution features from the contracting path are combined with the upsampled output
  • overlap-tile strategy (I did not really follow this part)
  • elastic deformation for augmentation
  • a weighted loss is used to handle the touching-border problem (separating objects that touch each other)

Network Architecture

g48Eq.png

The left side is the contracting path and the right side is the expansive path.

The left side uses 3*3 convolution + ReLU + 2*2 max-pooling; the number of feature channels doubles at each pooling step.
The right side uses upsampling + 2*2 convolution (halving the number of feature channels) + concatenation with the corresponding feature map from the contracting path + 3*3 convolution + ReLU.
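The spatial sizes through the two paths can be traced with a little arithmetic. The sketch assumes unpadded 3*3 convolutions (two per level), four pooling steps, and cropping of the contracting-path maps before concatenation, as in the original U-Net paper; these details are not stated in the notes above, and the function name is hypothetical.

```python
def unet_output_size(input_size, depth=4):
    """Trace the feature-map side length through a U-Net with unpadded convs."""
    size = input_size
    skip_sizes = []
    for _ in range(depth):
        size -= 4                      # two unpadded 3*3 convs: -2 each
        skip_sizes.append(size)        # map kept for the skip connection
        if size % 2:
            raise ValueError("2*2 pooling needs an even side length")
        size //= 2                     # 2*2 max-pooling
    size -= 4                          # two convs at the bottom of the U
    for skip in reversed(skip_sizes):
        size *= 2                      # upsampling + 2*2 convolution
        if skip < size:
            raise ValueError("skip map too small to crop")
        size -= 4                      # two unpadded 3*3 convs after concat
    return size


out_size = unet_output_size(572)
```

Under these assumptions a 572*572 input tile gives a 388*388 output, so the output covers only the centre of the input.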

Training

  • energy function:

\[E = \sum_{x\in\Omega} w(x)\log(p_{l(x)}(x)) \]

  • weight map:

\[w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x)+d_2(x))^2}{2\sigma^2}\right) \]

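  • weight initialization for each layer: Gaussian distribution with standard deviation \(\sqrt{2/N}\), where \(N\) is the number of incoming nodes of one neuron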
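The three formulas above can be evaluated per pixel as follows. This is a sketch only: the defaults w0 = 10 and sigma = 5 follow the values reported in the U-Net paper, the distances d1 and d2 (to the nearest and second-nearest cell border) are assumed to be precomputed, and the function names are hypothetical.

```python
import math


def border_weight(class_weight, d1, d2, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1 + d2)^2 / (2 * sigma^2))."""
    return class_weight + w0 * math.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))


def weighted_energy(weights, true_class_probs):
    """E = sum_x w(x) * log(p_{l(x)}(x)), written as in the notes.

    E is at most 0; a training loss would typically minimise -E.
    """
    return sum(w * math.log(p) for w, p in zip(weights, true_class_probs))


def init_std(fan_in):
    """Standard deviation sqrt(2 / N) for N incoming nodes."""
    return math.sqrt(2.0 / fan_in)


on_border = border_weight(1.0, d1=0, d2=0)       # pixel squeezed between two cells
far_away = border_weight(1.0, d1=50, d2=50)      # pixel far from any border
energy = weighted_energy([on_border, far_away], [0.5, 0.9])
std_3x3_64 = init_std(3 * 3 * 64)                # 3*3 conv over 64 input channels
```

Pixels in the thin gap between touching cells get a weight close to w_c + w0, so errors there are penalised far more than errors in the background.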

Experiments

Good results are achieved on both medical datasets.

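Overall, the U-Net architecture is fairly simple and, according to the authors, well suited to small datasets: the first dataset, from the EM segmentation challenge, contains only 30 images (512*512).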

posted @ 2018-06-13 21:29  VincentCheng