【CoSOD】Re-thinking Co-Salient Object Detection

Original document: https://www.yuque.com/lart/papers/feumut

image.png

A recent survey-style paper on CoSOD: it organizes the methods in this field, proposes a new dataset, and, building on the CVPR version, further proposes a new method.

CoSOD

What Is CoSOD

  • As an extension of this, co-salient object detection (CoSOD) emerged recently to employ a set of images.
  • The goal of CoSOD is to extract the salient object(s) that are common within a single image (e.g., red-clothed football players in Fig. 1 (b)) or across multiple images (e.g., the blue-clothed gymnast in Fig. 1 (c)).
  • Two important characteristics of co-salient objects are local saliency and global similarity.

image.png

Applications

  • collection-aware crops

    • Cosaliency: Where people look when comparing images
  • co-segmentation

    • Higher-order image co-segmentation
    • Object-Based Multiple Foreground Video Co-Segmentation via Multi-State Selection Graph
  • weakly supervised learning

    • Capsal: Leveraging captioning to boost semantics for salient object detection
  • image retrieval

    • A model of visual attention for natural image retrieval
    • Salientshape: group saliency in image collections
  • video foreground detection

    • Cluster-based co-saliency detection

现有数据集

  1. MSRC [Object categorization by learned universal visual dictionary] and **Image Pair** [A co-saliency model of image pairs] are two of the earliest ones.

    1. MSRC was designed for recognizing object classes from images and has spurred many interesting ideas over the past several years. This dataset includes 8 image groups and 240 images in total, with manually annotated pixel-level ground-truth data.
    2. Image Pair, introduced by Li et al. [29], was specifically designed for image pairs and contains 210 images (105 groups) in total.
  2. The iCoSeg [icoseg: Interactive co-segmentation with intelligent scribble guidance] dataset was released in 2010. It is a relatively larger dataset consisting of 38 categories with 643 images in total.

    1. Each image group in this dataset contains 4 to 42 images,
    2. rather than only 2 images like in the Image Pair dataset.
  3. The THUR15K [Salientshape: group saliency in image collections] and CoSal2015 [Co-saliency detection via looking deep and wide] are two large-scale publicly available datasets, with CoSal2015 widely used for assessing CoSOD algorithms.

  4. Different from the above-mentioned datasets, the WICOS [Co-saliency detection within a single image] dataset aims to detect co-salient objects from a single image, where each image can be viewed as one group.

Problems with Existing Datasets

  • Although the aforementioned datasets have advanced the CoSOD task to various degrees, they are severely limited in variety, with only dozens of groups. On such small-scale datasets, the scalability of methods cannot be fully evaluated.
  • Moreover, these datasets only provide object-level labels. **None of them provide rich annotations such as bounding boxes, instances, etc.,** which are important for progressing many vision tasks and multi-task modeling, especially in the current deep learning era, where models are often data-hungry.
  • Most CoSOD datasets tend to focus on the appearance-similarity between objects to identify the co-salient object across multiple images. However, this leads to data selection bias [Salient objects in clutter: Bringing salient object detection to the foreground], [Unbiased look at dataset bias] and is not always appropriate, since, in real-world applications, the salient objects in a group of images often vary in terms of texture, scene, and background, even if they belong to the same category.

Evaluating CoSOD

Limitations of Existing Evaluation Practices

  • Completeness: more metrics should be introduced into the evaluation, e.g., S-measure and E-measure.
  • Fairness: the F-measure must be computed on binarized predictions, and different binarization strategies lead to different results, so a common benchmark codebase is needed for evaluation (see the sketch below).
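To make the fairness point concrete, here is a minimal Python sketch with toy data (the 2x-mean adaptive threshold is a common convention in the SOD literature, not something this paper prescribes) showing that two binarization strategies give different F-measures for the same prediction:

```python
import numpy as np

def f_measure(pred_bin, gt, beta2=0.3):
    """F-measure of one binarized prediction against a GT mask (beta^2 = 0.3 by convention)."""
    tp = np.logical_and(pred_bin, gt).sum()
    precision = tp / (pred_bin.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

rng = np.random.default_rng(0)
pred = rng.random((64, 64))        # toy saliency map in [0, 1]
gt = pred > 0.7                    # toy ground-truth mask

# Two common binarization strategies applied to the same prediction:
adaptive = pred >= min(2 * pred.mean(), 1.0)   # adaptive threshold: twice the map's mean
fixed = pred >= 0.5                            # one fixed threshold
print(f_measure(adaptive, gt), f_measure(fixed, gt))  # the two scores generally differ
```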

To address the aforementioned limitations, we argue that integrating various publicly available CoSOD algorithms, datasets, and metrics, and then providing a complete, unified benchmark, is highly desired.

Differences Between CoSOD and SOD Evaluation

CoSOD involves grouping, i.e., each metric is computed within each group (the object that commonly appears across the images in a group is typically the co-salient object). One detail deserves attention:

  • For metrics whose values can be obtained directly (e.g., MAE, S-measure, weighted F-measure, adaptive F-measure, and adaptive E-measure), the scores are first averaged within each group, and the per-group means are then averaged once more over all groups.
  • For metrics computed by varying the threshold (e.g., max F-measure, mean F-measure, max E-measure, and mean E-measure), a length-256 sequence is first averaged within each group, and the group sequences are then averaged over all groups; taking the maximum or the mean of the resulting length-256 sequence yields the corresponding metric value (see the sketch below).
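A minimal Python sketch of this group-wise averaging logic (function and argument names are mine, not taken from Fan's toolbox):

```python
import numpy as np

def grouped_value_metric(scores_per_group):
    """scores_per_group: {group: [per-image score]} for directly computed metrics
    (MAE, S-measure, weighted/adaptive F-measure, adaptive E-measure)."""
    group_means = [np.mean(s) for s in scores_per_group.values()]
    return float(np.mean(group_means))  # average the per-group means over all groups

def grouped_curve_metric(curves_per_group):
    """curves_per_group: {group: [per-image length-256 curve]} for threshold-swept
    metrics (max/mean F-measure, max/mean E-measure)."""
    group_curves = [np.stack(c).mean(axis=0) for c in curves_per_group.values()]
    curve = np.stack(group_curves).mean(axis=0)     # final length-256 curve
    return float(curve.max()), float(curve.mean())  # max-metric, mean-metric
```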

For the precise definitions of each metric, see my Python code or the MATLAB code provided by Fan.

Note that the links provided here point to metric code for SOD or COD task data.

Because of the group-wise computation required by the CoSOD task, adjustments are needed; see the other evaluation code for CoSOD provided by Fan. Its metrics are not comprehensive and the code contains some errors (the same errors pointed out here: https://github.com/DengPingFan/CODToolbox/issues), but its computation logic is a useful reference.

I have recently put together a Python implementation (not yet public) that covers more metrics (according to this paper, all SOD metrics can in fact be applied to CoSOD) and runs faster.

My thoughts on accelerating the E-measure computation can be found in the following two articles:

Contributions of This Paper

  • Proposes the CoSOD3k dataset, containing 13 super-classes, 160 groups, and 3,316 images.
  • Summarizes 34 related works, evaluates 16 models, and provides a set of evaluation code.
  • Proposes a simple yet effective CoSOD framework that builds on existing SOD methods to handle CoSOD effectively.
  • Analyzes the results and offers some suggestions for future work.

CoSOD3k

The figures and tables convey this more directly than textual analysis.

image.png

Statistics of data attributes across datasets; the proposed dataset clearly provides rich annotation types.

image.png

Statistics of object attributes across datasets.

image.png

Category statistics of CoSOD3k.

image.png

The overall dataset mask (the right of Fig. 7) tends to appear as a center-biased map without shape bias. As is well-known, humans are usually inclined to pay more attention to the center of a scene when taking a photo. Thus, it is easy for a SOD model to achieve a high score when employing a Gaussian function in its algorithm.
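To illustrate this center bias, here is a hypothetical Gaussian center-prior baseline (the sigma value is an arbitrary choice); on a strongly center-biased dataset, even such a trivial map can already score non-trivially:

```python
import numpy as np

def center_prior(h, w, sigma=0.3):
    # A naive center-biased "saliency" baseline: an isotropic Gaussian at the image center
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    d2 = ((ys - cy) / h) ** 2 + ((xs - cx) / w) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))   # values in (0, 1], peaking at the center
```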

CoEG-Net

image.png

This paper proposes a two-branch framework that captures concurrent dependencies and salient foregrounds separately, in a multiply independent fashion. The final co-saliency prediction is produced by element-wise multiplication of the co-attention maps obtained by the upper branch and the saliency prior maps obtained by the lower branch (see the sketch after the list below).

  • The lower saliency branch is relatively simple: it directly uses an EGNet trained on DUTS to collect multi-scale saliency priors. This helps identify the salient regions within each image without exploiting cross-image information.
  • The upper branch generates the co-attention map in an unsupervised manner. This part deserves a closer look.
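The fusion itself is just an element-wise product. A minimal sketch, where the min-max normalization is my assumption (the paper only specifies the multiplication):

```python
import numpy as np

def minmax(x, eps=1e-8):
    # Min-max normalize a map to [0, 1]
    return (x - x.min()) / (x.max() - x.min() + eps)

def fuse(co_attention, saliency_prior):
    # Final co-saliency prediction: element-wise product of the upper-branch
    # co-attention map and the lower-branch EGNet saliency prior, both (H, W)
    return minmax(co_attention) * minmax(saliency_prior)
```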

Co-attention Projection for Co-saliency Learning

The design here is inspired by CAM [Learning deep features for discriminative localization]:

  • Given an input image \(\mathbf{I}^n\) with image-level category label (keyword labeling) \(c\),
  • obtain the feature activation maps \(\mathbf{X}^n\) from the last convolutional layer of a VGG network.
  • With class supervision, the weights \(\omega\) corresponding to each channel of the convolutional activations can be obtained for class \(c\) (e.g., from the parameters of the fully connected layer of the classification task).
  • The final class-specific attention map is \(\mathbf{M}^n_c=\sum^K_{k=1}\omega^c_k\mathbf{X}^n_k\).
  • For each position on the feature map \(\mathbf{X}^n\), this can be written more concretely as \(\mathbf{M}^n_c(i, j)=(\omega^c)^\top \cdot \mathbf{x}^n(i, j)\).

Thus, CAM in effect implements a linear transformation from the features \(\mathbf{x}^n(i, j)\) to the class-specific activation map \(\mathbf{M}^n_c(i, j)\).
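As a sanity check on this linear-transformation view, a minimal numpy sketch of CAM (the shapes are assumptions for illustration):

```python
import numpy as np

def cam(X, w_c):
    """X: (K, H, W) activations of the last conv layer; w_c: (K,) FC weights of class c.
    Returns M_c with M_c(i, j) = (w_c)^T x(i, j)."""
    k, h, w = X.shape
    return (w_c @ X.reshape(k, -1)).reshape(h, w)  # one dot product per spatial position
```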

This paper follows that idea and, since no category labels are available in its setting, explores it further in an unsupervised manner.

The authors give their own analysis:

Ideally, the unknown common object category among a group of associated images \(\{\mathbf{I}^n\}^N_{n=1}\) should correspond to a linear projection that results in high class activation scores in the common object regions, while having low class activation scores in other image regions.

From another point of view, the common object category should correspond to the linear transformation that generates the highest variance (most informative) in the resulting class activation maps.

Following the idea of the coarse localization task [Unsupervised object discovery and co-localization by deep descriptor transformation], we achieve this goal by exploring the classical principal component analysis (PCA) method [LIII. On lines and planes of closest fit to systems of points in space], which is the simplest way of revealing the internal structure of the data in a way that best explains the variance in the data.

I find this explanation somewhat forced; the logic does not feel entirely coherent: high class activation scores =?> the highest variance (most informative).

Next comes a quick review of PCA:

  • Given \(\{\mathbf{I}^n\}\), we obtain \(\{\mathbf{X}^n\}\).
  • The aim is a transformation that produces, from \(\{\mathbf{X}^n\}\), the co-attention maps \(\{\mathbf{A}^n\}\) with the largest variance (note that this is a group of results). The transformation is obtained by analyzing the covariance matrix of the feature descriptors \(\{\mathbf{x}^n(i, j)\}\).
  • Compute the mean \(\bar{\mathbf{x}} = \frac{1}{Z}\sum_n\sum_{i, j}\mathbf{x}^n(i, j)\), where \(Z = N \times H \times W\) is the total number of descriptors.
  • Subtract the mean from \(\mathbf{x}^n(i, j)\) to obtain the zero-mean descriptors \(\hat{\mathbf{x}}^n(i, j)\).
  • Then compute the covariance matrix:

\(\mathrm{Cov} = \frac{1}{Z}\sum_n\sum_{i, j}\left(\hat{\mathbf{x}}^n(i, j) - \bar{\mathbf{x}}\right)\left(\hat{\mathbf{x}}^n(i, j) - \bar{\mathbf{x}}\right)^\top\)

(Although this is how the original paper writes it, why is the mean subtracted again here?)

  • The linear transformation is then given by the eigenvector of Cov corresponding to its largest eigenvalue:

\(\xi^* = \arg\max_{\|\xi\|_2 = 1} \xi^\top \mathrm{Cov}\, \xi, \qquad \mathbf{A}^n(i, j) = (\xi^*)^\top \hat{\mathbf{x}}^n(i, j)\)

Here \(\xi^*\) denotes the corresponding eigenvector. A sketch of the whole projection follows.
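Putting the projection together, a minimal numpy sketch (it omits the paper's extra mean subtraction questioned above and fixes the arbitrary eigenvector sign, both my choices):

```python
import numpy as np

def co_attention_maps(feats):
    """feats: (N, K, H, W) conv activations for a group of N images.
    Returns (N, H, W) co-attention maps via the top principal direction."""
    n, k, h, w = feats.shape
    X = feats.transpose(0, 2, 3, 1).reshape(-1, k)  # (Z, K) descriptors, Z = N*H*W
    X_hat = X - X.mean(axis=0)                      # zero-mean descriptors
    cov = X_hat.T @ X_hat / X_hat.shape[0]          # (K, K) covariance matrix
    _, eigvecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    xi = eigvecs[:, -1]                             # eigenvector of the largest eigenvalue
    A = (X_hat @ xi).reshape(n, h, w)               # A^n(i, j) = xi^T x_hat^n(i, j)
    return -A if A.sum() < 0 else A                 # eigenvector sign is arbitrary
```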

image.png

Visualization results

Note that the resulting attention maps are grayscale and highly ambiguous. Before they can be integrated with the saliency prior map already obtained from EGNet, they need further processing; the paper uses denseCRF and manifold ranking for refinement. A sketch of the CRF step is given below.
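A hedged sketch of the denseCRF refinement using the common pydensecrf package (the kernel parameters are illustrative guesses, and the manifold-ranking step is omitted):

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, att, iters=5):
    """image: (H, W, 3) uint8; att: (H, W) attention map normalized to [0, 1].
    Returns the refined foreground probability map."""
    h, w = att.shape
    d = dcrf.DenseCRF2D(w, h, 2)                      # 2 labels: background / foreground
    probs = np.stack([1.0 - att, att]).astype(np.float32)
    d.setUnaryEnergy(unary_from_softmax(probs))       # unary term from the attention map
    d.addPairwiseGaussian(sxy=3, compat=3)            # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=5, rgbim=np.ascontiguousarray(image), compat=5)
    q = np.array(d.inference(iters))                  # (2, H*W) label marginals
    return q[1].reshape(h, w)                         # foreground probability
```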

Experimental Results

image.png

image.png

Experiments based on other SOD methods were also conducted.

Discussion and Suggestions

  • The good performance of SOD methods does not necessarily mean that the current datasets are not complex enough, or that directly applying SOD methods yields good performance: From the evaluation, we observe that, in most cases, the current SOD methods can obtain very competitive or even better performances than the CoSOD methods. However, this does not necessarily mean that the current datasets are not complex enough or that using the SOD methods directly can obtain good performance—the performances of the SOD methods on the CoSOD datasets are actually lower than those on the SOD datasets.

  • Many problems remain in CoSOD research: Consequently, the evaluation results reveal that many problems in CoSOD are still under-studied, and this makes the existing CoSOD models less effective.

    • Scalability: existing methods struggle to process larger groups of images at once. Reducing the computational cost caused by the number of images within a group is a key issue for practical applications.
    • Stability: some methods depend on the order of the samples within a group, which harms the stability of model performance (performance may change if the order changes or the sub-group partitioning varies). This limits practical applications.
    • Compatibility: this paper demonstrates the effectiveness of introducing SOD methods into a CoSOD framework, but achieving more time-efficient, end-to-end trainable detection remains a question worth studying.
    • Metrics: existing metrics mainly evaluate object predictions on single images, without considering the evaluation of object predictions across images.

Related Links

posted @ 2020-12-07 19:41 lart