Sparse Coding

Sparse Coding

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently.  —— 过完备的基,无监督  The aim of sparse coding is to find a set of basis vectors \mathbf{\phi}_i such that we can represent an input vector \mathbf{x} as a linear combination of these basis vectors:

\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} 
\end{align}
 
While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an
over-complete
set of basis vectors to represent input vectors
\mathbf{x}\in\mathbb{R}^n
(i.e. such that
k
>
n
).
The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data.
However, with an over-complete basis,
the coefficients ai are no longer uniquely determined
by the input vector
\mathbf{x}
. Therefore, in sparse coding, we introduce the additional criterion of sparsity t
o resolve the degeneracy introduced by over-completeness.

 

 

过完备的基能发掘出数据的内在结构和模型,但是其系数表示将不唯一

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients ai be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. 原子图像的叠加 Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.

We define the sparse coding cost function on a set of m input vectors as

\begin{align}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}

where S(.) is a sparsity cost function which penalizes ai for being far from zero.

稀疏惩罚项

We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of \mathbf{x}and the second term as a sparsity penalty which forces our representation of \mathbf{x} to be sparse. The constant λ is a scaling constant to determine the relative importance of these two contributions.

Although the most direct measure of sparsity is the "L0" norm (S(a_i) = \mathbf{1}(|a_i|>0)), it is non-differentiable (不可微) and difficult to optimize in general. In practice, common choices for the sparsity cost S(.) are the L1 penalty S(a_i)=\left|a_i\right|_1 and the log penalty S(a_i)=\log(1+a_i^2).

 

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down ai and scaling \mathbf{\phi}_i up by some large constant. To prevent this from happening, we will constrain \left|\left|\mathbf{\phi}\right|\right|^2 to be less than some constant C. The full sparse coding cost function including our constraint on \mathbf{\phi} is

\begin{array}{rc}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\\
\text{subject to}  &  \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k 
\\
\end{array}

 

Probabilistic Interpretation

 

到目前为止,我们所考虑的稀疏编码,是为了寻找到一个稀疏的、超完备基向量集,来覆盖我们的输入数据空间。现在换一种方式,我们可以从概率的角度出发,将稀疏编码算法当作一种“生成模型”。

我们将自然图像建模问题看成是一种线性叠加,叠加元素包括 k 个独立的源特征 \mathbf{\phi}_i 以及加性噪声 ν :

\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
\end{align}

我们的目标是找到一组特征基向量 \mathbf{\phi} ,它使得图像的分布函数 P(\mathbf{x}\mid\mathbf{\phi}) 尽可能地近似于输入数据的经验分布函数 P^*(\mathbf{x}) 。一种实现方式是,最小化 P^*(\mathbf{x})P(\mathbf{x}\mid\mathbf{\phi}) 之间的 KL 散度,此 KL 散度表示如下:

\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
\end{align}

因为无论我们如何选择 \mathbf{\phi} ,经验分布函数 P^*(\mathbf{x}) 都是常量,也就是说我们只需要最大化对数似然函数 P(\mathbf{x}\mid\mathbf{\phi}) 。 假设 ν 是具有方差σ2 的高斯白噪音,则有下式:

\begin{align}
P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)
\end{align}

为了确定分布 P(\mathbf{x}\mid\mathbf{\phi}) ,我们需要指定先验分布 P(\mathbf{a}) 。假定我们的特征变量是独立的,我们就可以将先验概率分解为:

\begin{align}
P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
\end{align}

此时,我们将“稀疏”假设加入进来——假设任何一幅图像都是由相对较少的一些源特征组合起来的。因此,我们希望 ai 的概率分布在零值附近是凸起的,而且峰值很高。一个方便的参数化先验分布就是:

\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}

这里 S(ai) 是决定先验分布的形状的函数。

当定义了 P(\mathbf{x} \mid \mathbf{a} , \mathbf{\phi}) P(\mathbf{a}) 后,我们就可以写出在由 \mathbf{\phi} 定义的模型之下的数据 \mathbf{x} 的概率分布:

\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}

那么,我们的问题就简化为寻找:

\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}

这里 < . > 表示的是输入数据的期望值。

不幸的是,通过对 \mathbf{a} 的积分计算 P(\mathbf{x} \mid \mathbf{\phi}) 通常是难以实现的。虽然如此,我们注意到如果 P(\mathbf{x} \mid \mathbf{\phi}) 的分布(对于相应的 \mathbf{a} )足够陡峭的话,我们就可以用 P(\mathbf{x} \mid \mathbf{\phi}) 的最大值来估算以上积分。估算方法如下:

\begin{align}
\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}

跟之前一样,我们可以通过减小 ai 或增大 \mathbf{\phi} 来增加概率的估算值(因为 P(ai) 在零值附近陡升)。因此我们要对特征向量 \mathbf{\phi} 加一个限制以防止这种情况发生。

最后,我们可以定义一种线性生成模型的能量函数,从而将原先的代价函数重新表述为:

\begin{array}{rl}
E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a}\right)P(\mathbf{a})) \\
 &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\end{array}

其中 λ = 2σ2β ,并且关系不大的常量已被隐藏起来。因为最大化对数似然函数等同于最小化能量函数,我们就可以将原先的优化问题重新表述为:

\begin{align}
\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\end{align}

使用概率理论来分析,我们可以发现,选择 L1 惩罚和 \log(1+a_i^2) 惩罚作为函数 S(.) ,分别对应于使用了拉普拉斯概率 P(a_i) \propto \exp\left(-\beta|a_i|\right) 和柯西先验概率 P(a_i) \propto \frac{\beta}{1+a_i^2}

 

Learning

 

 

 

 

 

 

 

 

 

 

使用稀疏编码算法学习基向量集的方法,是由两个独立的优化过程组合起来的。第一个是逐个使用训练样本 \mathbf{x} 来优化系数 ai ,第二个是一次性处理多个样本对基向量 \mathbf{\phi} 进行优化。

如果使用 L1 范式作为稀疏惩罚函数,对 a^{(j)}_i 的学习过程就简化为求解 由 L1 范式正则化的最小二乘法问题,这个问题函数在域 a^{(j)}_i 内为凸,已经有很多技术方法来解决这个问题(诸如CVX之类的凸优化软件可以用来解决L1正则化的最小二乘法问题)。如果 S(.) 是可微的,比如是对数惩罚函数,则可以采用基于梯度算法的方法,如共轭梯度法。

L2 范式约束来学习基向量,同样可以简化为一个带有二次约束的最小二乘问题,其问题函数在域 \mathbf{\phi} 内也为凸。标准的凸优化软件(如CVX)或其它迭代方法就可以用来求解 \mathbf{\phi},虽然已经有了更有效的方法,比如求解拉格朗日对偶函数(Lagrange dual)。

根据前面的的描述,稀疏编码是有一个明显的局限性的,这就是即使已经学习得到一组基向量,如果为了对新的数据样本进行“编码”,我们必须再次执行优化过程来得到所需的系数。这个显著的“实时”消耗意味着,即使是在测试中,实现稀疏编码也需要高昂的计算成本,尤其是与典型的前馈结构算法相比。

posted @ 2014-09-19 17:19  老姨  阅读(357)  评论(0)    收藏  举报