Singular Value Decomposition

The following content is compiled from Foundations of Data Science.

Consider each row of an \(n\times d\) matrix \(A\) as a point in \(d\)-dimensional space. The singular value decomposition finds the best-fitting \(k\)-dimensional subspace for \(k=1,2,3,\cdots\) for the set of \(n\) data points. Here, “best” means minimizing the sum of squares of the perpendicular distances of the points to the subspace, or equivalently, maximizing the sum of squares of the lengths of the projections of the points onto the subspace.

Consider the best-fit line through the origin, i.e., the best-fit \(1\)-dimensional subspace. Let \(v\) be a unit vector along this line. The length of the projection of \(a_i\), the \(i\)-th row of \(A\), onto \(v\) is \(|a_i\cdot v|\), so the sum of the squared lengths of the projections is \(|Av|^2\). With this in mind, define the first singular vector \(v_1\) of \(A\) as \(v_1=\arg\max\limits_{|v|=1}|Av|\) (breaking ties arbitrarily). The value \(\sigma_1(A)=|Av_1|=\max\limits_{|v|=1}|Av|\) is called the first singular value of \(A\).

Now we define the other singular vectors and singular values greedily. The second singular vector \(v_2\) is defined by the best-fit line perpendicular to \(v_1\), that is, \(v_2=\arg\max\limits_{v\perp v_1,|v|=1}|Av|\), and the value \(\sigma_2(A)=|Av_2|\) is called the second singular value of \(A\). The third singular vector and singular value are defined similarly by \(v_3=\arg\max\limits_{v\perp v_1,v_2,\,|v|=1}|Av|\) and \(\sigma_3(A)=|Av_3|\), and so on. The process stops when we have found singular vectors \(v_1,v_2,\cdots,v_r\) and singular values \(\sigma_1,\sigma_2,\cdots,\sigma_r\) with \(\max\limits_{v\perp v_1,\cdots,v_r,|v|=1}|Av|=0\).
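
As a quick numerical illustration (not from the text), the sketch below uses NumPy on a random matrix to check that the library's first right-singular vector attains the maximum in this greedy definition: no random unit vector achieves a larger \(|Av|\) than \(v_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))           # 50 points in 10-dimensional space

# First right-singular vector and first singular value from the library.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1, sigma1 = Vt[0], s[0]
print(np.isclose(np.linalg.norm(A @ v1), sigma1))   # True: sigma_1 = |A v_1|

# |Av| for many random unit vectors v never exceeds sigma_1.
V = rng.normal(size=(10, 100_000))
V /= np.linalg.norm(V, axis=0)
print(np.linalg.norm(A @ V, axis=0).max() <= sigma1 + 1e-12)  # True
```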

The following theorem shows the correctness of the greedy algorithm.

Theorem 3.1 Let \(A\) be an \(n\times d\) matrix with singular vectors \(v_1,v_2,\cdots,v_r\). For \(1\le k\le r\), let \(V_k\) be the subspace spanned by \(v_1,v_2,\cdots,v_k\). For each \(k\), \(V_k\) is the best-fit \(k\)-dimensional subspace for \(A\).

Proof: The statement is obviously true for \(k=1\). For \(k=2\), let \(W\) be a best-fit \(2\)-dimensional subspace for \(A\). Choose an orthonormal basis \((w_1,w_2)\) of \(W\) so that \(w_2\) is perpendicular to \(v_1\) (if \(v_1\) is perpendicular to \(W\), any orthonormal basis will do; otherwise take \(w_2\) perpendicular to the projection of \(v_1\) onto \(W\)). Since \(v_1\) maximizes \(|Av|^2\), it follows that \(|Aw_1|^2\le |Av_1|^2\). Since \(v_2\) maximizes \(|Av|^2\) over all \(v\) perpendicular to \(v_1\), it follows that \(|Aw_2|^2\le |Av_2|^2\). Thus \(|Aw_1|^2+|Aw_2|^2\le |Av_1|^2+|Av_2|^2\). Note that \(|Aw_1|^2+|Aw_2|^2\) is the sum of squared lengths of the projections of the rows of \(A\) onto \(W\), so \(V_2\) is at least as good as \(W\) and hence is also a best-fit \(2\)-dimensional subspace.

For general \(k\), proceed by induction. Suppose \(W\) is a best-fit \(k\)-dimensional subspace. Choose an orthonormal basis \(w_1,w_2,\cdots,w_k\) of \(W\) so that \(w_k\) is perpendicular to \(v_1,v_2,\cdots,v_{k-1}\). Since \(V_{k-1}\) is a best-fit \((k-1)\)-dimensional subspace by the induction hypothesis, it follows that \(|Aw_1|^2+\cdots+|Aw_{k-1}|^2\le |Av_1|^2+\cdots+|Av_{k-1}|^2\). Since \(v_k\) maximizes \(|Av|^2\) over all \(v\) perpendicular to \(v_1,\cdots,v_{k-1}\), it follows that \(|Aw_k|^2\le |Av_k|^2\). Thus \(|Aw_1|^2+\cdots+|Aw_{k}|^2\le |Av_1|^2+\cdots+|Av_{k}|^2\). Note that \(|Aw_1|^2+\cdots+|Aw_{k}|^2\) is the sum of squared lengths of the projections of the rows of \(A\) onto \(W\), so \(V_k\) is at least as good as \(W\) and hence is also a best-fit \(k\)-dimensional subspace. \(\square\)

Lemma 3.2 For any matrix \(A\), the sum of squares of the singular values equals the square of the Frobenius norm. That is, \(\sum \sigma_i^2(A)=\|A\|^2_F\).

Proof: \(\|A\|_F^2=\sum\limits_{j=1}^{n}|a_j|^2=\sum\limits_{j=1}^{n}\sum\limits_{i=1}^{r}(a_j\cdot v_i)^2=\sum\limits_{i=1}^{r}\sum\limits_{j=1}^{n}(a_j\cdot v_i)^2=\sum\limits_{i=1}^{r}|Av_i|^2=\sum\limits_{i=1}^{r}\sigma_i^2(A)\). \(\square\)
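
A minimal NumPy check of Lemma 3.2 on a random matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 8))

s = np.linalg.svd(A, compute_uv=False)                         # singular values of A
print(np.isclose(np.sum(s**2), np.linalg.norm(A, 'fro')**2))   # True
```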

The vectors \(v_1,v_2,\cdots,v_r\) are called the right-singular vectors. The vectors \(Av_i\) form a fundamental set of vectors, and we normalize them to length one by setting \(u_i=\frac{1}{\sigma_i(A)}Av_i\). Later we will show that \(u_i\) similarly maximizes \(|u^TA|\) over all \(u\) perpendicular to \(u_1,\cdots,u_{i-1}\); hence these \(u_i\) are called the left-singular vectors.
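
To make the definition \(u_i=\frac{1}{\sigma_i(A)}Av_i\) concrete, this sketch (again on a random matrix) recomputes the left-singular vectors from the right-singular vectors and checks that they are orthonormal, anticipating Theorem 3.7 below.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(40, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# u_i = (1 / sigma_i) * A v_i, assembled column by column.
U_from_v = (A @ Vt.T) / s

print(np.allclose(U_from_v, U))                            # matches the library's U
print(np.allclose(U_from_v.T @ U_from_v, np.eye(len(s))))  # columns are orthonormal
```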

Theorem 3.7 The left singular vectors are pairwise orthogonal.

Proof: Let \(i\) be the smallest integer such that \(u_i\) is not orthogonal to some other \(u_j\). Without loss of generality assume that \(u_i^Tu_j=\delta>0\) (if \(\delta <0\) just replace \(u_i\) with \(-u_i\)). Clearly \(j>i\) since \(i\) was selected to be the smallest such index. For \(\varepsilon>0\), let \(v_i'=\dfrac{v_i+\varepsilon v_j}{|v_i+\varepsilon v_j|}\), then \(Av_i'=\dfrac{\sigma_iu_i+\varepsilon\sigma_ju_j}{\sqrt{1+\varepsilon^2}}\) (noting that \(v_i'\) is a unit-length vector). Now for sufficiently small \(\varepsilon\), we have

\[u_i^T\left(\frac{\sigma_iu_i+\varepsilon\sigma_ju_j}{\sqrt{1+\varepsilon^2}}\right)=\frac{\sigma_i+\varepsilon\sigma_j\delta}{\sqrt{1+\varepsilon^2}}>(\sigma_i+\varepsilon\sigma_j\delta)\left(1-\frac{\varepsilon^2}{2}\right)=\sigma_i-\frac{\varepsilon^2}{2}\sigma_i+\varepsilon\sigma_j\delta-\frac{\varepsilon^3}{2}\sigma_j\delta>\sigma_i \]

Since \(|u_i|=1\), this shows \(|Av_i'|\ge u_i^TAv_i'>\sigma_i\). This is a contradiction, since \(v_i'\) is a unit vector orthogonal to \(v_1,v_2,\cdots,v_{i-1}\) and \(\sigma_i\) is defined to be the maximum of \(|Av|\) over such vectors. \(\square\)

Theorem 3.4 Let \(A\) be an \(n\times d\) matrix with right-singular vectors \(v_1,v_2,\cdots,v_r\), left-singular vectors \(u_1,u_2,\cdots,u_r\), and corresponding singular values \(\sigma_1,\sigma_2,\cdots,\sigma_r\), then

\[A=\sum_{i=1}^{r}\sigma_iu_iv_i^T \]

For \(k\in\{1,2,\cdots,r\}\), let \(A_k=\sum\limits_{i=1}^k\sigma_iu_iv_i^T\) be the sum truncated after \(k\) terms. It is clear that \(A_k\) has rank \(k\). We will show that \(A_k\) is the best rank-\(k\) approximation to \(A\), where the error is measured in the Frobenius norm.

Proof: For each \(j\), \(\left(\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T\right)v_j=\sigma_ju_j=Av_j\), since \(v_i^Tv_j\) is \(1\) for \(i=j\) and \(0\) otherwise. Any vector \(v\) can be expressed as a linear combination of the singular vectors plus a vector \(w\) perpendicular to all the \(v_i\); both \(A\) and \(\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T\) send \(w\) to zero (if \(Aw\neq 0\), the greedy process would not have stopped at \(r\)). Hence \(Av=(\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T)v\) for every \(v\), and so \(A=\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T\). \(\square\)

The decomposition \(A=\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T\) is called the singular value decomposition, SVD.
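
The rank-one expansion and its truncation \(A_k\) are easy to verify numerically; a small NumPy sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 12))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A equals the sum of the rank-one terms sigma_i * u_i * v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(A_rebuilt, A))        # True

# Truncating after k terms gives the rank-k matrix A_k.
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.linalg.matrix_rank(A_k))       # 3
```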

Theorem 3.6 For any matrix \(B\) of rank at most \(k\), \(\|A-A_k\|_F\le \|A-B\|_F\).

Proof: Let \(B\) minimize \(\|A-B\|_F^2\) among all matrices of rank at most \(k\), and let \(V\) be the space spanned by the rows of \(B\); its dimension is at most \(k\). With \(V\) fixed, \(\|A-B\|_F^2\) is minimized when each row of \(B\) is the projection of the corresponding row of \(A\) onto \(V\), so \(\|A-B\|_F^2\) is the sum of squared distances of the rows of \(A\) to \(V\). Since the rows of \(A_k\) are the projections of the rows of \(A\) onto \(V_k\), and \(V_k\) minimizes this sum of squared distances over all subspaces of dimension at most \(k\) (Theorem 3.1), it follows that \(\|A-A_k\|_F\le \|A-B\|_F\). \(\square\)
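
An empirical illustration of Theorem 3.6 (a sketch only; the random rank-\(k\) competitors tried here are of course far from exhaustive):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 25, 15, 4
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err_Ak = np.linalg.norm(A - A_k, 'fro')

# Try many random rank-k matrices B = C D; none should beat A_k.
errs_B = []
for _ in range(1000):
    B = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
    errs_B.append(np.linalg.norm(A - B, 'fro'))
print(err_Ak <= min(errs_B))            # True
```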

In addition to the Frobenius norm, there is another matrix norm of interest. Consider a large number of vectors, where for each vector \(x\) we wish to compute \(Ax\). It takes time \(O(nd)\) to compute each product \(Ax\); but if we approximate \(A\) by \(A_k\) and \(Ax\) by \(A_kx\), each product requires only \(k\) dot products of \(d\)-dimensional vectors, followed by a sum of \(k\) \(n\)-dimensional vectors, and takes time \(O(kd+kn)\). The error here is \(|(A_k-A)x|\), and we take the maximum of \(|(A_k-A)x|\) over all \(x\) with \(|x|\le 1\).

Formally, define \(\|A\|_2=\max\limits_{|x|\le 1}|Ax|\), which is called the \(2\)-norm or the spectral norm. Next we prove that \(A_k\) is the best rank \(k\), \(2\)-norm approximation to \(A\).
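
The cost argument is easy to see in code: \(A_kx\) can be formed from \(k\) \(d\)-dimensional dot products followed by a weighted sum of \(k\) \(n\)-dimensional vectors, without ever materializing \(A_k\). The last line also checks that the \(2\)-norm equals the top singular value, which is immediate from the definition of \(\sigma_1\). (The matrix sizes here are arbitrary choices for illustration.)

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 500, 300, 10
A = rng.normal(size=(n, d))
x = rng.normal(size=d)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A_k x without forming A_k: k dot products in d dimensions (Vt[:k] @ x),
# then a weighted sum of k n-dimensional columns of U -- O(kd + kn) work.
Akx = U[:, :k] @ (s[:k] * (Vt[:k] @ x))

A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.allclose(Akx, A_k @ x))                 # True

# The 2-norm (spectral norm) equals the first singular value.
print(np.isclose(np.linalg.norm(A, 2), s[0]))    # True
```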

Lemma 3.8 \(\|A-A_k\|_2^2=\sigma_{k+1}^2\).

Proof: Let \(A=\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T\) be the SVD of \(A\), then \(A_k=\sum\limits_{i=1}^{k}\sigma_iu_iv_i^T\) and \(A-A_k=\sum\limits_{i=k+1}^{r}\sigma_iu_iv_i^T\). Let \(v\) be the top singular vector of \(A-A_k\). Express \(v\) as a linear combination of \(v_1,v_2,\cdots,v_r\), that is, write \(v=\sum\limits_{j=1}^rc_jv_j\). Then

\[|(A-A_k)v|=\left|\sum_{i=k+1}^{r}\sigma_iu_iv_i^T\sum_{j=1}^{r}c_jv_j\right|=\left|\sum_{i=k+1}^{r}c_i\sigma_iu_iv_i^Tv_i\right|=\left|\sum_{i=k+1}^rc_i\sigma_iu_i\right|=\sqrt{\sum_{i=k+1}^{r}c_i^2\sigma_i^2} \]

This last quantity is maximized, subject to the constraint \(|v|^2=\sum\limits_{i=1}^{r}c_i^2=1\), when \(c_{k+1}=1\) and the rest of the \(c_i\) are zero. Thus \(\|A-A_k\|_2^2=\sigma_{k+1}^2\), proving the lemma. \(\square\)
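
A quick check of Lemma 3.8 on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# ||A - A_k||_2 equals sigma_{k+1}, which is s[k] with zero-based indexing.
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))   # True
```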

Theorem 3.9 Let \(A\) be an \(n\times d\) matrix. For any matrix \(B\) of rank at most \(k\), \(\|A-A_k\|_2\le \|A-B\|_2\).

Proof: Assume that \(A\) has rank greater than \(k\) (otherwise the theorem obviously holds). Since \(B\) has rank at most \(k\), its null space \(\text{null}(B)\), the set of vectors \(v\) such that \(Bv=0\), has dimension at least \(d-k\); since \(\text{span}(v_1,v_2,\cdots,v_{k+1})\) has dimension \(k+1\), the two subspaces intersect in some \(z\neq 0\), where \(v_1,v_2,\cdots,v_{k+1}\) are the first \(k+1\) singular vectors of \(A\). Scaling \(z\) to length one, we have \(\|A-B\|_2^2\ge |(A-B)z|^2=|Az|^2\). Since \(z\) is in \(\text{span}(v_1,v_2,\cdots,v_{k+1})\),

\[|Az|^2=\left|\sum_{i=1}^{r}\sigma_iu_iv_i^Tz\right|^2=\sum_{i=1}^{r}\sigma_i^2(v_i^Tz)^2=\sum_{i=1}^{k+1}\sigma_i^2(v_i^Tz)^2\ge \sigma_{k+1}^2\sum_{i=1}^{k+1}(v_i^Tz)^2=\sigma_{k+1}^2 \]

it follows that \(\|A-B\|_2^2\ge \sigma_{k+1}^2=\|A-A_k\|_2^2\), proving the theorem. \(\square\)

Next we discuss how to compute the SVD. Here we present an “in-principle” method establishing that an approximate SVD of a matrix \(A\) can be computed in polynomial time. The method, called the power method, is simple and is in fact the conceptual starting point for many algorithms.

Let \(A\) be a matrix whose SVD is \(\sum_i\sigma_iu_iv_i^T\), and let \(B=A^TA\). By the orthonormality of the \(u_i\)’s we have \(B=\sum_i\sigma_i^2v_iv_i^T\), so \(B\) is square and symmetric and has the same left and right-singular vectors. In particular, \(Bv_j=\sigma_j^2v_j\), so \(v_j\) is an eigenvector of \(B\) with eigenvalue \(\sigma_j^2\). By the orthonormality of the \(v_i\)’s we have \(B^k=\sum_i\sigma_i^{2k}v_iv_i^T\). If \(\sigma_1>\sigma_2\), the first term dominates as \(k\) grows: \(\frac{1}{\sigma_1^{2k}}B^k\to v_1v_1^T\). This means a close estimate of \(v_1\) can be computed by simply taking the first column of \(B^k\) and normalizing it to a unit vector.

A problem with the above method is that \(A\) may be a very large, sparse matrix, say a \(10^8\times 10^8\) matrix with \(10^9\) nonzero entries. Sparse matrices are often represented by just a list of their nonzero entries. Though \(A\) is sparse, \(B\) need not be, and in the worst case all \(10^{16}\) of its entries may be nonzero, making it impossible even to write down \(B\). Thus a more efficient method is needed.

Instead of computing \(B^k\), select a random vector \(x\) and compute the product \(B^kx\). The vector \(x\) can be expressed in terms of the singular vectors of \(B\), augmented to a full orthonormal basis, as \(x=\sum\limits_{i=1}^{d}c_iv_i\). Then \(B^kx\approx (\sigma_1^{2k}v_1v_1^T)(\sum\limits_{i=1}^{d}c_iv_i)=\sigma_1^{2k}c_1v_1\), so normalizing the resulting vector yields an estimate of \(v_1\). The product \(B^kx\) is computed by a series of matrix-vector products instead of matrix-matrix products: \(B^kx=A^TA\cdots A^TAx\) can be evaluated right to left, which amounts to \(2k\) multiplications of a vector by a sparse matrix.
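
A minimal sketch of this scheme in NumPy/SciPy; the sparse test matrix, the helper name `top_right_singular_vector`, and the iteration count are arbitrary choices for illustration, and \(B=A^TA\) is never formed.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# A large-ish sparse test matrix; only its nonzero entries are stored.
A = sparse_random(2000, 1500, density=0.001, format='csr', random_state=7)

def top_right_singular_vector(A, iters=100, seed=0):
    """Power method: apply B = A^T A to a random start vector using two
    sparse matrix-vector products per iteration, never forming B."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A.T @ (A @ x)        # B x, evaluated right to left
        x /= np.linalg.norm(x)   # keep the iterate at unit length
    return x

v1_est = top_right_singular_vector(A)
print(np.linalg.norm(A @ v1_est))    # approximates the top singular value sigma_1
```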

An issue occurs if there is no significant gap between the first and second singular values. We will overcome this hurdle: Theorem 3.11 below states that even with ties, the power method converges to some vector in the span of those singular vectors corresponding to the “nearly highest” singular values.

Theorem 3.11 Let \(A\) be an \(n\times d\) matrix and \(x\) a unit-length vector in \(\mathbf R^d\) with \(|x^Tv_1|\ge\delta\), where \(\delta >0\). Let \(V\) be the space spanned by the right-singular vectors of \(A\) corresponding to singular values greater than \((1-\varepsilon)\sigma_1\). Let \(w\) be the unit vector after \(k=\frac{\ln(1/\varepsilon\delta)}{2\varepsilon}\) iterations of the power method, namely,

\[w=\frac{(A^TA)^kx}{\left|(A^TA)^kx\right|} \]

then \(w\) has a component of at most \(\varepsilon\) perpendicular to \(V\).

Proof: Let \(A=\sum\limits_{i=1}^{r}\sigma_iu_iv_i^T\) be the SVD of \(A\). If the rank of \(A\) is less than \(d\), then for convenience complete \(\{v_1,v_2,\cdots,v_r\}\) into an orthonormal basis \(\{v_1,v_2,\cdots,v_d\}\) of \(d\)-space. Write \(x\) in the basis of \(v_i\)s as \(x=\sum\limits_{i=1}^{d}c_iv_i\). Since \((A^TA)^k=\sum\limits_{i=1}^{d}\sigma_i^{2k}v_iv_i^T\), it follows that \((A^TA)^{k}x=\sum\limits_{i=1}^{d}\sigma_i^{2k}c_iv_i\). By hypothesis, \(|c_1|\ge \delta\).

Suppose that \(\sigma_1,\sigma_2,\cdots,\sigma_m\) are the singular values of \(A\) that are greater than or equal to \((1-\varepsilon)\sigma_1\) and that \(\sigma_{m+1},\cdots,\sigma_d\) are the singular values that are less than \((1-\varepsilon)\sigma_1\). Now

\[\left|(A^TA)^kx\right|^2=\left|\sum_{i=1}^{d}\sigma_i^{2k}c_iv_i\right|^2 =\sum_{i=1}^d\sigma_i^{4k}c_i^2\ge \sigma_1^{4k}c_1^2\ge \sigma_1^{4k}\delta^2 \]

The squared length of the component of \((A^TA)^kx\) perpendicular to \(V\) is

\[\sum_{i=m+1}^{d}\sigma_i^{4k}c_i^2\le (1-\varepsilon)^{4k}\sigma_1^{4k}\sum_{i=m+1}^{d}c_i^2\le (1-\varepsilon)^{4k}\sigma_1^{4k} \]

since \(\sum_{i=1}^{d}c_i^2=|x|^2=1\). Thus the component of \(w\) perpendicular to \(V\) has squared length at most \(\dfrac{(1-\varepsilon)^{4k}\sigma_1^{4k}}{\sigma_{1}^{4k}\delta^2}=\left(\dfrac{(1-\varepsilon)^{2k}}{\delta}\right)^2\le \left(\dfrac{e^{-2k\varepsilon}}{\delta}\right)^2=\varepsilon^2\) since \(k=\dfrac{\ln(1/\varepsilon\delta)}{2\varepsilon}\), and so its length is at most \(\varepsilon\). \(\square\)
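
The theorem's iteration count can be tried out numerically; a sketch under the assumption of a random dense matrix, with \(\delta\) read off from the actual start vector:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(200, 100))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eps = 0.1
x = rng.normal(size=100)
x /= np.linalg.norm(x)
delta = abs(x @ Vt[0])                           # |x^T v_1| for this start vector

# k = ln(1/(eps*delta)) / (2*eps) iterations of the power method.
k = int(np.ceil(np.log(1.0 / (eps * delta)) / (2 * eps)))
w = x.copy()
for _ in range(k):
    w = A.T @ (A @ w)
    w /= np.linalg.norm(w)

# V = span of right-singular vectors with sigma_i > (1 - eps) * sigma_1.
m = int(np.sum(s > (1 - eps) * s[0]))
perp = w - Vt[:m].T @ (Vt[:m] @ w)               # component of w perpendicular to V
print(np.linalg.norm(perp) <= eps)               # True, as Theorem 3.11 guarantees
```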

We will see in Lemma 3.12 that a random vector satisfies the condition in Theorem 3.11 with fairly high probability.

Lemma 3.12 Let \(y\in \mathbf R^d\) be a random vector with the unit-variance spherical Gaussian as its probability density. Normalize \(y\) to unit length by setting \(x=y/|y|\). Let \(v\) be any unit vector. Then

\[\Pr(|x^Tv|\le \frac{1}{20\sqrt d})\le \frac1{10}+3e^{-d/96} \]

Proof: If \(|y|\le 2\sqrt d\) and \(|y^Tv|>\frac1{10}\), then \(|x^Tv|>\frac{1}{20\sqrt d}\), so it suffices to prove \(\Pr(|y|\ge 2\sqrt d)\le 3e^{-d/96}\) and \(\Pr(|y^Tv|\le \frac{1}{10})\le \frac1{10}\). The former follows from the Gaussian Annulus Theorem with \(\sqrt d\) substituted for \(\beta\). The latter follows from the fact that \(y^Tv\) is a zero-mean, unit-variance Gaussian random variable whose density is at most \(\frac{1}{\sqrt{2\pi}}\) on the interval \([-\frac1{10},\frac1{10}]\), so the integral of the density over this interval is at most \(\frac1{10}\). \(\square\)
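
A Monte Carlo check of Lemma 3.12 (the constants in the bound are loose, so the empirical frequency should fall well below it):

```python
import numpy as np

rng = np.random.default_rng(9)
d, trials = 300, 100_000

v = np.zeros(d)
v[0] = 1.0                              # any fixed unit vector
Y = rng.normal(size=(trials, d))        # rows are spherical Gaussian samples
X = Y / np.linalg.norm(Y, axis=1, keepdims=True)

freq = np.mean(np.abs(X @ v) <= 1 / (20 * np.sqrt(d)))
bound = 1 / 10 + 3 * np.exp(-d / 96)
print(freq, bound, freq <= bound)       # empirical frequency vs. the lemma's bound
```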
