Note -「Math. in Info. Age」"Hypersphere is Star!"
Properties: the unit ball in high-dimensional space
Question 1
When trying to sample a point within a unit ball:
- How to sample it?
- What's the (expected) norm of this random vector?
We highlight the intuition that "probability" has a lot to do with "volume". Analytically,
Or there is a convenient variation:
We first focus on sampling a point on the surface of the ball. We are familiar with the algorithm below:
- Sample \(x_i\) from \(\mathcal N(0,1)\), and normalize \((\seq x1d)\) to yield a random point.
The uniformity of the direction is easy to prove (intuitively by entropy, or by a little calculation).
Since the direction and the norm of the vector are independent, the problem reduces to sampling the norm properly. Formally, the sampling algorithm is as follows:
- Sample \(r\) from a distribution \(P(r)\) (to be determined);
- Sample \(\bs x\) on the surface of a unit ball;
- Yield \(r\bs x\).
When sampling \(r\), it is required to sample \(r\in[0,1]\) s.t. the density of \(r\) is proportional to \(r^{d-1}\). That is, \(P(r)=dr^{d-1}~(r\in[0,1])\). To achieve this, we can pick \(s\sim\opn{Unif}([0,1])\) and let \(r=\sqrt[d]{s}\).
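A minimal NumPy sketch of this two-step sampler (function and variable names are mine, not from the notes):

```python
import numpy as np

def sample_unit_ball(d, rng=np.random.default_rng()):
    """Sample a point uniformly from the d-dimensional unit ball."""
    g = rng.standard_normal(d)          # spherical Gaussian
    x = g / np.linalg.norm(g)           # uniform direction on the sphere
    r = rng.uniform() ** (1.0 / d)      # radius with density d * r^(d-1)
    return r * x

# The norm of a random point concentrates near 1 as d grows:
d = 100
norms = np.array([np.linalg.norm(sample_unit_ball(d)) for _ in range(1000)])
print(norms.mean())                     # close to d / (d + 1)
```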
Question 2
What's the relation between \(V(d)\) and \(A(d)\)?
Here, integrating over spherical shells gives \(V(d)=\int_0^1 A(d)r^{d-1}\d r=\frac{A(d)}{d}\).
Question 3
Find \(I:=\int_{-\oo}^{+\oo}\e^{-x^2}\d x\).
A classical problem: square \(I\) and switch to polar coordinates, so that \(I^2=\int_{\R^2}\e^{-(x^2+y^2)}\d x\d y=2\pi\int_0^{+\oo}r\e^{-r^2}\d r=\pi\).
Therefore, \(I=\sqrt\pi\).
Going one step further, this can be used to derive the formula for the volumes of \(d\)-balls:
which gives
It remains to compute the \(J_d\)'s.
Question 4
Find \(I_n:=\int_0^{+\oo}x^n\e^{-x}\d x\).
It gives factorials (\(I_n=n!\)), motivating the Gamma function \(\Gamma(t+1):=\int_0^{+\oo}x^t\e^{-x}\d x\) as a generalization.
Back to \(J_d\)s:
Therefore, \(A(d)=\frac{2\pi^{d/2}}{\Gamma(d/2)}\) and \(V(d)=\frac{2\pi^{d/2}}{d\Gamma(d/2)}\).
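As a quick sanity check on these formulas, a small Python snippet (using `math.gamma`) shows that \(V(d)\) peaks around \(d=5\) and then tends to \(0\):

```python
import math

def ball_volume(d):
    """V(d) = 2 * pi^(d/2) / (d * Gamma(d/2)), volume of the d-dimensional unit ball."""
    return 2 * math.pi ** (d / 2) / (d * math.gamma(d / 2))

for d in (1, 2, 3, 5, 10, 20, 50):
    print(d, ball_volume(d))   # peaks near d = 5, then decays toward 0
```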
Next, we illustrate that in high-dimensional space the mass of a ball is concentrated near its equator.
The reason is straightforward: we may compute the volume (mass) by integrating along the first axis, where the rapidly decaying cross-sectional factor \((1-x_1^2)^{\frac{d-1}{2}}\) causes the phenomenon. We first upper-bound the volume of the cap above height \(h\):
For the whole half-ball, we may lower-bound its volume by the volume of a cylinder inscribed in it. If the height of the cylinder is \(h'\), then
Pick \(h'=\frac{1}{\sqrt{d-1}}\) for the tightest bound:
and set \(h=\frac{c}{\sqrt{d-1}}\),
Therefore,
Theorem
If \(\seq{\bs x}1n\) are sampled uniformly and independently from the interior of a unit ball, then with probability \(p\ge 1-\frac{1}{n}\):
(a) \(\A i,~\|\bs x_i\|\ge 1-\frac{2\ln n}{d}\);
(b) \(\A i\neq j,~|\bs x_i\cdot \bs x_j|<\sqrt{\frac{6\ln n}{d-1}}\).
→ Proof. Use the estimates above and take a union bound.
Look back at the property \(\lim_{d\to\oo}V(d)=0\). Define the box of a \(d\)-unit ball as the intersection of its \(d\) equatorial slabs. Then, when sampling \(\bs x\) in the ball:
This contradicts our intuition: such a box should be quite small, since the mass is shown to concentrate near the surface of the ball. Why is that?
The derivation is correct, so let's just change our intuition: a proper 2D visualization of a high-dimensional ball should be a "star" (picture it in your mind) instead of a naive 2D disk.
Gaussian Annulus Theorem
For \(\bs x\) drawn from the \(d\)-dimensional standard (spherical) Gaussian and all \(\beta\le\sqrt{d}\), with probability at least \(1-3\e^{-c\beta^2}\), \(|\bs x|\in[\sqrt{d}-\beta,\sqrt{d}+\beta]\), where \(c\) is a fixed and computable constant.
Johnson-Lindenstrauss's Algorithm (JL algo.)
For \(n\) points \(\seq{\bs v}1n\) in \(\R^d\) and \(\eps\in(0,1)\), there exists an algorithm that constructs a map \(f:\R^d\to\R^k\) with \(k\ge\frac{3\ln n}{c\eps^2}\) (\(c\) is the constant mentioned above), such that with probability at least \(1-\frac{3}{2n}\), \(f\) is \(\eps\)-pairwise-distance-preserving, i.e.
\[\A i\neq j,~(1-\eps)\|\bs v_i-\bs v_j\|\le\|f(\bs v_i)-f(\bs v_j)\|\le(1+\eps)\|\bs v_i-\bs v_j\|. \]
→ Proof. Sample \(\seq{\bs u}1k\) independently from the \(d\)-dimensional spherical Gaussian, and set \(f':\bs v\mapsto\pmat{\bs v\cdot\bs u_1&\cdots&\bs v\cdot \bs u_k}^\T\). For a fixed \(\bs v\), this gives a spherical Gaussian vector in \(\R^k\) scaled by \(\|\bs v\|\). Therefore a unit vector \(\bs v\) is expected to be mapped to a vector of norm about \(\sqrt k\); that is, each distance vector \(\bs v_i-\bs v_j\) is expected to be scaled by \(\sqrt{k}\) (confirmed by the Gaussian annulus theorem and a union bound), independently of the vector itself. Hence \(f=\frac{1}{\sqrt k}f'\) meets our desire with high probability (which could be quantified with a little calculation).
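A minimal NumPy sketch of this random-projection map (the Gaussian variant; names are mine):

```python
import numpy as np

def jl_map(V, k, rng=np.random.default_rng()):
    """Project the rows of V (n x d) to k dimensions with a random Gaussian map."""
    d = V.shape[1]
    U = rng.standard_normal((d, k))     # columns play the role of u_1 .. u_k
    return V @ U / np.sqrt(k)           # f = f' / sqrt(k)

rng = np.random.default_rng(0)
V = rng.standard_normal((50, 10_000))
W = jl_map(V, k=1_000, rng=rng)
# pairwise distances are preserved up to a small relative error
i, j = 0, 1
print(np.linalg.norm(V[i] - V[j]), np.linalg.norm(W[i] - W[j]))
```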
Singular Value Decomposition
We are interested in "compressing" a high-rank matrix into a low-rank one while preserving its "information" as much as possible. To measure the error, we use the Frobenius norm:
That is, for a matrix \(A\), we want to find \(B\) with \(\rank B\le k\) that minimizes \(\|A-B\|_F\).
A simpler version first: find the best unit vector that minimizes the sum of squared distances to a batch of vectors (stacked as row vectors, they form a matrix \(A\)). Since the lengths of all the vectors are fixed, minimizing this sum is equivalent to maximizing the sum of squared norms of the projections onto this unit vector, i.e. maximizing \(\|Av\|_2^2\).
The first singular value is defined as \(\sigma_1:=\max_{\|v\|_2=1}\|Av\|_2\),
and a maximizer of it is denoted \(v_1\) (there may be several, even up to sign).
Exercise 1 (Best Rank-One Approximation)
For the first singular vector \(v_1\) of \(A\), \(\|A-Av_1v_1^\T\|_F\le \|A-B\|_F\) for any \(B\) with \(\rank B=1\).
→ Proof. \(Av_1v_1^\T\) is the matrix whose row vectors are the projections of \(A\)'s row vectors onto \(v_1\). The rest of the argument follows by applying the Pythagorean theorem row by row.
Let's go one step further, from the best-fit line (1D) to the best-fit plane (2D). Intuitively, we come up with a greedy algorithm: find \(\sigma_1\) and \(v_1\), and then find \(v_2=\arg\max_{|v|=1\land v\perp v_1}|Av|\). (We write \(\|\cdot\|_2\) as \(|\cdot|\) for vectors for simplicity.)
Why is it correct? For any orthonormal \(w_1\perp w_2\), we can rotate \((w_1,w_2)\) within the plane they span so that \(w_2\perp v_1\). Now \(|Av_1|\ge |Aw_1|\) and \(|Av_2|\ge |Aw_2|\). (Despite its conciseness, I think this proof is very tricky and worth remembering.)
Exercise 2
\(Av_1\perp Av_2\) where \(v_1\) and \(v_2\) are singular vectors of \(A\).
→ Proof. Consider the following (\((\cdot\mid\cdot)\) denotes the inner product of two vectors; this convention comes from lww):
if \((Av_1\mid Av_2)\neq 0\), then
This implies that \(v_1\) is not the maximizer of \(|Av|\), raising a contradiction.
Continuing this process, we can pick an orthonormal basis of the space, \(\seq v1d\), and
where \(\seq u1{d'}\) also form an orthonormal basis. A well-known matrix form of this is \(A=U\Sigma V^\T\).
The last step is to compute \(\Sigma\) and \(V\). Given that
we know \(A^\T Av_i=\sigma_i^2 v_i\), and \(A^\T A\), being positive semi-definite, can always be orthogonally diagonalized.
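A small NumPy sketch of this route to the SVD, computing \(V\) and \(\Sigma\) from the eigendecomposition of \(A^\T A\) and then \(U=AV\Sigma^{-1}\) (assuming full column rank; names are mine):

```python
import numpy as np

def svd_via_gram(A):
    """Recover U, Sigma, V from the eigendecomposition of A^T A (assumes full column rank)."""
    lam, V = np.linalg.eigh(A.T @ A)          # ascending eigenvalues, orthonormal columns
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    sigma = np.sqrt(np.clip(lam, 0, None))    # singular values sigma_i = sqrt(lambda_i)
    U = A @ V / sigma                         # u_i = A v_i / sigma_i
    return U, sigma, V

A = np.random.default_rng(0).standard_normal((6, 4))
U, s, V = svd_via_gram(A)
print(np.allclose(A, U @ np.diag(s) @ V.T))   # True
```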
Review list for midterm exam
Machine Learning
We will consider the balance between training error reduction and generalization ability.
Linear Classifier
For \(\dat x_i\in\R^2\), a linear classifier \(h_{\dat w,t}(\dat x)\) splits the plane into a positive and a negative half-plane. (\(\dat a::\dat b\) means concatenating \(\dat a\in\R^a\) and \(\dat b\in \R^b\) to form a vector \(\dat c\in\R^{a+b}\).)
We simply treat \(\dat w\) as \((\dat w::-t)\in\R^3\) and \(\dat x\) as \(\dat x::1\). If there exists a perfect classifier that classifies all training data correctly, i.e. \(\A i,~h(\dat x_i)y_i>0\), then (after rescaling) there is a \(\dat w^*\) s.t. \(h_{\dat w^*}(\dat x_i)y_i\ge 1\).
If \(\dat w^*\) exists, we have an iterative algorithm (the perceptron algorithm) to find a perfect classifier. Let \(\dat w_0=\bs 0\); while there is an \(\dat x_i\) s.t. \((\dat w_t\mid \dat x_i)y_i\le 0\), set \(\dat w_{t+1}\gets\dat w_t+y_i\dat x_i\). Note that
So \((\dat w_t\mid\dat w^*)\) increases monotonically, by at least \(1\) per update. Besides, if \(\|\dat x_i\|\le r\) holds for all \(i\),
i.e. \(\|\dat w_{t+1}\|^2\le\|\dat w_t\|^2+r^2\). Let \(m\) be the time of updates (mistakes), then
This upper bound implies that our model is easier to train when the positive and negative samples are separated by a large margin, in which case \(\|\dat w^*\|\) can be small.
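A minimal Python sketch of this update rule (assuming the bias is already appended to each \(\dat x_i\) as described above; names are mine):

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """X: n x d data (with a trailing 1-column for the bias), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = np.flatnonzero((X @ w) * y <= 0)
        if mistakes.size == 0:
            return w                     # perfect classifier found
        i = mistakes[0]
        w += y[i] * X[i]                 # w_{t+1} = w_t + y_i x_i
    return w                             # data may not be separable
```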
To work in a high-dimensional feature space, we may use a "kernel function" \(\lang \dat x_i,\dat x_j\rang^*\) for which a feature map \(\varphi\) exists s.t. \(\lang\dat x_i,\dat x_j\rang^*=(\varphi(\dat x_i)\mid\varphi(\dat x_j))\).
E.g. \(\lang\dat x_i,\dat x_j\rang^*=\e^{-k\|\dat x_i-\dat x_j\|}\).
Now, assuming our training set is sampled from a distribution \(D\), we define the true error of model \(h\) as
Let \(\mathcal H\) be the (finite) set of all possible hypotheses \(h\). Define the bad event \(B:\E h\in\mathcal H,~\text{err}_D(h)>\eps\land\text{err}_S(h)=0\). To make \(\Pr(B)<\delta\), apply a union bound:
This gives the desired size of the sample set \(S\).
Theorem 1
For \(\eps>0\) and \(0<\delta<1\), \(\Pr[|\text{err}_S(h)-\text{err}_D(h)|<\eps]\ge 1-\delta\) when
\[|S|>\frac{1}{2\eps^2}\br{\ln|\mathcal H|+\ln\frac{2}{\delta}}. \]
And when \(|S|>\cdots\), we know, with high probability,
The second term motivates the regularization penalty used in model training.
But... what do we do with a continuous (infinite) \(\mathcal H\)?
Definition 1
We say \((X,\mathcal H)\) shatters \(A\sub X\), iff
\[\A S\sub A,~\E h\in\mathcal H,~S=h\cap A, \]where binary hypothesis \(h\) is seen as a subset of \(X\).
Definition 2 (VC Dimension)
We say \(\opn{VC}(X,\mathcal H)=d\), iff
\[\CAS{ \E A,~|A|=d\land A~\text{shattered by}~\mathcal H;\\ \A A~\text{s.t.}~|A|>d,~A~\text{not shattered by}~\mathcal H. } \]
For example \(\opn{VC}(\R,\{\text{intervals}\})=2\).
Proposition 1
\(\mathcal H:=\{\text{half-spaces in}~\R^d\}\), \(X:=\R^d\); then \(\opn{VC}(X,\mathcal H)=d+1\).
→ Proof. \(A=\{\bs 0,\pmat{1&0&0&\cdots},\pmat{0&1&0&\cdots},\cdots\}\) (the origin together with the \(d\) standard basis vectors) is a shattered set of size \(d+1\). For any \(d+2\) points \(\seq{\dat x}1{d+2}\), a half-space can be expressed as a vector \(\dat h\in\R^{d+1}\), where
and we can treat \(\dat x_i\) as a positive sample iff \(c_i\ge 0\). We can always construct a nonzero \(\dat c\in\R^{d+2}\) s.t.
In this case, \(m:=\sum_{c_i\ge 0}c_i=\sum_{c_j<0}(-c_j)\), but
is a point in both the convex hull of the positive sample points and that of the negative ones, which should not exist (a separating half-space would separate the two hulls).
The same conclusion holds for spheres in \(\R^d\).
Definition 3 (Growth Function)
For a set system \((X,\mathcal H)\), we define
\[\pi_{\mathcal H}(n):=\max_{|S|=n}|\{h\cap S:h\in\mathcal H\}|. \]
A key property of the growth function:
If we define \(\mathcal H_1\cap\mathcal H_2:=\{h_1\cap h_2:h_1\in \mathcal H_1,h_2\in\mathcal H_2\}\), then
Now we revisit Theorem 1 of this chapter, this time handling infinite \(\mathcal H\).
Theorem 2
For domain \(X\), sample distribution \(D\) and hypothesis set \(\mathcal H\), when \(|S|=\mathcal O\br{\frac{1}{\eps}\br{\opn{VC}(X,\mathcal H)\log\frac{1}{\eps}+\log\frac{1}{\delta}}}\), with probability at least \(1-\delta\), every \(h\) with \(\text{err}_S(h)=0\) satisfies \(\text{err}_D(h)<\eps\).
→ Proof. For a sampled \(S\), let event
where \(S'\) is sampled like \(S\). When \(n=|S'|=|S|>\frac{8}{\eps}\), \(\Pr[B\mid A]\ge\frac{1}{2}\). With this bound,
We may bound \(\Pr(B)\) instead. Let's rephrase \(B\) as "sample \(S\cup S'\) first, randomly partition it into \(S\) and \(S'\), and finally check the condition". ..........
Online Learning
Now we turn to online learning, where data arrive as a stream. For example, for stock market prediction, we want to design an algorithm that computes the proper decision for day \(t+1\), according to the experts' advice history of days \([1:t+1]\) and the ground truth of days \([1:t]\), so as to reduce the accumulated loss.
As a benchmark, we consider #mistakes we make versus #mistakes the optimal (in hindsight) expert makes up to time \(T\), and try to minimize the gap between them. Here comes a strategy (a code sketch follows the list):
- Assign weight \(1\) to each expert;
- Halve (i.e. \({}\x\alpha=\frac{1}{2}\)) the weights of the experts who made a mistake;
- Choose the weighted majority for current day.
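A minimal Python sketch of this weighted-majority strategy (the data layout `advice[t][e]` and all names are my assumptions):

```python
def weighted_majority(advice, truth, alpha=0.5):
    """advice[t][e] and truth[t] are in {-1, +1}; returns our predictions."""
    n_experts = len(advice[0])
    w = [1.0] * n_experts
    predictions = []
    for t, outcome in enumerate(truth):
        vote = sum(w[e] * advice[t][e] for e in range(n_experts))
        guess = 1 if vote >= 0 else -1   # weighted majority vote
        predictions.append(guess)
        for e in range(n_experts):
            if advice[t][e] != outcome:
                w[e] *= alpha            # shrink the weights of wrong experts
    return predictions
```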
Note that each time we make a mistake, the sum of weights drops to at most \(s\x0.75\) (at least half of the total weight was on wrong experts), while \(s\ge 2^{-m}\), where \(m\) is #mistakes the optimal (in hindsight) expert made. So
We may adjust \(\alpha\) to get a tighter bound.
In an antagonistic setting, we can smooth out the worst case by sampling our decision in proportion to the experts' weights, instead of choosing the majority directly.
Given the ground truth, say \(\text{N}\), let the probability that we make a mistake be \(F_t\); then
which does not depend on the actual value of the ground truth, and
Therefore,
Nash Equilibrium
Here we just omit some preparatory description. The key is
where \(\u{LHS}\ge\u{RHS}\) is obvious, while the other direction is rather tricky. We want to connect this to the online learning algorithm: intuitively, online learning offers a way to eliminate the "first-player disadvantage", so we can apply it to bound \(\u{LHS}\le\u{RHS}+\eps\) for arbitrarily small \(\eps>0\), hence \(\u{LHS}\le\u{RHS}\).
Our algorithm to approximate the best strategy for loss matrix \(V_{n\x m}\) is as follows (a code sketch follows the list):
- \(w_1(i)\gets 1\) for \(i=1..n\);
- For \(t=1..T\):
- Sample \(i_t\) to be \(i\) w.p. proportional to \(w_t(i)\), \(\bs p_t\gets\u{onehot}(i_t)\);
- \(j_t\gets\arg\max_j V_{i_tj}\), \(\bs q_t\gets\u{onehot}(j_t)\);
- \(w_{t+1}(i_t)\gets w_t(i_t)\e^{-\eps V_{i_tj_t}}\).
- \(\bs p=\frac{1}{T}\sum\bs p_t\), \(\bs q=\frac{1}{T}\sum \bs q_t\).
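A minimal NumPy sketch in the spirit of the procedure above (names are mine). It uses the standard multiplicative-weights variant: the row player plays the full weight distribution \(\bs p_t\) and every row is updated against \(j_t\), which differs slightly from the sampled one-hot \(\bs p_t\) in the pseudocode:

```python
import numpy as np

def approx_equilibrium(V, T=2_000, eps=0.1):
    """Approximate the row player's optimal mixed strategy for loss matrix V (n x m)."""
    n, m = V.shape
    w = np.ones(n)
    p_sum, q_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        j = np.argmax(p @ V)             # column player's best response
        q_sum[j] += 1
        p_sum += p
        w *= np.exp(-eps * V[:, j])      # penalize each row by its loss
    return p_sum / T, q_sum / T
```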
With the online-learning bound above, we can show that
Boosting
Let's consider a binary classification problem here. The datapoints are \(\{(\dat x_i,y_i)\}_{i=1}^n\).
Weak learner: a learner whose accuracy under the given distribution is \(\ge\frac{1}{2}+\gamma\), where \(\gamma\) is a small positive number.
Assume we have an oracle \(\mathscr O\) who takes a distribution \(D\) over \(\seq{\dat x}1n\) and yields a weak learner \(h_D\). Our target is to construct a perfect learner with \(\mathscr O\).
- \(w_1(i)\gets 1\) for \(i=1..n\);
- For \(t=1..T\):
- \(h_t\gets\mathscr O\br{\br{\frac{w_t(i)}{\sum w_t(i)}}_i}\);
- For \(i=1..n\):
- If \(h_t(i)\neq y_i\), \(w_{t+1}(i)\gets w_t(i)\x\alp\) (\(\alp\) is a constant factor to choose).
- \(h\gets\opn{maj}\{\seq h1T\}\).
If \(h\) makes a mistake on some datapoint, the weight of that point is \(\ge\alpha^{\frac{T}{2}}\), while the total weight is \(S\le n(\alpha(0.5-\gamma)+(0.5+\gamma))^T\). Set \(\alpha=\frac{1+2\gamma}{1-2\gamma}\); then \(S\le n(1+2\gamma)^T\). So a mistake forces \((1-4\gamma^2)^{\frac{T}{2}}\ge\frac{1}{n}\), and choosing \(T>\frac{2\ln n}{\ln\frac{1}{1-4\gamma^2}}\) guarantees a perfect classifier.
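A minimal Python sketch of this boosting loop (the weak-learner oracle interface `oracle(X, y, dist)` and all names are my assumptions):

```python
import numpy as np

def boost(X, y, oracle, T, gamma):
    """oracle(X, y, dist) returns a weak learner h (callable, outputs +/-1 per row)
    with weighted accuracy >= 1/2 + gamma under dist."""
    n = len(y)
    w = np.ones(n)
    alpha = (1 + 2 * gamma) / (1 - 2 * gamma)    # up-weight factor for mistakes
    learners = []
    for _ in range(T):
        h = oracle(X, y, w / w.sum())
        learners.append(h)
        wrong = (h(X) != y)
        w[wrong] *= alpha                         # focus on hard examples
    return lambda X_: np.sign(sum(h(X_) for h in learners))  # majority vote
```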
From a min-max perspective, let's consider a loss matrix \(V_{m\x n}\). Each row corresponds to a learner \(h_i\) while each column corresponds to a datapoint \(\dat x_j\). We know that
Therefore, there must exist a \(\bs p^*\), s.t.
So the majority vote of the \(h_i\)'s outputs, weighted by \(\bs p^*\), must be correct on every datapoint; otherwise a one-hot \(\bs q\) would violate the property.
Perceptron Algorithm
Recall that a linear classifier with parameter \(\dat w\) predicts the label of \(\dat x_t\) as \(\opn{sign}((\dat w\mid \dat x_t))\). Here we introduce the hinge loss:
Say the optimal classifier \(\dat w^*\) minimizes the total hinge loss (this differs from the Linear Classifier section, where we assumed \(y_t(\dat x_t\mid\dat w^*)\ge 1\), i.e. the total hinge loss is \(0\)). We can replay the proof from the Linear Classifier section but with a different upper bound:
The lower bound remains the same:
Assume the number of updates (mistakes) is \(m\); then
Streaming Algorithm
A classical problem is weighted sampling in a stream. Let \(a_1,a_2,\cdots,a_t,\cdots\) be positive weights. We maintain current sum \(s\) and current sample \(x_t\), and when receiving \(a_{t+1}\):
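A standard update rule that keeps \(\Pr[x_t=i]\propto a_i\), sketched in Python (names are mine):

```python
import random

def weighted_stream_sample(weights, rng=random.Random(0)):
    """One pass over positive weights; returns an index i with Pr[i] = a_i / sum(a)."""
    s, x = 0.0, None
    for t, a in enumerate(weights):
        s += a
        if rng.random() < a / s:      # replace the current sample w.p. a_t / s
            x = t
    return x
```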
Another one is majority maintenance. Let \(\seq a1n\in[m]\), \(n>2m\). We need to report the strict majority \(k\in[m]\) if it exists, or \(0\) otherwise. A memory lower bound is \(m~\u{bit}\), since if the latter half of the votes are all identical, we must be able to distinguish all \(2^{m}-1\) possible subsets of values occurring in the former half.
But we know how to bypass this bound: we no longer require the algorithm to report \(0\) when no strict majority exists. Now we maintain the current "winner" and its "hit points" \((w,h)\), and when receiving \(a_t\):
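A sketch of the classical update rule (Boyer-Moore voting; names are mine):

```python
def boyer_moore_majority(stream):
    """One counter suffices: returns the strict majority element if one exists."""
    w, h = None, 0                    # current "winner" and its "hit points"
    for a in stream:
        if h == 0:
            w, h = a, 1               # adopt a new winner
        elif a == w:
            h += 1
        else:
            h -= 1                    # a cancels one vote for w
    return w                          # only guaranteed correct if a strict majority exists
```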
The final \(w\) is reported. This algorithm can easily be extended to find all elements of frequency above \(\frac{n}{k}\) by maintaining \(k\) pairs \((w_i,h_i)\).
Now let's consider the distinct element counting problem. Let \(\seq a1n\in[m]\). We need to report \(\#\{\seq a1t\}\) for all \(t\). An exact solution obviously needs \(\mathcal O(m)\) memory, but there is an elegant randomized algorithm that uses far less.
Recall \(\Ex_{X\sim\opn{Unif}([0,1])^k}[\min_i X_i]=\frac{1}{k+1}\) (I love its combinatorial meaning! You may imagine an elegant proof of it). We pick a hash function \(h\) whose outputs are pairwise independent, and we maintain \(v_t=\min_{i=1}^t h(a_i)\). As this pseudo-random value concentrates around \(\frac{1}{c+1}\) (scaled by the hash range), where \(c\) is the current number of distinct elements, we can estimate \(c\) via \(v_t\). Take \(h(a_i)=(Aa_i+B)\bmod p\) with \(A,B\sim\opn{Unif}(\Z_p)^2\) as an example: for distinct inputs the pair \((h(a_i),h(a_j))\) is uniform over \(\Z_p^2\), so \(h\) is a pairwise independent hash function.
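A minimal Python sketch of this estimator, using \(p/v\) as the estimate (names and the particular prime \(p\) are mine):

```python
import random

def distinct_estimate(stream, p=2**31 - 1, rng=random.Random(0)):
    """Estimate the number of distinct elements via v = min (A*a + B) mod p, output p / v."""
    A, B = rng.randrange(1, p), rng.randrange(p)
    v = p
    for a in stream:
        v = min(v, (A * a + B) % p)
    return p / v

# within a constant factor of 1000 with good probability (see Theorem 1 below)
print(distinct_estimate([i % 1000 for i in range(10**5)]))
```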
Theorem 1
Let \(d^*\) be the ground truth of \(|\{\seq a1t\}|\), and then
\[\Pr\l[\frac{d^*}{6}\le\frac{p}{v}\le 6d^*\r]\ge\frac{2}{3}-\frac{d^*}{p}. \]
→ Proof. We are to show \(v\in[\frac{p}{6d^*},\frac{6p}{d^*}]\) with the stated probability. Let \(\{\seq b1{d^*}\}=\{\seq a1t\}\) be the distinct elements; we can bound
(Flooring or ceiling constants are ignored.)
For the event \(\frac{6p}{d^*}<v\), we need a trick: we want to bound \(\Pr\l[\A k,~h(b_k)>\frac{6p}{d^*}\r]\), but the \(h(b_k)\) are merely pairwise independent. We may instead sum indicator random variables and use Chebyshev's bound, whose variance computation relies only on pairwise independence. Here, letting \(Y_k=\mathbb 1_{h(b_k)\le 6p/d^*}\), we have
Now let's estimate the unbalance rate (second frequency moment) \(\sum_c(\#\{i:a_i=c\})^2\) of \(\{\seq a1n\}\). Like the method above, we take a hash function \(h:[m]\to\{\pm 1\}\) whose outputs are pairwise independent. It is easy to check that \(\Ex\br{\br{\sum_i h(a_i)}^2}=\sum_c(\#\{i:a_i=c\})^2\).
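A toy Python sketch of this \(\pm1\) estimator; for clarity it draws fully independent signs and stores them explicitly, so it is not memory-efficient the way a real pairwise-independent hash family would be (names are mine):

```python
import random

def f2_estimate(stream, trials=50, rng=random.Random(0)):
    """Estimate sum_c (#{i: a_i = c})^2 by averaging Z^2, where Z = sum_i h(a_i)."""
    total = 0.0
    for _ in range(trials):
        signs = {}                                   # the +/-1 hash, realized lazily
        z = 0
        for a in stream:
            if a not in signs:
                signs[a] = 1 if rng.random() < 0.5 else -1
            z += signs[a]
        total += z * z
    return total / trials

print(f2_estimate([1, 1, 2, 3, 3, 3]))   # true value is 4 + 1 + 9 = 14
```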
Matrix Sampling
We want to sketch the multiplication of two \(n\x n\) matrices \(A=\pmat{\bs a_1&\cdots&\bs a_n}\) and \(B=\pmat{\bs b_1^\T\\\vdots\\\bs b_n^\T}\), where
Obviously \(n\bs a_k\bs b_k^\T\) with \(k\gets\opn{Unif}([n])\) is an unbiased estimator of \(AB\), but it is not satisfactory (especially when the \(\bs a_k\) and \(\bs b_k\) are sparse).
A generalization of this method is to pick \(k\) from distribution \(P\) over \([n]\) and output \(\frac{1}{P(k)}\bs a_k\bs b_k^\T\). We should set \(P\) properly to minimize the "variance" \(\Ex[\|X-\Ex[X]\|_F^2]\). Since \(\Ex[X]=AB\) is fixed, our goal is to minimize
Therefore,
For implementation simplicity we set \(p_k\propto\|a_k\|^2\), and then we have
If we pick \(s\) samples instead of \(1\),
However, it works well only on (nearly) low-rank matrices: \(A=B=\bs 1\) (the identity) can easily make us sad 😦
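A minimal NumPy sketch of this length-squared sampling estimator (names are mine):

```python
import numpy as np

def sketch_product(A, B, s, rng=np.random.default_rng(0)):
    """Approximate A @ B by sampling s column/row pairs with p_k proportional to ||a_k||^2."""
    n = A.shape[1]
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    ks = rng.choice(n, size=s, p=p)
    return sum(np.outer(A[:, k], B[k, :]) / (s * p[k]) for k in ks)

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 200)); B = rng.standard_normal((200, 100))
E = sketch_product(A, B, s=50)
print(np.linalg.norm(A @ B - E) / np.linalg.norm(A @ B))   # relative Frobenius error
```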
Let's go on to sketching a single matrix \(A\). An intuitive method is to write \(A=A\x\bs 1\) and perform the sketch above. But \(\bs 1\) is full-rank, which frustrates us again 😦
Now, our goal is to find a "pseudo-identity matrix" \(P\) s.t. \(A\approx AP\).
With the sampling method above, we may pick \(s\) rows of \(A\), say \(\{\seq r1s\}\), and scale them to form \(R\), s.t. \(A^\T A\approx R^\T R\). Specifically we have a bound
Let \(\mathcal R=\lang \seq r1s\rang\) be the span of the sampled rows. We choose \(P\) with \(P|_{\mathcal R}=\id\) and \(P|_{\mathcal R^\perp}=\bs 0\), which means \(PR^\T y=R^\T y\) for all \(y\) and \(Px=0\) for every \(x\perp\mathcal R\). An easy construction is \(P=R^\T(RR^\T)^{-1}R\), where \((RR^\T)^{-1}\) denotes the Moore-Penrose pseudo-inverse of \(RR^\T\).
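A small NumPy sketch of this construction, checking that \(P\) is a projection that fixes the row space of \(R\) (names are mine):

```python
import numpy as np

def row_space_projection(R):
    """P = R^T (R R^T)^+ R: identity on the row space of R, zero on its orthogonal complement."""
    return R.T @ np.linalg.pinv(R @ R.T) @ R

rng = np.random.default_rng(2)
R = rng.standard_normal((5, 20))       # s = 5 sampled (scaled) rows in R^20
P = row_space_projection(R)
print(np.allclose(P @ P, P), np.allclose(R @ P, R))   # idempotent; fixes the row space
```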
Let's bound the spectral norm of \(A-AP\) (note: for a matrix \(A\), \(\|A\|_2\) means its spectral norm):
Finally we sketch \(AP\) to get \(X\) of \(t\) rows. Now
is bounded successfully.
Phase Transition of Random Graph
A random graph \(\mathcal G(n,p)\) is a graph on \(n\) vertices where each edge \((u,v)\) occurs w.p. \(p\) independently.
For a predicate \(P\), \(\Pr[P(\mathcal G(n,p))]\) usually exhibits a phase transition. E.g. \(P(G):G~\text{contains a triangle (3-cycle)}\), and we have
Its threshold turns out to be \(p^*(n)=\frac{1}{n}\) (which means, when \(p\ll p^*\), \(\Pr[\cdots]\to 0\), and vice versa).
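A small simulation (pure Python; names are mine) illustrating the jump around \(p^*=\frac{1}{n}\):

```python
import itertools
import random

def has_triangle(n, p, rng=random.Random()):
    """Sample G(n, p) and check whether it contains a triangle."""
    adj = [[False] * n for _ in range(n)]
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[u][v] = adj[v][u] = True
    return any(adj[u][v] and adj[v][w] and adj[u][w]
               for u, v, w in itertools.combinations(range(n), 3))

n, trials = 60, 200
for c in (0.2, 1.0, 5.0):              # p = c / n around the threshold p* = 1 / n
    hits = sum(has_triangle(n, c / n) for _ in range(trials))
    print(c, hits / trials)            # rises from near 0 to near 1 as c grows
```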
Let's go through the \(p\gg p^*(n)\) part.
