Machine Learning

Lec 1 (generated by Gemini 2.5 Pro, based on lec1 slide)

Course Definition and Scope

Textbooks emerge when a field becomes sufficiently established for broad teaching, often coinciding with the field becoming outdated or subsumed by newer methods. Foundation models represent such newer methods. AI can be decomposed as foundation models plus machine learning plus other legacy approaches. Machine learning remains essential as it provides the foundational basis for understanding foundation models. The course emphasizes understanding boundaries and connections between these areas.

The primary objective is to construct a comprehensive map of the machine learning world, enabling students to understand all core elements and their interrelationships, and to position new concepts within this framework. The pedagogical philosophy prioritizes understanding what a tool is over how to use it. True comprehension of what a tool is enables modification, innovation, and creation of new methods.

Course Structure and Content

The course covers six major areas: Introduction (logistics, history, overall framework), Optimization (stochastic gradient descent, convex optimization, mirror descent, SVRG, non-convex optimization), Generalization (no free lunch theorem, VC dimension, PAC learning, Bayesian optimality), Linear Methods (linear regression, ridge, LASSO, compressed sensing, SVM), Decision Trees (decision trees, random forests, boosting), and Metric Learning (nearest neighbor, spectral clustering, SimCLR, t-SNE, position embeddings). Additional topics include robust ML, hyperparameter tuning, and ML interpretability. The midterm examination covers up to linear methods, while the final examination encompasses all topics.

The theoretical framework underlying machine learning consists of three pillars. Representation theory addresses whether the function class possesses sufficient expressiveness, contrasting linear functions with neural networks. Optimization theory determines methods for finding parameters, such as SGD and its variants. Generalization theory examines whether learned parameters transfer to unseen tasks, a property empirically validated but theoretically unresolved under current frameworks.

Historical Development of Artificial Intelligence

1950: Turing proposed fundamental question "Can machines think?" Reframed as imitation game where human interrogator C distinguishes between human B and machine A, establishing operational criterion for machine intelligence.

1951: Minsky designed SNARC, first artificial neural network inspired by biological systems. Strachey and Prinz developed programs for checkers and chess, establishing game AI as intelligence benchmark.

1955: Newell and Simon created Logic Theorist for automated theorem proving.

1956: Dartmouth Conference where McCarthy coined term "Artificial Intelligence", marking official birth of the field.

1956-1974: Golden years characterized by high investment and optimism. Developments included search algorithms, first chatterbot ELIZA, and robotics. Optimistic predictions: Simon and Newell 1958 predicted digital computer world chess champion within ten years; Simon 1965 predicted machines capable of any human work within twenty years; Minsky 1970 predicted machine with average human intelligence within three to eight years.

1974-1980: First AI winter. Overblown expectations and failed promises led to funding collapse. Neural network research went dormant for a decade. Contributing factors: limited computational power, combinatorial explosion of search spaces, weak reasoning capabilities.

1980-1987: AI boom driven by expert systems using logical rules for domain-specific applications like healthcare. 1982: Hopfield proposed Hopfield networks. Hinton and Rumelhart developed backpropagation. Neural networks revived.

1987-1993: Second AI winter. Expert systems proved useful only in narrow scenarios. Most AI projects failed, over 300 companies shut down.

1993-2011: Steady development period. 1997: IBM Deep Blue defeated world chess champion. 2005: Stanford robot won DARPA Grand Challenge. Researchers avoided "AI" terminology to distance from past failures. Neural networks considered dead. Popular methods: Support Vector Machines, graphical models, reinforcement learning.

2012-2022: Deep learning era. 2012: Hinton's group achieved 16% ImageNet error (versus 26% second place) using deep learning, catalyzing paradigm shift. 2013: DeepMind exceeded human performance on Atari games. 2016: DeepMind AlphaGo defeated world champion Lee Sedol in Go.

2018-present: Foundation models era. Emergence of CLIP (text-image), ChatGPT, Midjourney, Copilot. This represents the current chapter of AI development.

Supervised Learning Framework

Supervised learning constitutes a fundamental machine learning paradigm based on learning from examples. Given input \(X = (x_1, x_2, \dots, x_N)\) and output labels \(Y = (y_1, y_2, \dots, y_N)\), the objective is to learn function \(f\) such that \(f(x_i) \approx y_i\). The conceptual pipeline: Data \(X \to f(X) \to\) Label \(Y\).

MNIST example: input \(x\) is 28×28 pixel handwritten digit image, output \(y \in \{0, 1, \dots, 9\}\). ImageNet example: input \(x\) is 224×224 pixel image, output \(y \in \{0, 1, \dots, 999\}\) corresponding to object category. Movie review example: input \(x\) is text paragraph, output \(y \in \{\text{good}, \text{bad}\}\).

Evaluation employs loss function \(l\) measuring distance between prediction \(f(X)\) and true labels \(Y\). For categorical targets (classification): \(l(f, x_i, y_i) = 1\) if \(f(x_i) \neq y_i\), otherwise \(l(f, x_i, y_i) = 0\). For continuous targets (regression): \(l(f, x_i, y_i) = \text{dist}(f(x_i), y_i)\), typically squared error \(l(f, x_i, y_i) = (f(x_i) - y_i)^2\). Total loss: \(L(f, X, Y) = \frac{1}{N} \sum_i l(f, x_i, y_i)\).
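
As a minimal sketch of these definitions (the function names here are illustrative, not from the lecture), the 0-1 loss, the squared loss, and the averaged total loss could be written as:

```python
import numpy as np

def zero_one_loss(f, x, y):
    # Classification: 1 if the prediction is wrong, 0 otherwise.
    return float(f(x) != y)

def squared_loss(f, x, y):
    # Regression: squared distance between prediction and target.
    return (f(x) - y) ** 2

def total_loss(f, X, Y, loss):
    # L(f, X, Y) = (1/N) * sum_i l(f, x_i, y_i)
    return float(np.mean([loss(f, x, y) for x, y in zip(X, Y)]))
```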

The supervised learning objective is finding \(f = \arg\min_f L(f, X, Y)\) through optimization. However, minimizing \(L\) alone is insufficient. A trivial memorization function (if-statement returning \(y_i\) for \(x_i\), random output otherwise) achieves \(L=0\) but fails on new data, demonstrating the necessity of generalization.

Generalization Theory

Effective function \(f\) must both fit training data (minimize training loss \(L_{\text{train}}\)) and generalize to unseen test data \((X', Y')\) from same distribution (minimize test loss \(L_{\text{test}} = L(f, X', Y')\)). The function should extract patterns rather than perform rote memorization.

Practical validation: partition training data into training set (e.g., 90%) and validation set (e.g., 10%). Train on training set, evaluate on validation set which remains unseen during training. Cross-validation provides robust estimation: partition data into \(k\) parts, iteratively use each part as validation with remaining \(k-1\) parts for training, average results across \(k\) iterations for accurate generalization estimate.
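
A minimal k-fold cross-validation sketch, assuming `X` and `Y` are NumPy arrays and `train_fn`/`eval_fn` are user-supplied training and evaluation routines (both hypothetical names):

```python
import numpy as np

def k_fold_cv(X, Y, train_fn, eval_fn, k=10, seed=0):
    # Shuffle indices, split into k folds, hold out each fold once,
    # and average the validation losses.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], Y[train_idx])
        scores.append(eval_fn(model, X[valid_idx], Y[valid_idx]))
    return float(np.mean(scores))
```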

Critical principle: never evaluate on test set during training. Test set contamination invalidates evaluation.

Training loss \(L_{\text{train}} = L(f, X_{\text{train}}, Y_{\text{train}})\) measures training fit. Test loss \(L_{\text{test}} = L(f, X_{\text{test}}, Y_{\text{test}})\) measures generalization. Validation loss \(L_{\text{valid}} = L(f, X_{\text{valid}}, Y_{\text{valid}})\) provides practical test loss estimation. Population loss \(L_{\text{population}} = \mathbb{E}_{X,Y \sim D_X, D_Y} L(f, X, Y)\) represents theoretical true loss over entire data distribution. Ultimate supervised learning objective: minimize population loss.

Underfitting occurs when function lacks sufficient representational capacity (e.g., linear function for complex nonlinear relationship), resulting in poor \(L_{\text{train}}\). Overfitting occurs when function has excessive capacity (e.g., overparameterized neural network), achieving \(L_{\text{train}} = 0\) through memorizing noise, but yielding poor \(L_{\text{test}}\).

Classical perspective: regularization restricts representation power to ensure simplicity, preventing overfitting. Function minimizing empirical loss \(L_{\text{train}}\) need not minimize population loss. Modern perspective: explicit regularization is not always necessary for neural networks. Implicit regularizations from SGD optimization prevent overfitting. While overfitting could theoretically degrade \(L_{\text{test}}\), this almost never manifests in modern deep learning practice.

Supervised Learning Procedure

Step 1: Identify target task (e.g., MNIST digit recognition).

Step 2: Construct dataset with thousands or millions of examples.

Step 3: Define loss function \(L\) for evaluating \(f(X)\).

Step 4: Learn function \(f\) minimizing \(L\) through optimization.

Step 5: Ensure generalization beyond training loss minimization. Employ validation set in practice. Theoretical guarantees examined separately.

Dataset Construction

High-quality large-scale data surpasses all subsequent learning techniques in importance. Dataset construction capabilities are not universal.

Creating million-scale datasets requires two components. First, input \(X\) collection via web scraping or alternative sources. Second, correct label \(Y\) acquisition through crowdsourcing. Amazon Mechanical Turk (https://www.mturk.com/) exemplifies this approach, decomposing large labeling tasks into millions of micro-tasks for thousands of workers. Multiple workers label identical data points, with majority voting ensuring correctness.

Complex labeling tasks (e.g., MS COCO dataset) require sophisticated approaches. Segmentation mask annotation exemplifies this difficulty. Cost minimization and quality assurance achieved through assembly line decomposition, where each worker performs single simple step.

Innovative approaches like ReCAPTCHA integrate labeling into useful services, obtaining free labels. Users digitize books while solving CAPTCHAs.

Practical challenges: specialized domain labeling (healthcare, legal) requires expert knowledge. Workers may behave strategically, minimizing effort to maximize compensation, potentially using AI to generate labels. Dataset construction frequently constitutes the primary bottleneck in machine learning applications.

Lec 2

Smoothness, Convexity, Gradient Descent

First order Taylor expansion: \(f(w') = f(w) + \nabla f(w)^\top (w' - w) + \frac{g(w')}{2} \|w' - w\|^2\).

Smoothness assumption: \(\exists L\) s.t. \(g(w') \le L\) for all \(w'\).

Let \(Q(w') = f(w) + \nabla f(w)^\top (w' - w) + \frac{L}{2} \|w' - w\|^2\), then \(f(w') \le Q(w')\) for all \(w'\) and \(Q(w')\) is minimized at \(w' = w - \frac{1}{L} \nabla f(w)\).

Alternative definition of smoothness (gradient Lipschitz): \(\|\nabla f(w) - \nabla f(w')\| \le L \|w - w'\|\) for all \(w, w'\).

Lemma. Gradient Lipschitz is equivalent to \(|f(y) - f(x) - \nabla f(x)^\top (y - x)| \le \frac{L}{2} \|y - x\|^2\) for all \(x, y\) (the \(\Leftarrow\) direction below additionally assumes \(f\) is twice differentiable).

Proof. \(\Rightarrow\): We see that

\[\begin{aligned} & |f(w') - f(w) - \nabla f(w)^\top (w' - w)| \\ =& \left| \int_0^1 \nabla f(w + t(w' - w))^\top (w' - w) dt - \nabla f(w)^\top (w' - w) \right| \\ =& \left| \int_0^1 (\nabla f(w + t(w' - w)) - \nabla f(w))^\top (w' - w) dt \right| \\ \le& \int_0^1 \left| (\nabla f(w + t(w' - w)) - \nabla f(w))^\top (w' - w) \right| dt \\ \le& \int_0^1 \|\nabla f(w + t(w' - w)) - \nabla f(w)\| \|w' - w\| dt \\ \le& \int_0^1 L \|t(w' - w)\| \|w' - w\| dt \\ =& \frac{L}{2} \|w' - w\|^2. \end{aligned}\]

\(\Leftarrow\): By Taylor expansion with integral remainder,

\[\begin{aligned} f(w') - f(w) - \nabla f(w)^\top (w' - w) =& \int_0^1 (w' - w)^\top \nabla^2 f(w + t(w' - w)) (w' - w) (1-t) dt. \end{aligned}\]

We have

\[\begin{aligned} \left| \int_0^1 (w' - w)^\top \nabla^2 f(w + t(w' - w)) (w' - w) (1-t) dt \right| \le \frac{L}{2} \|w' - w\|^2. \end{aligned}\]

Let \(w' = w + \delta u\) for a unit vector \(u\) and \(\delta \to 0^+\), then

\[\begin{aligned} \left| \int_0^1 (\delta u)^\top \nabla^2 f(w + t\delta u) (\delta u) (1-t) dt \right| \le& \frac{L}{2} \|\delta u\|^2. \end{aligned}\]

That is,

\[\begin{aligned} \left| \int_0^1 u^\top \nabla^2 f(w + t\delta u) u (1-t) dt \right| \le& \frac{L}{2}. \end{aligned}\]

As \(\delta \to 0^+\), the integral converges to \(\int_0^1 u^\top \nabla^2 f(w) u (1-t) dt = \frac{1}{2} u^\top \nabla^2 f(w) u\), so \(|u^\top \nabla^2 f(w) u| \le L\) for any unit vector \(u\).
This means \(\|\nabla^2 f(w)\| \le L\), so

\[\begin{aligned} \|\nabla f(w) - \nabla f(w')\| =& \left\| \int_0^1 \nabla^2 f(w' + t(w - w')) (w - w') dt \right\| \\ \le& \int_0^1 \|\nabla^2 f(w' + t(w - w')) (w - w')\| dt \\ \le& \int_0^1 \|\nabla^2 f(w' + t(w - w'))\| \|w - w'\| dt \\ \le& \int_0^1 L \|w - w'\| dt \\ =& L \|w - w'\|. \end{aligned}\]

If \(w' = w - \eta \nabla f(w)\), then we have

\[\begin{aligned} f(w') - f(w) \le & \nabla f(w)^\top (w' - w) + \frac{L}{2} \|w' - w\|^2 \\ = & \nabla f(w)^\top (-\eta \nabla f(w)) + \frac{L}{2} \|\eta \nabla f(w)\|^2 \\ = & -\eta \left(1 - \frac{L\eta}{2}\right) \|\nabla f(w)\|^2. \end{aligned}\]

To make sure \(f(w') < f(w)\), we need \(\eta < \frac{2}{L}\).
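
A minimal gradient descent sketch using the step size \(\eta = \frac{1}{L}\) suggested by the smoothness bound (any \(\eta < \frac{2}{L}\) guarantees descent); the quadratic example and its smoothness constant are illustrative:

```python
import numpy as np

def gradient_descent(grad_f, w0, L, steps=100):
    # Step size 1/L satisfies eta < 2/L, so each step decreases f.
    w = np.array(w0, dtype=float)
    eta = 1.0 / L
    for _ in range(steps):
        w = w - eta * grad_f(w)
    return w

# Example: f(w) = 0.5 * w^T A w with A = diag(1, 10) is 10-smooth.
A = np.diag([1.0, 10.0])
w_min = gradient_descent(lambda w: A @ w, w0=[5.0, -3.0], L=10.0)
```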

Convex: \(f(w') \ge f(w) + \nabla f(w)^\top (w' - w)\).

\(\mu\)-strongly convex: A function \(f\) is \(\mu\)-strongly convex if it satisfies any of the following equivalent conditions:

(a) \(f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y) - \alpha(1-\alpha)\frac{\mu}{2}\|x - y\|^2\) for all \(x, y\) and \(\alpha \in [0,1]\)

(b) \(f(y) \ge f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|x - y\|^2\) for all \(x, y\)

(c) \((\nabla f(x) - \nabla f(y))^\top(x-y) \ge \mu\|x - y\|^2\) for all \(x, y\)

(d) \(\nabla^2 f(x) \succeq \mu I\) for all \(x\) (i.e., \(\lambda_{\min}(\nabla^2 f(x)) \ge \mu\))

Lemma. The four definitions of \(\mu\)-strongly convex are equivalent.

Proof. We will prove (a) \(\Rightarrow\) (b) \(\Rightarrow\) (c) \(\Rightarrow\) (d) \(\Rightarrow\) (a).

(a) \(\Rightarrow\) (b): (a) is equivalent to

\[f(x + (1-\alpha)(y-x)) - f(x) \le (1-\alpha)(f(y) - f(x)) - \alpha(1-\alpha)\frac{\mu}{2}\|x - y\|^2. \]

Dividing by \(1-\alpha\) and letting \(\alpha \to 1^-\) gives

\[f(y) \ge f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|x - y\|^2. \]

(b) \(\Rightarrow\) (c): By (b), we have

\[f(y) \ge f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|x - y\|^2 \]

and

\[f(x) \ge f(y) + \nabla f(y)^\top(x-y) + \frac{\mu}{2}\|x - y\|^2. \]

Adding the two inequalities yields

\[(\nabla f(x) - \nabla f(y))^\top(x-y) \ge \mu\|x - y\|^2. \]

(c) \(\Rightarrow\) (d): Let \(x - y = tu\) where \(\|u\| = 1\). Then

\[(\nabla f(y+tu) - \nabla f(y))^\top u \ge \mu t. \]

Dividing by \(t\) and letting \(t \to 0^+\), we obtain

\[u^\top\nabla^2 f(y)u \ge \mu. \]

Since \(u\) is arbitrary, we have

\[\lambda_{\min}(\nabla^2 f(y)) = \inf_{\|u\|=1} u^\top\nabla^2 f(y)u \ge \mu. \]

(d) \(\Rightarrow\) (a): By Taylor expansion with integral remainder,

\[\begin{aligned} f(y) =& f(z) + \nabla f(z)^\top(y-z) + \int_0^1 (y-z)^\top\nabla^2 f(z+t(y-z))(y-z)(1-t)dt \\ \ge& f(z) + \nabla f(z)^\top(y-z) + \mu\int_0^1 \|y-z\|^2(1-t)dt \\ =& f(z) + \nabla f(z)^\top(y-z) + \frac{\mu}{2}\|y - z\|^2. \end{aligned}\]

Similarly, for \(x\),

\[f(x) \ge f(z) + \nabla f(z)^\top(x-z) + \frac{\mu}{2}\|x - z\|^2. \]

Multiplying the first inequality by \((1-\alpha)\) and the second by \(\alpha\), then adding them gives

\[\begin{aligned} (1-\alpha)f(y) + \alpha f(x) \ge& f(z) + \nabla f(z)^\top((1-\alpha)(y-z) + \alpha(x-z)) \\ & + \frac{\mu}{2}((1-\alpha)\|y-z\|^2 + \alpha\|x-z\|^2). \end{aligned}\]

Let \(z = \alpha x + (1-\alpha)y\), then \((1-\alpha)(y-z) + \alpha(x-z) = 0\), so

\[\begin{aligned} (1-\alpha)f(y) + \alpha f(x) \ge& f(\alpha x + (1-\alpha)y) \\ & + \frac{\mu}{2}((1-\alpha)\|(1-\alpha)(y-x)\|^2 + \alpha\|\alpha(x-y)\|^2) \\ =& f(\alpha x + (1-\alpha)y) + \frac{\mu}{2}\alpha(1-\alpha)\|x-y\|^2. \end{aligned}\]

Convergence Analysis

Lemma. If \(f\) is \(L\)-smooth and convex, \(w^* = \arg \min_w f(w)\), then running GD with \(\eta \le \frac{1}{L}\) gives

\[f(w_t) - f(w^*) \le \frac{\|w_0 - w^*\|^2}{2\eta t}. \]

Proof. Combining smoothness and convexity (the first inequality below uses \(\eta \le \frac{1}{L}\), so \(\eta(1 - \frac{L\eta}{2}) \ge \frac{\eta}{2}\)):

\[\begin{aligned} f(w_{i+1}) \le& f(w_i) - \frac{\eta}{2} \|\nabla f(w_i)\|^2 \\ \le& f(w^*) + \langle \nabla f(w_i), w_i - w^* \rangle - \frac{\eta}{2} \|\nabla f(w_i)\|^2 \\ =& f(w^*) - \frac{1}{\eta} \langle w_{i+1} - w_i, w_i - w^* \rangle - \frac{1}{2\eta} \|w_i - w_{i+1}\|^2 \\ =& f(w^*) + \frac{1}{2\eta} \|w_i - w^*\|^2 - \frac{1}{2\eta} (\|w_i - w^*\|^2 - 2\langle w_i - w_{i+1}, w_i - w^* \rangle + \|w_i - w_{i+1}\|^2) \\ =& f(w^*) + \frac{1}{2\eta} \|w_i - w^*\|^2 - \frac{1}{2\eta} \|w_i - w_{i+1} - w_i + w^*\|^2 \\ =& f(w^*) + \frac{1}{2\eta} \|w_i - w^*\|^2 - \frac{1}{2\eta} \|w_{i+1} - w^*\|^2. \end{aligned}\]

That is,

\[f(w_{i+1}) - f(w^*) \le \frac{1}{2\eta} (\|w_i - w^*\|^2 - \|w_{i+1} - w^*\|^2). \]

Taking summation over \(i\) from \(0\) to \(t-1\), we have (telescoping)

\[\sum_{i=0}^{t-1} (f(w_{i+1}) - f(w^*)) \le \frac{1}{2\eta} (\|w_0 - w^*\|^2 - \|w_t - w^*\|^2) \le \frac{\|w_0 - w^*\|^2}{2\eta}. \]

Since \(f(w_i)\) is non-increasing, we have

\[f(w_t) - f(w^*) \le \frac{1}{t} \sum_{i=0}^{t-1} (f(w_{i+1}) - f(w^*)) \le \frac{\|w_0 - w^*\|^2}{2\eta t}. \]

Therefore, GD converges at rate \(O(1/t)\).

Lec 3

Stochastic Gradient Descent

Idea: At each step, instead of computing the full gradient over all \(n\) samples, we compute an unbiased estimate of the gradient based on a random sample (or a mini-batch of samples). Let the full objective be \(f(w) = \frac{1}{n} \sum_{j=1}^n f_j(w)\).

At step \(i\), we randomly pick an index \(j_i \in \{1, ..., n\}\) and update using the gradient of that single component function. Let \(g_i = \nabla f_{j_i}(w_i)\). The update rule is \(w_{i+1} = w_i - \eta g_i\).
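
A minimal SGD sketch, assuming a user-supplied `grad_fj(w, j)` (a hypothetical name) that returns \(\nabla f_j(w)\) for component \(j\):

```python
import numpy as np

def sgd(grad_fj, n, w0, eta, steps, seed=0):
    # At each step, sample one component index uniformly and move
    # along the negative stochastic gradient.
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        j = rng.integers(n)
        w = w - eta * grad_fj(w, j)
    return w
```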

To analyze the convergence of SGD, we make the following standard assumptions:

  • The stochastic gradient is an unbiased estimator of the true gradient. Let \(E_i[\cdot]\) denote the conditional expectation over the choice of \(j_i\) given \(w_i\).

    \[E_i[g_i] = E_{j_i \sim U(1,..,n)}[\nabla f_{j_i}(w_i)] = \frac{1}{n} \sum_{j=1}^n \nabla f_j(w_i) = \nabla f(w_i). \]

  • The variance of the stochastic gradient is bounded by a constant \(\sigma^2\).

    \[E_i[\|g_i - \nabla f(w_i)\|^2] \le \sigma^2. \]

    This implies \(E_i[\|g_i\|^2] = E_i[\|g_i - \nabla f(w_i) + \nabla f(w_i)\|^2] = E_i[\|g_i - \nabla f(w_i)\|^2] + \|\nabla f(w_i)\|^2 \le \sigma^2 + \|\nabla f(w_i)\|^2\), where the cross term vanishes by unbiasedness.

SGD Convergence Analysis

Lemma. If \(f\) is \(L\)-smooth and convex, \(w^* = \arg \min_w f(w)\), then running SGD with constant step-size \(\eta \le \frac{1}{L}\) and taking an average of iterates \(\bar{w}_t = \frac{1}{t} \sum_{i=1}^t w_i\) gives

\[E[f(\bar{w}_t)] - f(w^*) \le \frac{\|w_0 - w^*\|^2}{2\eta t} + \frac{\eta\sigma^2}{2}. \]

Proof. First, we establish a one-step progress inequality. By \(L\)-smoothness,

\[f(w_{i+1}) \le f(w_i) + \nabla f(w_i)^\top (w_{i+1} - w_i) + \frac{L}{2} \|w_{i+1} - w_i\|^2. \]

Take conditional expectation \(E_i[\cdot]\) over the choice of sample \(j_i\) at step \(i\):

\[\begin{aligned} E_i[f(w_{i+1})] \le& f(w_i) + \nabla f(w_i)^\top E_i[w_{i+1} - w_i] + \frac{L}{2} E_i[\|w_{i+1} - w_i\|^2] \\ =& f(w_i) - \eta \nabla f(w_i)^\top E_i[g_i] + \frac{L\eta^2}{2} E_i[\|g_i\|^2] \\ \le& f(w_i) - \eta \|\nabla f(w_i)\|^2 + \frac{L\eta^2}{2} (\|\nabla f(w_i)\|^2 + \sigma^2) \\ =& f(w_i) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla f(w_i)\|^2 + \frac{L\eta^2\sigma^2}{2}. \end{aligned}\]

Since we choose \(\eta \le \frac{1}{L}\), we have \(1 - \frac{L\eta}{2} \ge \frac{1}{2}\), so

\[E_i[f(w_{i+1})] \le f(w_i) - \frac{\eta}{2}\|\nabla f(w_i)\|^2 + \frac{\eta\sigma^2}{2}. \]

Now, we combine this with convexity. For a convex function, \(f(w_i) \le f(w^*) + \nabla f(w_i)^\top(w_i - w^*)\).

\[\begin{aligned} E_i[f(w_{i+1})] \le f(w^*) + \nabla f(w_i)^\top(w_i - w^*) - \frac{\eta}{2}\|\nabla f(w_i)\|^2 + \frac{\eta\sigma^2}{2}. \end{aligned}\]

We use the identity \(2\eta \nabla f(w_i)^\top(w_i - w^*) = 2\eta E_i[g_i]^\top(w_i - w^*) = E_i[2\eta g_i^\top(w_i - w^*)]\).
From the expansion \(\|w_{i+1} - w^*\|^2 = \|w_i - \eta g_i - w^*\|^2 = \|w_i - w^*\|^2 - 2\eta g_i^\top(w_i - w^*) + \eta^2\|g_i\|^2\), we have

\[2\eta g_i^\top(w_i - w^*) = \|w_i - w^*\|^2 - \|w_{i+1} - w^*\|^2 + \eta^2\|g_i\|^2. \]

Taking conditional expectation \(E_i[\cdot]\):

\[2\eta \nabla f(w_i)^\top(w_i - w^*) = \|w_i - w^*\|^2 - E_i[\|w_{i+1} - w^*\|^2] + \eta^2 E_i[\|g_i\|^2]. \]

Substituting this back and dividing by 2:

\[\begin{aligned} E_i[f(w_{i+1})] - f(w^*) \le& \frac{1}{2\eta} (\|w_i - w^*\|^2 - E_i[\|w_{i+1} - w^*\|^2] + \eta^2 E_i[\|g_i\|^2]) \\ & - \frac{\eta}{2}\|\nabla f(w_i)\|^2 + \frac{\eta\sigma^2}{2} \\ \le& \frac{1}{2\eta} (\|w_i - w^*\|^2 - E_i[\|w_{i+1} - w^*\|^2]) + \frac{\eta}{2}(\|\nabla f(w_i)\|^2 + \sigma^2) \\ & - \frac{\eta}{2}\|\nabla f(w_i)\|^2 + \frac{\eta\sigma^2}{2} \\ =& \frac{1}{2\eta} (\|w_i - w^*\|^2 - E_i[\|w_{i+1} - w^*\|^2]) + \frac{\eta\sigma^2}{2}. \end{aligned}\]

Taking total expectation \(E[\cdot]\) over all steps (using the tower rule \(E[X] = E[E_i[X]]\)):

\[E[f(w_{i+1})] - f(w^*) \le \frac{1}{2\eta} (E[\|w_i - w^*\|^2] - E[\|w_{i+1} - w^*\|^2]) + \frac{\eta\sigma^2}{2}. \]

Taking summation over \(i\) from \(0\) to \(t-1\):

\[\sum_{i=0}^{t-1} (E[f(w_{i+1})] - f(w^*)) \le \frac{1}{2\eta} \sum_{i=0}^{t-1} (E[\|w_i - w^*\|^2] - E[\|w_{i+1} - w^*\|^2]) + \sum_{i=0}^{t-1} \frac{\eta\sigma^2}{2}. \]

The first term on the right is a telescoping sum:

\[\sum_{i=1}^{t} (E[f(w_i)] - f(w^*)) \le \frac{1}{2\eta} (\|w_0 - w^*\|^2 - E[\|w_t - w^*\|^2]) + \frac{t\eta\sigma^2}{2}. \]

Since \(\|w_t - w^*\|^2 \ge 0\), we have

\[\sum_{i=1}^{t} (E[f(w_i)] - f(w^*)) \le \frac{\|w_0 - w^*\|^2}{2\eta} + \frac{t\eta\sigma^2}{2}. \]

By convexity and Jensen's inequality, for the average iterate \(\bar{w}_t = \frac{1}{t}\sum_{i=1}^t w_i\), we have

\[E[f(\bar{w}_t)] \le E\left[\frac{1}{t}\sum_{i=1}^t f(w_i)\right] = \frac{1}{t}\sum_{i=1}^t E[f(w_i)]. \]

Combining these, we get

\[t(E[f(\bar{w}_t)] - f(w^*)) \le \sum_{i=1}^{t} (E[f(w_i)] - f(w^*)) \le \frac{\|w_0 - w^*\|^2}{2\eta} + \frac{t\eta\sigma^2}{2}. \]

Dividing by \(t\) gives the final result:

\[E[f(\bar{w}_t)] - f(w^*) \le \frac{\|w_0 - w^*\|^2}{2\eta t} + \frac{\eta\sigma^2}{2}. \]

To optimize this bound, we can set \(\eta \propto 1/\sqrt{t}\), which makes both terms on the right side of order \(O(1/\sqrt{t})\). Therefore, SGD converges at rate \(O(1/\sqrt{t})\).

Stochastic Variance Reduced Gradient

Idea: SVRG aims to combine the low iteration cost of SGD with the fast convergence of GD. It achieves this by reducing the variance of the stochastic gradient estimator. The core idea is to periodically compute a full gradient at a "snapshot" or "anchor" point \(\tilde{w}\), and then use this information to correct the subsequent stochastic gradients.

The standard SGD update uses a noisy gradient \(g_i = \nabla f_{j_i}(w_i)\). The SVRG update uses a modified, lower-variance gradient estimator \(v_i\):

\[v_i = \nabla f_{j_i}(w_i) - \nabla f_{j_i}(\tilde{w}) + \tilde{u}, \quad \text{where } \tilde{u} = \nabla f(\tilde{w}). \]

This estimator is still unbiased, since \(E_i[v_i] = E_i[\nabla f_{j_i}(w_i)] - E_i[\nabla f_{j_i}(\tilde{w})] + \tilde{u} = \nabla f(w_i) - \nabla f(\tilde{w}) + \nabla f(\tilde{w}) = \nabla f(w_i)\).

SVRG operates in epochs (outer loop, indexed by \(s\)).

  • At the start of each epoch \(s\), set an anchor point \(\tilde{w} = \tilde{w}_{s-1}\).
  • Compute the full gradient at this anchor: \(\tilde{u} = \nabla f(\tilde{w})\).
  • Run an inner loop of \(m\) iterations (indexed by \(t=0, \dots, m-1\)) starting from \(w_0 = \tilde{w}\), using the update rule \(w_{t+1} = w_t - \eta v_t\).
  • At the end of the epoch, select the next anchor \(\tilde{w}_s\). Two options are common:
    • Option I: \(\tilde{w}_s = w_m\) (the last iterate).
    • Option II: \(\tilde{w}_s\) is chosen uniformly at random from \(\{w_0, w_1, \dots, w_{m-1}\}\).

The following analysis uses Option II.
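
A minimal SVRG sketch with Option II, again assuming a hypothetical `grad_fj(w, j)` returning \(\nabla f_j(w)\):

```python
import numpy as np

def svrg(grad_fj, n, w0, eta, m, epochs, seed=0):
    rng = np.random.default_rng(seed)
    w_tilde = np.array(w0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the anchor point.
        u_tilde = np.mean([grad_fj(w_tilde, j) for j in range(n)], axis=0)
        w = w_tilde.copy()
        inner = [w.copy()]
        for _ in range(m):
            j = rng.integers(n)
            # Variance-reduced gradient estimator v_t.
            v = grad_fj(w, j) - grad_fj(w_tilde, j) + u_tilde
            w = w - eta * v
            inner.append(w.copy())
        # Option II: next anchor is uniform over {w_0, ..., w_{m-1}}.
        w_tilde = inner[rng.integers(m)]
    return w_tilde
```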

SVRG Convergence Analysis

Lemma. If \(f\) is \(L\)-smooth and \(\mu\)-strongly convex, then running SVRG with a sufficiently small step-size \(\eta\) and sufficiently large inner loop size \(m\) achieves a linear convergence rate. Specifically, there exists a constant \(\alpha < 1\) such that

\[E[f(\tilde{w}_s) - f(w^*)] \le \alpha E[f(\tilde{w}_{s-1}) - f(w^*)]. \]

Proof. The proof consists of three main parts: bounding the variance of the SVRG gradient, analyzing the one-step progress, and finally summing over an epoch to establish the linear rate.

Let \(v_t = \nabla f_{j_t}(w_t) - \nabla f_{j_t}(\tilde{w}) + \tilde{u}\). We want to bound \(E_t[\|v_t\|^2]\).
Using the inequality \(\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2\), we have

\[\begin{aligned} E_t[\|v_t\|^2] =& E_t[\|\left(\nabla f_{j_t}(w_t) - \nabla f_{j_t}(w^*)\right) + \left(\nabla f_{j_t}(w^*) - \nabla f_{j_t}(\tilde{w}) + \tilde{u}\right)\|^2] \\ \le& 2 E_t[\|\nabla f_{j_t}(w_t) - \nabla f_{j_t}(w^*)\|^2] + 2 E_t[\|\nabla f_{j_t}(\tilde{w}) - \nabla f_{j_t}(w^*) - \tilde{u}\|^2]. \end{aligned}\]

The second term has the form \(E[\|X - E[X]\|^2] \le E[\|X\|^2]\), with \(X = \nabla f_{j_t}(\tilde{w}) - \nabla f_{j_t}(w^*)\) whose mean is \(\tilde{u}\) (since \(\nabla f(w^*) = 0\)).
From the L-smoothness of \(f\), we now derive a key property. For each component function \(f_j\), define an auxiliary function \(g_j(w) = f_j(w) - f_j(w^*) - \nabla f_j(w^*)^\top(w-w^*)\). Since \(f_j\) is L-smooth and convex, so is \(g_j\). Note that \(\nabla g_j(w) = \nabla f_j(w) - \nabla f_j(w^*)\) and \(w^*\) is the minimizer of \(g_j(w)\) (since \(\nabla g_j(w^*) = 0\)). We see that

\[\begin{aligned} 0 = g_j(w^*) \le& g_j(w - \eta \nabla g_j(w)) \\ \le& g_j(w) - \eta \|\nabla g_j(w)\|^2 + \frac{L\eta^2}{2} \|\nabla g_j(w)\|^2. \end{aligned}\]

Taking \(\eta = \frac{1}{L}\) gives

\[\|\nabla g_j(w)\|^2 \le 2Lg_j(w). \]

Therefore,

\[\|\nabla f_j(w) - \nabla f_j(w^*)\|^2 \le 2L(f_j(w) - f_j(w^*) - \nabla f_j(w^*)^\top(w - w^*)). \]

Taking expectation over the random index \(j\):

\[\begin{aligned} E_j[\|\nabla f_j(w) - \nabla f_j(w^*)\|^2] &\le 2L \cdot E_j[f_j(w) - f_j(w^*) - \nabla f_j(w^*)^\top(w - w^*)] \\ &= 2L(f(w) - f(w^*) - \nabla f(w^*)^\top(w - w^*)). \end{aligned}\]

Since \(w^*\) is the minimizer of the full function \(f\), we have \(\nabla f(w^*) = 0\). This gives the desired property:

\[E_j[\|\nabla f_j(w) - \nabla f_j(w^*)\|^2] \le 2L(f(w) - f(w^*)). \]

Applying this property to the two terms in our variance bound yields:

\[E_t[\|v_t\|^2] \le 4L(f(w_t) - f(w^*)) + 4L(f(\tilde{w}) - f(w^*)). \]

We examine the change in expected distance to the optimum.

\[\begin{aligned} E[\|w_{t+1} - w^*\|^2] =& E[\|w_t - \eta v_t - w^*\|^2] \\ =& E[\|w_t - w^*\|^2] - 2\eta E[v_t^\top(w_t - w^*)] + \eta^2 E[\|v_t\|^2] \\ =& E[\|w_t - w^*\|^2] - 2\eta E[\nabla f(w_t)^\top(w_t - w^*)] + \eta^2 E[\|v_t\|^2]. \end{aligned}\]

Using convexity, \(\nabla f(w_t)^\top(w_t - w^*) \ge f(w_t) - f(w^*)\). Substituting this and the variance bound from part 1:

\[\begin{aligned} E[\|w_{t+1} - w^*\|^2] \le& E[\|w_t - w^*\|^2] - 2\eta E[f(w_t) - f(w^*)] + \eta^2 E[4L(f(w_t) - f(w^*)) + 4L(f(\tilde{w}) - f(w^*))] \\ =& E[\|w_t - w^*\|^2] - 2\eta(1 - 2L\eta)E[f(w_t) - f(w^*)] + 4L\eta^2 E[f(\tilde{w}) - f(w^*)]. \end{aligned}\]

Rearranging the terms gives a useful inequality for the sub-optimality gap:

\[2\eta(1 - 2L\eta)E[f(w_t) - f(w^*)] \le E[\|w_t - w^*\|^2] - E[\|w_{t+1} - w^*\|^2] + 4L\eta^2 E[f(\tilde{w}) - f(w^*)]. \]

We sum the above inequality over the inner loop, for \(t=0, \dots, m-1\).

\[\begin{aligned} & 2\eta(1 - 2L\eta) \sum_{t=0}^{m-1} E[f(w_t) - f(w^*)] \\ \le& \sum_{t=0}^{m-1} (E[\|w_t - w^*\|^2] - E[\|w_{t+1} - w^*\|^2]) + \sum_{t=0}^{m-1} 4L\eta^2 E[f(\tilde{w}) - f(w^*)] \\ =& (E[\|w_0 - w^*\|^2] - E[\|w_m - w^*\|^2]) + 4Lm\eta^2 E[f(\tilde{w}) - f(w^*)]. \end{aligned}\]

Using Option II, we have \(E[f(\tilde{w}_s) - f(w^*)] = \frac{1}{m}\sum_{t=0}^{m-1} E[f(w_t) - f(w^*)]\). We also drop the non-positive term \(-E[\|w_m - w^*\|^2]\) and use the fact that \(w_0 = \tilde{w} = \tilde{w}_{s-1}\).

\[2m\eta(1 - 2L\eta) E[f(\tilde{w}_s) - f(w^*)] \le E[\|\tilde{w}_{s-1} - w^*\|^2] + 4Lm\eta^2 E[f(\tilde{w}_{s-1}) - f(w^*)]. \]

Now, we use the property of \(\mu\)-strong convexity: \(\|w - w^*\|^2 \le \frac{2}{\mu}(f(w) - f(w^*))\).

\[\begin{aligned} 2m\eta(1 - 2L\eta) E[f(\tilde{w}_s) - f(w^*)] &\le \frac{2}{\mu}E[f(\tilde{w}_{s-1}) - f(w^*)] + 4Lm\eta^2 E[f(\tilde{w}_{s-1}) - f(w^*)] \\ &= \left(\frac{2}{\mu} + 4Lm\eta^2\right) E[f(\tilde{w}_{s-1}) - f(w^*)]. \end{aligned}\]

Finally, dividing both sides yields the recursive relationship:

\[E[f(\tilde{w}_s) - f(w^*)] \le \left( \frac{1}{\mu m\eta(1 - 2L\eta)} + \frac{2L\eta}{1 - 2L\eta} \right) E[f(\tilde{w}_{s-1}) - f(w^*)]. \]

By choosing \(\eta\) small enough (e.g., \(\eta < 1/(4L)\)) and \(m\) large enough such that the factor \(\alpha = \frac{1}{\mu m\eta(1 - 2L\eta)} + \frac{2L\eta}{1 - 2L\eta}\) is less than 1, we obtain geometric (linear) convergence.

Mirror Descent

Idea: Standard Gradient Descent makes progress proportional to the squared norm of the gradient, i.e., \(f(x_k) - f(x_{k+1}) \ge C\|\nabla f(x_k)\|^2\). This is effective when the gradient is large, but can be very slow when the gradient is small. Mirror Descent is an algorithm that works well even when the gradient is small, by changing the geometry used to measure distance and perform updates.

Instead of using the standard squared Euclidean distance, Mirror Descent uses Bregman divergence, a more general measure of "distance" between two points.

Definition: Given a continuously-differentiable, 1-strongly convex function \(w(x)\) called the distance generating function, the Bregman divergence from \(x\) to \(y\) is defined as:

\[V_x(y) = w(y) - w(x) - \nabla w(x)^\top (y-x). \]

This measures the difference between the value of \(w(y)\) and its first-order Taylor approximation at \(x\).

Example. If we choose the simplest distance generating function \(w(x) = \frac{1}{2}\|x\|^2\) (which is 1-strongly convex), then \(\nabla w(x) = x\). The Bregman divergence becomes:

\[V_x(y) = \frac{1}{2}\|y\|^2 - \frac{1}{2}\|x\|^2 - x^\top(y-x) = \frac{1}{2}(\|y\|^2 - \|x\|^2 - 2x^\top y + 2\|x\|^2) = \frac{1}{2}(\|x\|^2 - 2x^\top y + \|y\|^2) = \frac{1}{2}\|x-y\|^2. \]

In this case, Bregman divergence recovers half the squared Euclidean distance.

Intuition: The core idea is to perform a gradient descent step not in the original space (primal space), but in a "dual space" defined by the gradient of \(w(x)\), and then map the result back.

  1. Map to Dual Space: \(x_k \mapsto \nabla w(x_k)\).
  2. Perform GD Step in Dual Space: \(\nabla w(x_k) - \alpha \nabla f(x_k)\).
  3. Map Back to Primal Space: Find \(x_{k+1}\) such that \(\nabla w(x_{k+1})\) equals the result from the previous step.

This entire process is encapsulated in a single update rule called the mirror step:

\[x_{k+1} = \text{Mirr}_{x_k}(\alpha \nabla f(x_k)) := \arg\min_y \left\{ V_{x_k}(y) + \alpha \nabla f(x_k)^\top(y-x_k) \right\}. \]
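
A minimal sketch of the mirror step for one standard choice not derived above: the negative-entropy distance generating function \(w(x) = \sum_i x_i \log x_i\) on the probability simplex, for which the mirror step reduces to a multiplicative ("exponentiated gradient") update:

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha, steps):
    # x0 must lie on the probability simplex (nonnegative, sums to 1).
    x = np.array(x0, dtype=float)
    iterates = []
    for _ in range(steps):
        iterates.append(x.copy())
        x = x * np.exp(-alpha * grad_f(x))  # gradient step in the dual space
        x = x / x.sum()                     # map back to the simplex
    return np.mean(iterates, axis=0)        # average iterate, as in the analysis
```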

Mirror Descent Convergence Analysis

Lemma. If \(f\) is a convex and \(\rho\)-Lipschitz function, then running Mirror Descent for \(T\) steps with an appropriate step-size \(\alpha\) on the average iterate \(\bar{x}_T = \frac{1}{T}\sum_{k=0}^{T-1} x_k\) gives

\[f(\bar{x}_T) - f(x^*) \le O\left(\frac{1}{\sqrt{T}}\right). \]

Proof. The proof relies on establishing a key one-step inequality that relates the progress in the objective function to the change in Bregman divergence.

We want to bound \(\alpha(f(x_k) - f(u))\) for any vector \(u\) (we will later set \(u=x^*\)).
By convexity of \(f\),

\[\alpha(f(x_k) - f(u)) \le \alpha \nabla f(x_k)^\top(x_k - u). \]

We split the inner product by introducing \(x_{k+1}\):

\[= \alpha \nabla f(x_k)^\top(x_k - x_{k+1}) + \alpha \nabla f(x_k)^\top(x_{k+1} - u). \]

From the optimality condition of the mirror step argmin, the gradient with respect to \(y\) at \(y=x_{k+1}\) is zero:

\[\nabla_y V_{x_k}(y)|_{y=x_{k+1}} + \alpha \nabla f(x_k) = 0 \implies \alpha \nabla f(x_k) = -\nabla_{x_{k+1}} V_{x_k}(x_{k+1}). \]

Substituting this into the second term:

\[= \alpha \nabla f(x_k)^\top(x_k - x_{k+1}) - (\nabla_{x_{k+1}} V_{x_k}(x_{k+1}))^\top(x_{k+1} - u). \]

Now we use a key property of Bregman divergence (the "triangle equality"): \(-(\nabla_y V_x(y))^\top(y-u) = V_x(u) - V_y(u) - V_x(y)\).

\[= \alpha \nabla f(x_k)^\top(x_k - x_{k+1}) + V_{x_k}(u) - V_{x_{k+1}}(u) - V_{x_k}(x_{k+1}). \]

Rearranging terms gives:

\[= \left( \alpha \nabla f(x_k)^\top(x_k - x_{k+1}) - V_{x_k}(x_{k+1}) \right) + \left( V_{x_k}(u) - V_{x_{k+1}}(u) \right). \]

The first parenthesis can be bounded. Since \(V_{x_k}(x_{k+1}) \ge \frac{1}{2}\|x_k - x_{k+1}\|^2\), and by Cauchy-Schwarz inequality, we have:

\[\alpha \nabla f(x_k)^\top(x_k - x_{k+1}) - V_{x_k}(x_{k+1}) \le \alpha \|\nabla f(x_k)\| \|x_k - x_{k+1}\| - \frac{1}{2}\|x_k - x_{k+1}\|^2 \le \frac{\alpha^2}{2}\|\nabla f(x_k)\|^2. \]

The last inequality comes from finding the maximum of the quadratic \(az - z^2/2\), which is \(a^2/2\).
Thus, we arrive at the crucial one-step inequality:

\[\alpha(f(x_k) - f(u)) \le \frac{\alpha^2}{2}\|\nabla f(x_k)\|^2 + V_{x_k}(u) - V_{x_{k+1}}(u). \]

We now sum this inequality from \(k=0\) to \(T-1\) and set \(u=x^*\):

\[\alpha \sum_{k=0}^{T-1}(f(x_k) - f(x^*)) \le \frac{\alpha^2}{2} \sum_{k=0}^{T-1} \|\nabla f(x_k)\|^2 + \sum_{k=0}^{T-1} (V_{x_k}(x^*) - V_{x_{k+1}}(x^*)). \]

The last term is a telescoping sum which simplifies to \(V_{x_0}(x^*) - V_{x_T}(x^*)\). Since Bregman divergence is non-negative, \(V_{x_T}(x^*) \ge 0\), so we can drop it to get an upper bound:

\[\alpha \sum_{k=0}^{T-1}(f(x_k) - f(x^*)) \le \frac{\alpha^2}{2} \sum_{k=0}^{T-1} \|\nabla f(x_k)\|^2 + V_{x_0}(x^*). \]

Let \(\bar{x}_T = \frac{1}{T}\sum_{k=0}^{T-1} x_k\). By convexity and Jensen's inequality, \(T(f(\bar{x}_T) - f(x^*)) \le \sum_{k=0}^{T-1}(f(x_k) - f(x^*))\).

\[\alpha T(f(\bar{x}_T) - f(x^*)) \le \frac{\alpha^2}{2} \sum_{k=0}^{T-1} \|\nabla f(x_k)\|^2 + V_{x_0}(x^*). \]

We assume \(f\) is \(\rho\)-Lipschitz, which for a convex function implies \(\|\nabla f(x_k)\| \le \rho\) for all \(k\).

\[\alpha T(f(\bar{x}_T) - f(x^*)) \le \frac{\alpha^2 T \rho^2}{2} + V_{x_0}(x^*). \]

Dividing by \(\alpha T\):

\[f(\bar{x}_T) - f(x^*) \le \frac{\alpha \rho^2}{2} + \frac{V_{x_0}(x^*)}{\alpha T}. \]

Let the initial "distance" \(V_{x_0}(x^*) = \Theta\). To minimize this bound, we balance the two terms by setting \(\alpha = \sqrt{\frac{2\Theta}{\rho^2 T}}\). Substituting this \(\alpha\) gives:

\[f(\bar{x}_T) - f(x^*) \le \frac{\rho^2}{2}\sqrt{\frac{2\Theta}{\rho^2 T}} + \frac{\Theta}{T}\sqrt{\frac{\rho^2 T}{2\Theta}} = \sqrt{\frac{\rho^2 \Theta}{2T}} + \sqrt{\frac{\rho^2 \Theta}{2T}} = \sqrt{\frac{2\rho^2 \Theta}{T}}. \]

This shows the convergence rate is \(O(1/\sqrt{T})\).

Lec 4

Matrix Completion

Idea: The goal is to fill in the missing entries of a matrix based on a small subset of known entries. The motivating example is the Netflix Prize, where the task was to predict user ratings for movies they hadn't seen, based on the ratings they had provided. This is also known as collaborative filtering.

Let \(A\) be an \(n \times m\) matrix (e.g., \(n\) users, \(m\) movies) where only entries in a set of indices \(\Omega\) are known. We want to recover the full matrix \(A\).

This is generally an impossible problem unless we make some assumptions about the structure of the matrix \(A\).

  • Low Rank: The matrix \(A\) is assumed to have a low rank, \(r \ll \min(n, m)\). The intuition is that user preferences and movie characteristics are not independent but are driven by a small number of latent factors (e.g., genres, actors, user taste profiles). A low-rank matrix can be expressed as a product of two smaller matrices, \(A = UV^T\), where \(U\) is \(n \times r\) and \(V\) is \(m \times r\).
  • Uniformly Distributed Entries: The known entries in \(\Omega\) are sampled uniformly at random. This prevents worst-case scenarios where entire rows or columns are missing, which would make recovery impossible.
  • Incoherence: The information in the matrix is "spread out" rather than concentrated in a few rows or columns. This ensures that the singular vectors of the matrix are not sparse.

Solution 1: Convex Relaxation

Idea: The most direct approach is to find the matrix \(X\) with the minimum possible rank that agrees with the known entries of \(A\).

\[\min_X \text{rank}(X) \quad \text{s.t.} \quad X_{ij} = A_{ij} \text{ for } (i,j) \in \Omega \]

However, minimizing rank is an NP-hard problem. The standard approach is to relax the rank function to its closest convex surrogate, the nuclear norm \(\|X\|_*\) (the sum of singular values).

\[\min_X \|X\|_* \quad \text{s.t.} \quad X_{ij} = A_{ij} \text{ for } (i,j) \in \Omega \]

While convex, this formulation leads to a Semidefinite Program (SDP), which is computationally too expensive for very large matrices like the one in the Netflix challenge.

Solution 2: Non-Convex Formulation via Alternating Minimization

Idea: Directly embrace the low-rank assumption by seeking factors \(U\) and \(V\) such that \(A \approx UV^T\). We can formulate this as a non-convex optimization problem that minimizes the error on the known entries.

Let \(P_\Omega(X)\) be a projection operator that keeps entries of \(X\) in \(\Omega\) and sets others to zero. The objective is:

\[\min_{U \in \mathbb{R}^{n \times r}, V \in \mathbb{R}^{m \times r}} \|P_\Omega(UV^T) - P_\Omega(A)\|_F^2 \]

This objective is non-convex with respect to \(U\) and \(V\) jointly. However, if we fix one matrix and optimize the other, the problem becomes a convex least-squares problem. This leads to the alternating minimization algorithm:

  • For \(t = 0, 1, 2, \dots\)
    1. Fix \(U^t\), solve for \(V^{t+1}\):

      \[V^{t+1} \leftarrow \arg\min_V \|P_\Omega(U^t V^T) - P_\Omega(A)\|_F^2 \]

    2. Fix \(V^{t+1}\), solve for \(U^{t+1}\):

      \[U^{t+1} \leftarrow \arg\min_U \|P_\Omega(U (V^{t+1})^T) - P_\Omega(A)\|_F^2 \]

In practice, this approach is much faster than convex relaxation and was a key component of the winning Netflix Prize solutions.
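
A minimal alternating least-squares sketch; the small ridge term `lam` is an added numerical-stability device, not part of the formulation above:

```python
import numpy as np

def alternating_minimization(A, mask, r, iters=50, lam=1e-6):
    # mask is a boolean matrix marking the observed entries Omega.
    n, m = A.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(n, r)), rng.normal(size=(m, r))
    for _ in range(iters):
        for j in range(m):  # fix U, solve each row of V by least squares
            rows = np.where(mask[:, j])[0]
            Uo = U[rows]
            V[j] = np.linalg.solve(Uo.T @ Uo + lam * np.eye(r), Uo.T @ A[rows, j])
        for i in range(n):  # fix V, solve each row of U by least squares
            cols = np.where(mask[i, :])[0]
            Vo = V[cols]
            U[i] = np.linalg.solve(Vo.T @ Vo + lam * np.eye(r), Vo.T @ A[i, cols])
    return U @ V.T
```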

Non-Convex Optimization and Saddle Points

The success of methods like Alternating Minimization raises a question: why do non-convex objectives work so well in modern machine learning?

Idea: Many non-convex problems in ML have a benign landscape. The key assumption is that there are no spurious local minima, meaning all local minima are also global minima. In this setting, the only remaining obstacles for optimization algorithms are saddle points.

For a stationary point \(w\) where the gradient \(\nabla f(w) = 0\), we can classify it using the Hessian matrix \(\nabla^2 f(w)\):

  • If \(\nabla^2 f(w) \succ 0\) (all eigenvalues are positive), then \(w\) is a local minimum.
  • If \(\nabla^2 f(w) \prec 0\) (all eigenvalues are negative), then \(w\) is a local maximum.
  • If \(\nabla^2 f(w)\) has both positive and negative eigenvalues, then \(w\) is a strict saddle point. There exists at least one direction of negative curvature to escape.
  • If \(\nabla^2 f(w) \succeq 0\) but is not positive definite (it has some zero eigenvalues), then \(w\) is a local minimum or a flat saddle point. There is no obvious direction of escape.
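
A minimal sketch of this classification from the Hessian eigenvalues (the tolerance is an illustrative numerical choice):

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "strict saddle point"
    return "local minimum or flat saddle (degenerate)"

# Example: f(x, y) = x^2 - y^2 has a strict saddle at the origin.
print(classify_stationary_point(np.diag([2.0, -2.0])))
```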

We assume that the loss functions we deal with are strict saddle, meaning they do not contain any flat saddle points. The main challenge is to show that algorithms like SGD can efficiently escape strict saddle points.

Theorem (Informal). If a function \(f\) is smooth, bounded, strict saddle, has a smooth Hessian, and if the SGD noise has non-negligible variance in every direction with constant probability, then SGD will escape all saddle points and local maxima, and converge to a local minimum after a polynomial number of steps.

Proof Outline. The proof analyzes the behavior of SGD at a point \(w_0\) by considering three cases.

  • Case 1: The gradient is large, \(\| \nabla f(w_0) \|\) is big.
    Because the function is smooth, a large gradient guarantees that a step of gradient descent will make significant progress in decreasing the function value. Specifically, there exists a constant \(c > 0\) such that \(E[f(w_1)] \le f(w_0) - c\).

  • Case 2: We are close to a local minimum.
    A local minimum is an attractive basin. Once the iterate is close to it, SGD will not get out with high probability, because the stochastic noise is typically not large enough to overcome the "walls" of the basin. The algorithm effectively gets trapped, which is the desired behavior.

  • Case 3: We are close to a strict saddle point.
    Here, the gradient \(\| \nabla f(w_0) \|\) is small, but we are not near a minimum. Since it's a strict saddle point, there exists a negative direction (an escape direction associated with a negative eigenvalue of the Hessian). The random perturbation from the SGD noise will give a positive projection on this negative direction with high probability. This "nudge" is enough for the iterates to follow this direction and escape the saddle, after which the gradient will become large again (returning to Case 1). The smoothness of the Hessian ensures this escape is stable.

Lec 5

Generalization Theory

Central question: If a model performs well on training data (low training loss \(L_{train}\)), can we guarantee it will also perform well on new data (low population loss \(L_D\))?

Assume training samples are drawn i.i.d. from the true data distribution \(D\). Three key concepts: No Free Lunch Theorem, PAC Learning, VC Dimension.

No Free Lunch Theorem

Theorem (No-Free-Lunch). Let \(\mathcal{A}\) be any learning algorithm for binary classification with respect to the 0-1 loss over domain \(X\). Let \(m < |X|/2\) be the training set size. Then there exists a distribution \(D\) over \(X \times \{0,1\}\) such that:

  1. There exists a function \(f: X \to \{0,1\}\) with \(L_D(f) = 0\).
  2. With probability at least \(\frac{1}{7}\) over the choice of training set \(S \sim D^m\), the hypothesis returned by the algorithm satisfies \(L_D(\mathcal{A}(S)) \ge \frac{1}{8}\).

Proof. Consider a subset \(C \subset X\) with \(|C|=2m\). There are \(T = 2^{2m}\) possible functions \(f_1, \dots, f_T\) from \(C\) to \(\{0,1\}\). For each function \(f_i\), define distribution \(D_i\) over \(C \times \{0,1\}\):

\[D_i(\{(x,y)\}) = \begin{cases} 1/|C| & \text{if } y=f_i(x) \\ 0 & \text{otherwise} \end{cases} \]

Under this distribution \(L_{D_i}(f_i) = 0\).

Goal: prove \(\max_{i \in [T]} \mathbb{E}_{S \sim D_i^m}[L_{D_i}(\mathcal{A}(S))] \ge \frac{1}{4}\).

Since the maximum dominates the average,

\[\max_{i \in [T]} \mathbb{E}[L_{D_i}(\mathcal{A}(S))] \ge \frac{1}{T} \sum_{i=1}^T \mathbb{E}_{S \sim D_i^m}[L_{D_i}(\mathcal{A}(S))]. \]

Swapping summation and expectation:

\[= \frac{1}{k} \sum_{j=1}^k \left( \frac{1}{T} \sum_{i=1}^T L_{D_i}(\mathcal{A}(S_j^i)) \right), \]

where \(S_j\), for \(j = 1, \dots, k\) with \(k = (2m)^m\), ranges over all possible sequences of \(m\) unlabeled instances from \(C\), and \(S_j^i\) is that sequence labeled according to \(f_i\).

Fix unlabeled sequence \(S_j = (x_1, \dots, x_m)\). Let \(v_1, \dots, v_p\) be instances in \(C\) not appearing in \(S_j\), then \(p \ge m\). Loss of hypothesis \(h\) is at least its error on unseen points:

\[L_{D_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{I}[h(x) \ne f_i(x)] \ge \frac{1}{2m} \sum_{r=1}^p \mathbb{I}[h(v_r) \ne f_i(v_r)]. \]

Fix \(S_j\) and unseen instance \(v_r \in C\). Partition the \(T\) functions into \(T/2\) disjoint pairs \((f_i, f_{i'})\) such that they agree on all points except \(v_r\):

\[f_i(c) \ne f_{i'}(c) \iff c=v_r. \]

Since \(v_r\) is not in the training sequence, \(S_j^i = S_j^{i'}\). Any deterministic algorithm \(\mathcal{A}\) produces the same hypothesis \(h = \mathcal{A}(S_j^i) = \mathcal{A}(S_j^{i'})\).

Consider prediction of \(h\) on \(v_r\). Since \(f_i(v_r) \ne f_{i'}(v_r)\), \(h\) must be wrong for exactly one of them:

\[\mathbb{I}[h(v_r) \ne f_i(v_r)] + \mathbb{I}[h(v_r) \ne f_{i'}(v_r)] = 1. \]

Averaging over pairs, error rate on \(v_r\) is \(1/2\). Thus average error over all \(T\) functions on \(v_r\) is exactly \(1/2\):

\[\frac{1}{T} \sum_{i=1}^T \mathbb{I}[\mathcal{A}(S_j^i)(v_r) \ne f_i(v_r)] = \frac{1}{2}. \]

Average loss over all functions for fixed instance set \(S_j\):

\[\begin{aligned} \frac{1}{T} \sum_{i=1}^T L_{D_i}(\mathcal{A}(S_j^i)) \ge& \frac{1}{T} \sum_{i=1}^T \left( \frac{1}{2m} \sum_{r=1}^p \mathbb{I}[\mathcal{A}(S_j^i)(v_r) \ne f_i(v_r)] \right) \\ =& \frac{1}{2m} \sum_{r=1}^p \left( \frac{1}{T} \sum_{i=1}^T \mathbb{I}[\mathcal{A}(S_j^i)(v_r) \ne f_i(v_r)] \right) \\ =& \frac{1}{2m} \sum_{r=1}^p \frac{1}{2} \ge \frac{m}{2m} \cdot \frac{1}{2} = \frac{1}{4}. \end{aligned}\]

PAC Learning

Hypothesis Class \(H\): The set of all possible functions the learning algorithm can choose from.

Empirical Risk Minimization (ERM): Select the hypothesis with minimum error on training set \(S\).

\[\text{ERM}_H(S) \in \arg\min_{h \in H} L_S(h). \]

Realizability Assumption: There exists a perfect hypothesis \(h^* \in H\) such that \(L_{D,f}(h^*) = 0\).

Corollary. Let \(H\) be a finite hypothesis class, \(\delta \in (0,1)\) and \(\epsilon > 0\). Let \(m\) satisfy

\[m \ge \frac{\log(|H|/\delta)}{\epsilon}. \]

Then for any labeling function \(f\) and distribution \(D\) satisfying the realizability assumption, with probability at least \(1-\delta\) (over i.i.d. sample \(S\) of size \(m\)), every ERM hypothesis \(h_S\) satisfies

\[L_{D,f}(h_S) \le \epsilon. \]

Proof. Define "bad" hypotheses: \(H_B = \{h \in H : L_{D,f}(h) > \epsilon\}\). Define "misleading" samples: \(M = \{S : \exists h \in H_B, L_S(h) = 0 \}\).

Under realizability assumption, ERM algorithm always finds \(h_S\) with \(L_S(h_S)=0\). Algorithm "fails" if and only if \(h_S\) is a bad hypothesis, i.e., \(L_{D,f}(h_S) > \epsilon\). This can only occur when \(S\) is misleading. Thus

\[\{S | L_{D,f}(h_S) > \epsilon\} \subseteq M. \]

Failure probability is bounded by probability of misleading samples:

\[D^m(M) = D^m\left(\bigcup_{h \in H_B} \{S : L_S(h)=0\}\right) \le \sum_{h \in H_B} D^m(\{S : L_S(h)=0\}). \]

For a single bad hypothesis \(h \in H_B\), we have \(L_{D,f}(h) > \epsilon\), so \(P(h(x)=y) = 1 - L_{D,f}(h) < 1-\epsilon\). Probability of being correct on all \(m\) i.i.d. samples is

\[D^m(\{S : L_S(h)=0\}) = (P(h(x)=y))^m < (1-\epsilon)^m. \]

Total failure probability is

\[P(\text{failure}) \le \sum_{h \in H_B} (1-\epsilon)^m = |H_B|(1-\epsilon)^m \le |H|(1-\epsilon)^m \le |H|e^{-\epsilon m}. \]

Requiring this probability to be at most \(\delta\):

\[|H|e^{-\epsilon m} \le \delta \implies m \ge \frac{\ln(|H|/\delta)}{\epsilon}. \]
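
A quick numerical reading of this bound (the example values are illustrative):

```python
import numpy as np

def sample_complexity(H_size, eps, delta):
    # m >= ln(|H| / delta) / eps  (finite class, realizable case)
    return int(np.ceil(np.log(H_size / delta) / eps))

# |H| = 10**6, eps = 0.01, delta = 0.05  ->  1682 samples suffice
print(sample_complexity(10**6, 0.01, 0.05))
```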

Definition (PAC Learnability). A hypothesis class \(H\) is PAC learnable if there exist a function \(m_H: (0,1)^2 \to \mathbb{N}\) (sample complexity) and a learning algorithm such that for every \(\epsilon, \delta \in (0,1)\), every distribution \(D\) and labeling function \(f\) (satisfying realizability assumption), running the algorithm on \(m \ge m_H(\epsilon, \delta)\) i.i.d. samples returns a hypothesis \(h\) such that with probability at least \(1-\delta\)

\[L_{D,f}(h) \le \epsilon. \]

Agnostic PAC Learning: Without assuming realizability.

Definition (Agnostic PAC Learnability). A hypothesis class \(H\) is agnostic PAC learnable if there exist \(m_H\) and an algorithm such that for every \(\epsilon, \delta \in (0,1)\) and every distribution \(D\) (without realizability assumption), running the algorithm on \(m \ge m_H(\epsilon, \delta)\) samples returns a hypothesis \(h\) such that with probability at least \(1-\delta\)

\[L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon. \]

Error Decomposition: Error of learned hypothesis \(h_S\) decomposes as

\[L_D(h_S) = \epsilon_{app} + \epsilon_{est}, \]

where \(\epsilon_{app} = \min_{h \in H} L_D(h)\) is approximation error (inherent limitation of model class), \(\epsilon_{est} = L_D(h_S) - \epsilon_{app}\) is estimation error (due to finite samples).

VC Dimension

For infinite hypothesis classes (e.g., linear classifiers), how to analyze learnability? VC dimension measures the "effective size" or "complexity" of a hypothesis class.

Restriction of \(H\) to \(C\): Given point set \(C = \{c_1, \dots, c_m\}\), the restriction \(H_C\) is the set of all binary labelings on these points that can be produced by \(H\):

\[H_C = \{ (h(c_1), \dots, h(c_m)) : h \in H \}. \]

Shattering: A hypothesis class \(H\) shatters a set \(C\) if it can generate all possible \(2^{|C|}\) labelings on \(C\), i.e., \(|H_C| = 2^{|C|}\).

Definition (VC Dimension). The VC dimension of hypothesis class \(H\), denoted \(\text{VCdim}(H)\), is the size of the largest finite point set that can be shattered by \(H\). If \(H\) can shatter arbitrarily large sets, its VC dimension is infinite.

Example: Threshold functions \(H = \{h_a(x) = \mathbb{I}[x<a] : a \in \mathbb{R}\}\) with \(|H|=\infty\) is PAC learnable.

Lemma. For threshold functions \(H = \{h_a(x) = \mathbb{I}[x<a] : a \in \mathbb{R}\}\), sample complexity satisfies \(m_H(\epsilon, \delta) \le \frac{\log(2/\delta)}{\epsilon}\).

Proof. Under realizability, there exists optimal threshold \(a^*\) with \(L_D(h_{a^*}) = 0\). Given training set \(S = \{(x_i, y_i)\}_{i=1}^m\), ERM finds threshold \(b_S\) satisfying

\[\max\{x_i : (x_i, 1) \in S\} < b_S \le \min\{x_i : (x_i, 0) \in S\}. \]

Let \(b_0 = \max\{x_i : (x_i, 1) \in S\}\) and \(b_1 = \min\{x_i : (x_i, 0) \in S\}\), so \(b_S \in (b_0, b_1]\).

Define "safe region" \([a_0, a_1]\) around \(a^*\) where \(P_{x \sim D_X}[x \in (a_0, a^*]] = \epsilon\) and \(P_{x \sim D_X}[x \in (a^*, a_1)] = \epsilon\).

If \(b_0 \ge a_0\) and \(b_1 \le a_1\), then for any \(b_S \in (b_0, b_1]\):

\[L_D(h_{b_S}) = \begin{cases} P(a^* < x \le b_S) \le \epsilon & \text{if } b_S > a^* \\ P(b_S < x \le a^*) \le \epsilon & \text{if } b_S \le a^* \end{cases}. \]

Algorithm fails if \(b_0 < a_0\) or \(b_1 > a_1\). By union bound:

\[P[L_D(h_S) > \epsilon] \le P[b_0 < a_0] + P[b_1 > a_1]. \]

Event \(b_0 < a_0\) means no sample falls in \((a_0, a^*)\). Since single sample falls in this region with probability \(\epsilon\), we have \(P[b_0 < a_0] \le (1-\epsilon)^m\).

Similarly, \(b_1 > a_1\) means no sample falls in \((a^*, a_1)\), so \(P[b_1 > a_1] \le (1-\epsilon)^m\).

Total failure probability:

\[P[\text{failure}] \le 2(1-\epsilon)^m \le 2e^{-\epsilon m}. \]

Requiring \(2e^{-\epsilon m} \le \delta\) gives \(m \ge \frac{\ln(2/\delta)}{\epsilon}\).

This shows cardinality \(|H|\) is not the right measure of complexity, and \(\text{VCdim}(H)=1\) for threshold functions.

Theorem. A hypothesis class \(H\) with infinite VC dimension is not PAC learnable.

Thus finite VC dimension is a necessary condition for PAC learnability.

Theorem (The Fundamental Theorem of Statistical Learning). Let \(H\) be a hypothesis class with \(\text{VCdim}(H) = d < \infty\). There exist absolute constants \(C_1, C_2\) such that:

  1. \(H\) is agnostic PAC learnable with sample complexity \(m_H(\epsilon, \delta)\) satisfying

    \[C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_H(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}. \]

  2. \(H\) is PAC learnable (under realizability assumption) with sample complexity satisfying

    \[C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m_H(\epsilon, \delta) \le C_2 \frac{d\log(1/\epsilon) + \log(1/\delta)}{\epsilon}. \]

VC dimension \(d\) replaces \(\log|H|\) as the key measure of complexity, elegantly characterizing learnability for both finite and infinite hypothesis classes.

Lec 6

Linear Regression

Given dataset \((x_i, y_i)\) where \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\). Model: \(f_{w,b}(x) = w^\top x + b\). Absorb bias into weight by augmenting \(x \to [x, 1]\) and \(w \to [w, b]\), simplifying to \(f(x) = w^\top x\).

Squared loss: \(L(w, X, Y) = \frac{1}{2N} \sum_{i=1}^N (w^\top x_i - y_i)^2\).

Closed form solution: \(w^* = (X^\top X)^{-1} X^\top y\) when \(X^\top X\) is invertible.

Gradient: \(\nabla_w L(w, X, Y) = \frac{1}{N} \sum_{i=1}^N (w^\top x_i - y_i)x_i\).

Gradient descent update: \(w_{t+1} = w_t - \frac{\eta}{N} \sum_i (w_t^\top x_i - y_i)x_i\). Since loss is convex, converges to global minimum.
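
A minimal sketch of both solvers, assuming `X` is the \(N \times d\) design matrix with the bias column already appended (the step size is illustrative and must respect the smoothness condition from Lec 2):

```python
import numpy as np

def fit_closed_form(X, y):
    # w* = (X^T X)^{-1} X^T y; lstsq is the numerically safer equivalent.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_gradient_descent(X, y, eta=0.1, steps=1000):
    # Gradient descent on the averaged squared loss.
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        w = w - (eta / N) * (X.T @ (X @ w - y))
    return w
```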

Classification and Perceptron

For discrete labels, use \(f(x) = \text{sign}(w^\top x)\). Sign function not differentiable, so standard gradient descent fails.

Perceptron algorithm: Initialize \(w\) randomly. Repeat until no misclassifications: pick random \((x_i, y_i)\) with \(y_i \in \{-1, +1\}\). If \(y_i(w^\top x_i) \le 0\), update \(w \leftarrow w + y_i x_i\).
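
A minimal sketch of the algorithm, assuming labels \(y_i \in \{-1, +1\}\) stored in a NumPy array:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(X)):
            if y[i] * (w @ X[i]) <= 0:   # misclassified (or on the boundary)
                w = w + y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # separable data: a clean full pass ends training
            break
    return w
```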

Theorem (Perceptron Convergence). Assume data is linearly separable. There exists \(w^*\) with \(\|w^*\| = 1\) and margin \(\gamma > 0\) such that \(y_i(w^{*\top} x_i) \ge \gamma\) for all \(i\). Assume \(\|x_i\| \le R\) for all \(i\). Then Perceptron makes at most \(\frac{R^2}{\gamma^2}\) mistakes.

Proof. Let \(w_t\) be weight after \(t\)-th mistake, starting with \(w_0 = 0\). Update rule: \(w_{t+1} = w_t + y_t x_t\).

Lower bound: \(\langle w_{t+1}, w^* \rangle = \langle w_t, w^* \rangle + y_t \langle x_t, w^* \rangle \ge \langle w_t, w^* \rangle + \gamma\). By induction, \(\langle w_t, w^* \rangle \ge t\gamma\). By Cauchy-Schwarz, \(\|w_t\| \ge \langle w_t, w^* \rangle \ge t\gamma\), so \(\|w_t\|^2 \ge t^2\gamma^2\).

Upper bound: \(\|w_{t+1}\|^2 = \|w_t + y_t x_t\|^2 = \|w_t\|^2 + \|y_t x_t\|^2 + 2\langle y_t x_t, w_t \rangle\). Since update occurs on mistake, \(y_t(w_t^\top x_t) \le 0\), so \(2\langle y_t x_t, w_t \rangle \le 0\). Thus \(\|w_{t+1}\|^2 \le \|w_t\|^2 + \|x_t\|^2 \le \|w_t\|^2 + R^2\). By induction, \(\|w_t\|^2 \le tR^2\).

Combining bounds: \(t^2\gamma^2 \le \|w_t\|^2 \le tR^2\) implies \(t \le \frac{R^2}{\gamma^2}\).

Logistic Regression

Output probability instead of hard classification. Sigmoid function: \(g(z) = \frac{1}{1+e^{-z}}\).

Model: \(f(x) = g(w^\top x) = \frac{1}{1+e^{-w^\top x}} \in (0,1)\).

Decision boundary: \(f(x) \ge 0.5 \iff w^\top x \ge 0\). Boundary is hyperplane \(w^\top x = 0\).

Cross-entropy loss: For true distribution \(y\) and predicted distribution \(p\), \(XE(y, p) = -\sum_i y_i \log p_i\). Entropy: \(H(y) = -\sum_i y_i \log y_i\). Property: \(XE(y, p) \ge H(y)\) with equality iff \(p_i = y_i\) for all \(i\). KL divergence: \(KL(y \| p) = XE(y,p) - H(y)\). Cross-entropy is asymmetric and provides large gradients for confident incorrect predictions.
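
A minimal sketch of the model, the binary cross-entropy, and one gradient step, assuming labels \(y_i \in \{0, 1\}\) (note the perceptron section uses \(\{-1, +1\}\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    # XE(y, p) = -[y log p + (1 - y) log(1 - p)] for y in {0, 1}.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def logistic_gradient_step(w, X, y, eta):
    # Gradient of the averaged cross-entropy loss: (1/N) X^T (sigmoid(Xw) - y).
    N = len(y)
    return w - (eta / N) * (X.T @ (sigmoid(X @ w) - y))
```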

Feature Learning

Linear models can learn arbitrary functions given correct features. Example: to learn \(y = 5x_1^3 + 6x_2x_6 + 12x_4\), define features \(z_1 = x_1^3\), \(z_2 = x_2x_6\), \(z_3 = x_4\), then \(y = 5z_1 + 6z_2 + 12z_3\) is linear in \(z\). Feature engineering is hard. Deep learning learns features automatically via representation learning.

Regularization

Prevent overfitting by penalizing model complexity. Add penalty term to loss.

Ridge regression (L2 regularization): \(L_{\text{Ridge}} = \frac{1}{2N}\sum_i(w^\top x_i - y_i)^2 + \frac{\lambda}{2}\|w\|_2^2\).

Gradient: \(\nabla_w L = \frac{1}{N}\sum_i(w^\top x_i - y_i)x_i + \lambda w\).

Update rule: \(w_{t+1} = w_t - \eta \nabla_w L = w_t(1-\eta\lambda) - \frac{\eta}{N}\sum_i(w_t^\top x_i - y_i)x_i\). First term is weight decay, shrinking weights towards zero. Ridge produces dense solutions.

LASSO regression (L1 regularization): \(L_{\text{LASSO}} = \frac{1}{2N}\sum_i(w^\top x_i - y_i)^2 + \lambda \|w\|_1\) where \(\|w\|_1 = \sum_j |w_j|\).

L1 norm is convex relaxation of L0 norm (number of non-zero entries). L0 minimization is NP-hard. L1 not differentiable at zero, use proximal gradient descent.

Update: (1) Gradient step on smooth part: \(\widetilde{w}_{t+1} = w_t - \frac{\eta}{N}\sum_i(w_t^\top x_i - y_i)x_i\). (2) Soft-thresholding: for each coordinate \(j\),

\[w_{t+1,j} = \begin{cases} \widetilde{w}_{t+1,j} - \eta\lambda & \text{if } \widetilde{w}_{t+1,j} > \eta\lambda \\ 0 & \text{if } |\widetilde{w}_{t+1,j}| \le \eta\lambda \\ \widetilde{w}_{t+1,j} + \eta\lambda & \text{if } \widetilde{w}_{t+1,j} < -\eta\lambda \end{cases}\]
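
A minimal sketch of one proximal gradient (ISTA-style) update; the vectorized form \(\mathrm{sign}(w)\max(|w|-\tau, 0)\) is exactly the three-case rule above:

```python
import numpy as np

def soft_threshold(w, tau):
    # Coordinate-wise shrinkage toward zero.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def lasso_proximal_step(w, X, y, eta, lam):
    # (1) gradient step on the smooth squared-loss part,
    # (2) soft-thresholding with threshold eta * lambda.
    N = len(y)
    w_tilde = w - (eta / N) * (X.T @ (X @ w - y))
    return soft_threshold(w_tilde, eta * lam)
```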

Geometric intuition: L2 constraint \(\|w\|_2 \le c\) is circle. L1 constraint \(\|w\|_1 \le c\) is diamond. Optimal solution is where loss contours first touch constraint region. For L1 diamond, this is often at corners where coordinates are zero, producing sparse solutions.

Compressed Sensing

Nyquist-Shannon sampling theorem requires sampling at twice highest frequency. Many signals (images, audio) are sparse in some basis (Fourier, wavelet). If signal is \(s\)-sparse, can recover from fewer samples than Nyquist rate.

Recover signal \(x \in \mathbb{R}^p\) that is \(s\)-sparse (\(\|x\|_0 = s \ll p\)) from \(n\) linear measurements \(y = Ax\) where \(A \in \mathbb{R}^{n \times p}\) with \(n \ll p\). System is underdetermined. Leverage sparsity prior.

Restricted Isometry Property (RIP): Matrix \(A\) satisfies \((s, \epsilon)\)-RIP if for all \(s\)-sparse vectors \(x\),

\[(1 - \epsilon)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \epsilon)\|x\|_2^2. \]

RIP ensures \(A\) approximately preserves lengths of sparse vectors, preventing information loss.

Theorem 1 (L0 Recovery). Let \(\epsilon < 1\) and \(A \in \mathbb{R}^{n \times p}\) be \((2s, \epsilon)\)-RIP matrix. Let \(x\) be \(s\)-sparse vector with \(\|x\|_0 \le s\). Let \(y = Ax\) be measurements. Then \(x\) is unique solution to \(\min_v \|v\|_0\) subject to \(Av = y\).

Proof. By contradiction. Assume \(\tilde{x} \ne x\) is the minimizer. Since \(\tilde{x}\) minimizes L0 norm and satisfies \(A\tilde{x} = y = Ax\), we have \(\|\tilde{x}\|_0 \le \|x\|_0 \le s\).

Let \(h = \tilde{x} - x \ne 0\). Then \(\|h\|_0 \le \|\tilde{x}\|_0 + \|x\|_0 \le 2s\), so \(h\) is \(2s\)-sparse.

By \((2s, \epsilon)\)-RIP applied to \(h\):

\[(1 - \epsilon)\|h\|_2^2 \le \|Ah\|_2^2. \]

But \(Ah = A(\tilde{x} - x) = A\tilde{x} - Ax = y - y = 0\), so \(\|Ah\|_2^2 = 0\).

Thus \((1 - \epsilon)\|h\|_2^2 \le 0\). Since \(\epsilon < 1\) implies \(1 - \epsilon > 0\), we must have \(\|h\|_2^2 = 0\), so \(h = 0\). This contradicts \(h \ne 0\). Therefore \(\tilde{x} = x\).

L0 minimization is NP-hard in general, making direct optimization computationally intractable for large problems. This motivates convex relaxation.

Theorem 2 (L1 Recovery). Under same RIP conditions as Theorem 1, the L0 problem is equivalent to tractable L1 minimization: \(\min_v \|v\|_1\) subject to \(Av = y\). This is convex and can be solved efficiently using standard convex optimization algorithms.
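A minimal sketch of this L1 minimization as a linear program: split \(v = v^+ - v^-\) with \(v^+, v^- \ge 0\) and minimize \(\mathbf{1}^\top(v^+ + v^-)\) subject to \(A(v^+ - v^-) = y\). The Gaussian measurement matrix and problem sizes below are illustrative assumptions (Gaussian measurements are justified by Theorem 4 below).

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||v||_1 s.t. Av = y as an LP in (v_plus, v_minus) >= 0."""
    n, p = A.shape
    c = np.ones(2 * p)                       # objective: sum(v_plus + v_minus)
    A_eq = np.hstack([A, -A])                # A (v_plus - v_minus) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

# Illustrative synthetic sparse-recovery instance (sizes are assumptions).
rng = np.random.default_rng(0)
p, s, n = 200, 5, 60
A = rng.normal(size=(n, p)) / np.sqrt(n)
x = np.zeros(p)
x[rng.choice(p, s, replace=False)] = rng.normal(size=s)
x_hat = basis_pursuit(A, A @ x)
print("recovery error:", np.linalg.norm(x_hat - x))
```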

In fact, a stronger result than Theorem 2 holds even when \(x\) is not exactly sparse but only a sparse vector plus noise.

Theorem 3 (Robust Recovery). Let \(\epsilon < \frac{1}{1+\sqrt{2}}\) and \(A\) be \((2s, \epsilon)\)-RIP matrix. Let \(x\) be arbitrary vector. Let \(x_s \in \arg\min_{v:\|v\|_0\le s}\|x - v\|_1\) be best \(s\)-sparse approximation of \(x\) (keeping \(s\) largest entries, zeroing others). Recover \(x^*\) by solving \(\min_v \|v\|_1\) subject to \(Av = Ax\). Then recovery error satisfies

\[\|x^* - x\|_2 \le \frac{2(1 + \rho)}{(1-\rho)\sqrt{s}} \|x - x_s\|_1 \]

where \(\rho = \frac{\sqrt{2}\epsilon}{1-\epsilon}\). When \(x = x_s\) (\(s\)-sparse), exact recovery \(x^* = x\) holds.

Proof. Let \(h = x^* - x\) be error vector. Goal: bound \(\|h\|_2\).

Partition indices \(\{1, \dots, p\}\): \(T_0\) contains \(s\) largest entries of \(x\) in absolute value, so \(x_s = x_{T_0}\) and \(x - x_s = x_{T_0^c}\). Let \(T_1\) contain \(s\) largest entries of \(h_{T_0^c}\). Define \(T_2, T_3, \dots\) similarly, partitioning \(T_0^c = T_1 \cup T_2 \cup \cdots\) into blocks of size \(s\) ordered by decreasing magnitude of \(h\). Set \(T_{0,1} = T_0 \cup T_1\), which is \(2s\)-sparse.

By triangle inequality: \(\|h\|_2 \le \|h_{T_{0,1}}\|_2 + \|h_{T_{0,1}^c}\|_2\). Bound these separately.

Claim 1 (Tail bound). \(\|h_{T_{0,1}^c}\|_2 \le \|h_{T_0}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1\).

By optimality of L1 solution, \(\|x^*\|_1 \le \|x\|_1\), so \(\|x+h\|_1 \le \|x\|_1\). Expanding:

\[\sum_{i \in T_0} |x_i+h_i| + \sum_{i \in T_0^c} |x_i+h_i| \le \sum_{i \in T_0} |x_i| + \sum_{i \in T_0^c} |x_i|. \]

Using reverse triangle inequality \(|a+b| \ge |a| - |b|\) on \(T_0\):

\[(\|x_{T_0}\|_1 - \|h_{T_0}\|_1) + (\|h_{T_0^c}\|_1 - \|x_{T_0^c}\|_1) \le \|x_{T_0}\|_1 + \|x_{T_0^c}\|_1. \]

Simplifying: \(\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1\).

For \(j \ge 2\), by construction \(\|h_{T_j}\|_\infty \le \frac{1}{s}\|h_{T_{j-1}}\|_1\). Using \(\|v\|_2 \le \sqrt{s}\|v\|_\infty\):

\[\|h_{T_j}\|_2 \le \sqrt{s} \cdot \frac{1}{s}\|h_{T_{j-1}}\|_1 = \frac{1}{\sqrt{s}}\|h_{T_{j-1}}\|_1. \]

Summing tail: \(\|h_{T_{0,1}^c}\|_2 \le \sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{s}}\sum_{j \ge 2} \|h_{T_{j-1}}\|_1 = \frac{1}{\sqrt{s}}\|h_{T_0^c}\|_1\).

Substituting bound for \(\|h_{T_0^c}\|_1\):

\[\|h_{T_{0,1}^c}\|_2 \le \frac{1}{\sqrt{s}}(\|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1). \]

Using Cauchy-Schwarz \(\|h_{T_0}\|_1 \le \sqrt{s}\|h_{T_0}\|_2\):

\[\|h_{T_{0,1}^c}\|_2 \le \|h_{T_0}\|_2 + \frac{2}{\sqrt{s}}\|x_{T_0^c}\|_1 = \|h_{T_0}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1. \]

Claim 2 (Head bound). \(\|h_{T_{0,1}}\|_2 \le \frac{2\rho}{(1-\rho)\sqrt{s}}\|x - x_s\|_1\) where \(\rho = \frac{\sqrt{2}\epsilon}{1-\epsilon}\).

Since \(A(x^*-x) = 0\), we have \(Ah = 0\), so \(Ah_{T_{0,1}} = -Ah_{T_{0,1}^c} = -\sum_{j \ge 2} Ah_{T_j}\).

Lemma (Almost Orthogonality). For \((2s,\epsilon)\)-RIP matrix \(A\) and disjoint index sets \(I, J\) of size at most \(s\), \(|\langle Au_I, Au_J \rangle| \le \epsilon \|u_I\|_2 \|u_J\|_2\).

Proof. WLOG assume \(\|u_I\|_2 = \|u_J\|_2 = 1\). By polarization identity:

\[\langle Au_I, Au_J \rangle = \frac{1}{4} (\|Au_I + Au_J\|_2^2 - \|Au_I - Au_J\|_2^2). \]

Since \(I, J\) disjoint, \(u_I, u_J\) have disjoint support with \(\langle u_I, u_J \rangle = 0\). Both \(u_I + u_J\) and \(u_I - u_J\) are \(2s\)-sparse with \(\|u_I \pm u_J\|_2^2 = \|u_I\|_2^2 + \|u_J\|_2^2 = 2\) by Pythagorean theorem.

Applying \((2s,\epsilon)\)-RIP:

\[\|A(u_I + u_J)\|_2^2 \le (1+\epsilon)\|u_I + u_J\|_2^2 = 2(1+\epsilon), \]

\[\|A(u_I - u_J)\|_2^2 \ge (1-\epsilon)\|u_I - u_J\|_2^2 = 2(1-\epsilon). \]

Substituting into polarization identity:

\[\langle Au_I, Au_J \rangle \le \frac{1}{4}(2(1+\epsilon) - 2(1-\epsilon)) = \frac{1}{4}(4\epsilon) = \epsilon. \]

The same argument with \(u_J\) replaced by \(-u_J\) gives \(\langle Au_I, Au_J \rangle \ge -\epsilon\), so \(|\langle Au_I, Au_J \rangle| \le \epsilon\); undoing the normalization (scaling by \(\|u_I\|_2 \|u_J\|_2\)) gives \(|\langle Au_I, Au_J \rangle| \le \epsilon \|u_I\|_2 \|u_J\|_2\).

Applying RIP to \(h_{T_{0,1}}\) (which is \(2s\)-sparse) and using \(Ah_{T_{0,1}} = -\sum_{j \ge 2} Ah_{T_j}\):

\[(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \|Ah_{T_{0,1}}\|_2^2 = \left\langle Ah_{T_{0,1}}, -\sum_{j \ge 2} Ah_{T_j} \right\rangle = -\sum_{j \ge 2} \left( \langle Ah_{T_0}, Ah_{T_j} \rangle + \langle Ah_{T_1}, Ah_{T_j} \rangle \right). \]

Since \(T_0\), \(T_1\), \(T_j\) are pairwise disjoint sets of size at most \(s\), almost orthogonality applies to each term:

\[(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \epsilon \sum_{j \ge 2} \left( \|h_{T_0}\|_2 + \|h_{T_1}\|_2 \right) \|h_{T_j}\|_2 \le \sqrt{2}\epsilon \|h_{T_{0,1}}\|_2 \sum_{j \ge 2} \|h_{T_j}\|_2, \]

where the last step uses \(\|h_{T_0}\|_2 + \|h_{T_1}\|_2 \le \sqrt{2}\sqrt{\|h_{T_0}\|_2^2 + \|h_{T_1}\|_2^2} = \sqrt{2}\|h_{T_{0,1}}\|_2\).

Dividing by \(\|h_{T_{0,1}}\|_2\) (if \(h_{T_{0,1}} = 0\) the claim is trivial) and using the tail estimate \(\sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{s}}\|h_{T_0^c}\|_1\) from Claim 1: \((1-\epsilon)\|h_{T_{0,1}}\|_2 \le \sqrt{2}\epsilon \sum_{j \ge 2} \|h_{T_j}\|_2 \le \sqrt{2}\epsilon \|h_{T_0^c}\|_1/\sqrt{s}\).

Rearranging and using the earlier bound \(\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1\): \(\|h_{T_{0,1}}\|_2 \le \frac{\sqrt{2}\epsilon}{1-\epsilon} \cdot \frac{1}{\sqrt{s}}\|h_{T_0^c}\|_1 \le \frac{\rho}{\sqrt{s}}(\|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1)\).

Using \(\|h_{T_0}\|_1 \le \sqrt{s}\|h_{T_0}\|_2 \le \sqrt{s}\|h_{T_{0,1}}\|_2\):

\[\|h_{T_{0,1}}\|_2 \le \rho\|h_{T_{0,1}}\|_2 + \frac{2\rho}{\sqrt{s}}\|x_{T_0^c}\|_1. \]

Solving for \(\|h_{T_{0,1}}\|_2\) (valid since \(\rho < 1\) when \(\epsilon < \frac{1}{1+\sqrt{2}}\)): \(\|h_{T_{0,1}}\|_2 \le \frac{2\rho}{(1-\rho)\sqrt{s}}\|x - x_s\|_1\).

Combining Claims 1 and 2:

\[\begin{aligned} \|h\|_2 \le& \|h_{T_{0,1}}\|_2 + \|h_{T_{0,1}^c}\|_2 \\ \le& \|h_{T_{0,1}}\|_2 + \|h_{T_0}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1 \\ \le& 2\|h_{T_{0,1}}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1 \\ \le& 2 \cdot \frac{2\rho}{(1-\rho)\sqrt{s}}\|x - x_s\|_1 + \frac{2}{\sqrt{s}}\|x - x_s\|_1 \\ =& \frac{2}{\sqrt{s}}\left(\frac{2\rho}{1-\rho} + 1\right)\|x - x_s\|_1 \\ =& \frac{2(1+\rho)}{(1-\rho)\sqrt{s}}\|x - x_s\|_1. \end{aligned}\]

Theorem 4 (Random RIP Matrices). Matrix \(A \in \mathbb{R}^{n \times p}\) with entries drawn i.i.d. from Gaussian or Bernoulli distribution is \((s, \epsilon)\)-RIP with high probability if \(n \ge C \cdot s \log(p/s)\) for constant \(C\). Number of measurements is nearly linear in sparsity level.
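A quick empirical sketch of this scaling (the constant \(C = 10\) and the problem sizes are assumptions): for a Gaussian matrix with \(n \approx C s \log(p/s)\) rows, the ratio \(\|Ax\|_2^2 / \|x\|_2^2\) concentrates near 1 over random \(s\)-sparse vectors. This is only a sanity check on random sparse vectors, not a verification of RIP, which quantifies over all sparse vectors.

```python
import numpy as np

# Sketch: check that ||Ax||^2 / ||x||^2 stays close to 1 for random s-sparse
# vectors when A is Gaussian with n ~ C * s * log(p/s) rows. C = 10 assumed.
rng = np.random.default_rng(0)
p, s = 1000, 10
n = int(10 * s * np.log(p / s))
A = rng.normal(size=(n, p)) / np.sqrt(n)     # scaling gives E||Ax||^2 = ||x||^2

ratios = []
for _ in range(2000):
    x = np.zeros(p)
    x[rng.choice(p, s, replace=False)] = rng.normal(size=s)
    ratios.append(np.linalg.norm(A @ x)**2 / np.linalg.norm(x)**2)

print(f"n = {n}, min ratio = {min(ratios):.3f}, max ratio = {max(ratios):.3f}")
```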

Non-linear Compressed Sensing

For signals on low-dimensional manifold learned by generative model \(x = G(z)\) where \(G: \mathbb{R}^k \to \mathbb{R}^p\) is neural network and \(k \ll p\). Recover \(z\) from linear measurements \(y = AG(z)\) by solving \(\min_z \|y - AG(z)\|_2^2\). Generative prior provides strong structural constraint, achieving better reconstruction than traditional compressed sensing. Similar theoretical guarantees exist for non-linear setting.
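A minimal sketch of this recovery-by-latent-optimization idea: gradient descent on \(\|y - AG(z)\|_2^2\) over \(z\). The tiny random two-layer ReLU "generator" below stands in for a trained model, and the measurement count, optimizer, and learning rate are assumptions.

```python
import torch

# Sketch: recover the latent z by minimizing ||y - A G(z)||^2 with a fixed
# random ReLU generator standing in for a trained one. Objective is
# non-convex, so gradient descent may only reach a local minimum.
torch.manual_seed(0)
k, p, n = 5, 200, 40

G = torch.nn.Sequential(                 # stand-in generator G: R^k -> R^p
    torch.nn.Linear(k, 64), torch.nn.ReLU(), torch.nn.Linear(64, p)
)
for param in G.parameters():
    param.requires_grad_(False)          # generator is fixed; only z is trained

A = torch.randn(n, p) / n**0.5           # random linear measurements
z_true = torch.randn(k)
x_true = G(z_true)
y = A @ x_true

z = torch.zeros(k, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.sum((A @ G(z) - y) ** 2)
    loss.backward()
    opt.step()

print("reconstruction error:", torch.norm(G(z) - x_true).item())
```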
