Chapter 6: Other Stuff

Acknowledgment: Most of this material comes from Yuan Yang's course "Machine Learning".

Chapter 6_1: Robust ML

Projected gradient descent (PGD)

PGD is an iterative extension of the Fast Gradient Sign Method (FGSM).

Fast Gradient Sign Method (FGSM):

\[x^* = P_{\Delta}\Big[x_0 + \eta\,\nabla_x L\big(f_{\theta}(x_0), y\big)\Big] \]

Notation: \(\Delta\) is the allowed perturbation region, and \(P_{\Delta}(\cdot)\) is the projection onto it.

If the learning rate goes to \(\infty\), \(x_0\) can be ignored inside the bracket: after projection we always land on a corner of \(\Delta\).

Then, for an \(\ell_\infty\) ball of radius \(\eta\), the formula simplifies to

\[x^* = x_0 + \eta ~\text{sign}\Big(\nabla_x L\big(f_{\theta}(x_0), y\big)\Big) \]

This is FGSM. It jumps to the solution in a single step, so as an attack it is not very strong.

Projected gradient descent runs FGSM-style steps multiple times to find an adversarial example. That is:

\[x_{t+1} = P_{\Delta}\Big[x_t + \eta ~\text{sign}\Big(\nabla_x L\big(f_{\theta}(x_t), y\big)\Big)\Big] \]

Slower than FGSM, but typically finds better local optima.
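Below is a minimal PyTorch sketch of this loop for an \(\ell_\infty\) ball; the model, loss function, radius, and step size are illustrative assumptions, not fixed by the notes.

```python
import torch

def pgd_attack(model, x0, y, loss_fn, eps=8/255, eta=2/255, steps=10):
    """PGD inside an l_inf ball of radius eps around x0 (a sketch)."""
    x = x0.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = loss_fn(model(x), y)
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x + eta * grad.sign()           # FGSM-style ascent step
            x = x0 + (x - x0).clamp(-eps, eps)  # project back onto Delta
            x = x.clamp(0.0, 1.0)               # keep pixel values valid
    return x
```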

Adversarial training

What we want to do is:

\[\mathop{\min}_{f_{\theta}}\sum_{x, y} \mathop{\max}_{\delta\in \Delta} L\big(f_{\theta}(x+\delta),y\big) \]

Explanation: there is a preset perturbation region \(\Delta\), and within this \(\Delta\) we want the function to produce (as nearly as possible) the same output. We of course want the function to be very robust: a small change to the input should not affect the result. The attacker maximizes the inner term, and we want to train a model that is hard to attack, so naturally we minimize that max.

Algorithm (repeat):

  • select a minibatch \(B\);

  • for each \((x, y) \in B\), compute an adversarial example \(\delta^*(x)\);

  • update the parameters by the rule

\[\theta_{t+1} = \theta_{t} - \frac{\eta}{|B|}\sum_{(x, y) \in B} \frac{\partial}{\partial \theta_t}L\Big(f_{\theta_t}\big(x+\delta^*(x)\big), y\Big) \]
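A minimal sketch of one training step implementing this rule; it reuses the hypothetical pgd_attack above as the inner maximizer, and the optimizer and loss are assumed to be given.

```python
def adversarial_training_step(model, optimizer, loss_fn, x, y):
    """One outer-minimization step: attack the minibatch, then descend."""
    x_adv = pgd_attack(model, x, y, loss_fn)  # inner max: delta*(x) for each x
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)           # loss at x + delta*(x)
    loss.backward()                           # d/dtheta of the outer objective
    optimizer.step()
    return loss.item()
```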

Why does this algorithm make sense?

First, selecting a minibatch is a common operation.

In principle, though, the update should differentiate through the inner max:

\[\theta_{t+1} = \theta_{t} - \frac{\eta}{|B|}\sum_{(x, y) \in B} \frac{\partial}{\partial \theta_t} \mathop{\max}_{\delta\in\Delta}L\Big(f_{\theta}\big(x+\delta\big), y\Big) \]

But by Danskin's theorem:

\[\frac{\partial}{\partial \theta_t} \mathop{\max}_{\delta\in\Delta}L\Big(f_{\theta}\big(x+\delta\big), y\Big) =\frac{\partial}{\partial \theta_t}L\Big(f_{\theta_t}\big(x+\delta^*(x)\big), y\Big)\\ \text{where } \delta^*(x) = \arg\mathop{\max}_{\delta\in\Delta}L\Big(f_{\theta}\big(x+\delta\big), y\Big) \]

Proof omitted. It seems obvious, but it is a very subtle result, and it only applies when \(\delta^*\) is an exact maximizer.

Evaluating robust models

Run PGD with random restarts for many iterations.

Robust models are not universal: e.g., robustness under \(\ell_1\) does not imply robustness under \(\ell_2\) or \(\ell_\infty\).

Adversarial training also yields more meaningful gradients (perhaps a sign of more robust features?).

Task 1: Non-robust features suffice for standard classification

Suppose we attack/relabel every image in a dataset with PGD, obtaining a wrongly-labeled dataset. A neural network trained on this dataset can nonetheless achieve good results.

This is because the model can learn the non-robust features in this dataset (features humans cannot see).

Task 2: Generate a robust feature dataset

We have a robust model; we can use it as a feature-extraction function \(g\).

For every training image \(x\) (a normal image, say a picture of a dog with the label "dog"), we generate \(x_r\) from random initialization such that \(g(x)\approx g(x_r)\). (Roughly: we randomly generate a data point, then keep modifying it, e.g., by adding Gaussian noise or doing gradient descent, and stop once the model also maps it to \(g(x)\).)

So the robust features are similar (necessarily, since we do not stop until they match), while the non-robust features are essentially independent (we start from a random initialization and fine-tune, so the non-robust features can end up being anything).

Intuition: the robust features are what the image looks like to a human eye; the non-robust features are whatever the model happens to predict it as.

We do not have to run adversarial training directly on \(\{x\}\): running plain training on the \(\{x_r\}\) dataset also yields reasonable robust test performance, and the latter is faster.

  • This answers a basic question: is it possible to apply plain training on a small training dataset and get a reasonably robust model? The answer is yes.

  • Why does plain training on \(\{x\}\) not work, while training on \(\{x_r\}\) does?

    • Because the non-robust features of \(x\) are also predictive of the correct label, whereas the non-robust features of \(x_r\) are wrong; on \(\{x_r\}\), only a model that learns the robust features can predict correctly.

It looks like we have done something pointless: first obtain a robust model, then derive other models from that robust model. But is it really pointless?

Obfuscated gradients

PGD requires gradient information, so a natural defense that comes to mind is to hide the gradients.

Shattered gradients: the gradient is non-differentiable or evaluates to invalid values.

  • Non-differentiable operations, numeric instability, nonexistent/incorrect gradients

Exploding and vanishing gradients: the gradient is too hard to compute usefully.

  • multiple iterations of a neural network
  • very deep networks

Stochastic gradients:

  • randomization

Each of these defenses invites a corresponding counterattack:

  • Backward Pass Differentiable Approximation (BPDA): when some part is non-differentiable, replace it with a differentiable approximation (e.g., the identity) on the backward pass.

  • Take multiple samples and average to estimate the expectation, then attack the randomized classifier (see the sketch below).
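A minimal sketch of the second counterattack (often called EOT, expectation over transformation); model_with_randomness is assumed to re-sample its internal randomness on every call.

```python
import torch

def eot_gradient(model_with_randomness, x, y, loss_fn, n_samples=32):
    """Average the loss over the defense's randomness, then take the
    gradient of the average w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    total = sum(loss_fn(model_with_randomness(x), y) for _ in range(n_samples))
    (total / n_samples).backward()
    return x.grad
```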

Provable robust certificates

This is a fairly serious line of theoretical analysis; the catch is that the bounds it gives are quite restrictive.

First, note that it is easy to find adversarial points in high-dimensional space: for a point inside a high-dimensional polytope, it is easy to find a tiny displacement that moves it into a differently-colored region. Very non-robust.

So we modify the classifier: a point's class is determined not by which colored region it lies in, but by the majority color inside the unit ball centered at that point. This is more robust.

How do we derive a bound for this robustness?

greedy filling algorithm!

What we know is the color histogram inside the unit ball centered at \(x\) (not the prediction result at every individual point inside).

What’s the worst-case histogram at \(x+\delta\)? Greedy filling!

Writing the earlier rule ("a point's class is the majority color in the unit ball around it") as a formula for binary classification:

\[g(x) = \int_{\vec{x}+\vec{v} \in B(\vec{x}, 1)} f(\vec{x}+\vec{v}) \Pr(\vec{v}) dv \]

\(g(x+\delta)\) is defined the same way.

We now want to find the largest \(\delta\) that guarantees \(g(x+\delta) > \frac 1 2\) no matter how the space is colored (consistent with the observed histogram).

What is the most adversarial coloring? Color according to the likelihood ratio:

\[LL = \frac{\Pr (\vec{z} - \vec{x})}{\Pr\big(\vec{z} - (\vec{x} + \vec{\delta})\big)} \]

(The probability of a point being sampled depends on its distance to the center, so different centers sample the same point with different probabilities.)

Sort the points by likelihood ratio from largest to smallest, then fill greedily until we reach the end of the budget.

e.g., for the uniform distribution over the ball:

  • points that are only in the ball around \(x\), not in the ball around \(x+\delta\), have ratio \(+\infty\)
  • points in both balls have ratio \(1\)
  • points in the opposite situation to the first case have ratio \(0\)

So we color in the order of these three cases; within each region, the order does not matter.

e.g. 2: for the Gaussian distribution,

the log of the likelihood ratio is a linear function,

and, as computed in the homework, the likelihood ratio decreases along the direction from \(x\) to \(x+\delta\).
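For the Gaussian case, here is a Monte Carlo sketch of evaluating the smoothed classifier \(g\) from the definition above; the binary base classifier f, noise scale, and sample count are illustrative assumptions.

```python
import numpy as np

def smoothed_g(f, x, sigma=0.5, n=10_000, rng=np.random.default_rng(0)):
    """Estimate g(x) = E_v[f(x + v)] for a binary classifier f under
    Gaussian noise; predict class 1 iff the estimate exceeds 1/2."""
    votes = sum(f(x + sigma * rng.standard_normal(x.shape)) for _ in range(n))
    return votes / n
```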

Chapter 6_2: Hyperparameter Tuning

Random search: search the configurations randomly.

Surprisingly, random search has very good performance in practice.

Much better than grid search! Grid search is too dense and costly.
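Below is a minimal sketch of random search; the two-dimensional search space and the train_and_eval callback are hypothetical.

```python
import numpy as np

def random_search(train_and_eval, n_trials=60, seed=0):
    """Sample configurations independently and keep the best one."""
    rng = np.random.default_rng(seed)
    best_cfg, best_score = None, -np.inf
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -1),   # log-uniform learning rate
            "dropout": rng.uniform(0.0, 0.5),  # uniform dropout rate
        }
        score = train_and_eval(cfg)            # user-supplied evaluation
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```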

Bayesian Optimization:

The algorithm:

Step 1 Assume a prior distribution for the loss function \(f\); it gives a mean and a variance for the function value at every point.

Step 2 Select new sample(s) that balances exploration and exploitation.

Either the new sample(s) gives better results or gives more information about f.

Step 3 Update prior with the new sample(s) using Bayes' rule. Go to Step 2.

Key concepts:

  • Surrogate model: a model of the objective function. A surrogate usually has a closed form, computable gradients, or known properties such as convexity or linearity; in short, it is much easier to optimize. More generally, it is simply a learning model: its input is all observed function-value points, and after training it can output an estimate of \(f(x)\) for any given \(x\).

    This also answers our question of where the loss function \(f\) comes from: from a handful of experiment results we fit a continuous model that provides a predictive mean and a predictive variance at every point.

  • Acquisition function: usually a function derived from the surrogate model. Its input is any point of the feasible set \(A\); its output scores how much each input \(x\) is worth observing, combining that point's predictive mean with how much uncertainty an observation there would remove.

Advantages of Bayesian optimization:

  • It exploits prior knowledge to tune hyperparameters efficiently: trials are not independent, and each result guides the choice of the next configuration. This is mainly an improvement over random search, whose samples are sparse and mutually independent, so it cannot use earlier trials to pick the next hyperparameters.
  • It typically suits continuous hyperparameters, e.g., learning rate, regularization coefficient.
  • It is less prone to getting stuck in local optima.

Adapted from: 【转载】AutoML--超参数调优之Bayesian Optimization - marsggbo - 博客园 (cnblogs.com)

Another good introduction: [理解贝叶斯优化 - 知乎 (zhihu.com)](https://www.cnblogs.com/marsggbo/p/10242962.html)
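A minimal sketch of the loop, assuming a Gaussian-process surrogate (scikit-learn) and the expected-improvement acquisition for minimization; all settings here are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(objective, bounds, n_init=5, n_iter=20, seed=0):
    """bounds: array of shape (d, 2); objective: config vector -> loss."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, len(bounds)))
    y = np.array([objective(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                                   # Step 3: update the prior
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(1000, len(bounds)))
        mu, sigma = gp.predict(cand, return_std=True)  # Step 1: mean/variance
        sigma = np.maximum(sigma, 1e-9)
        z = (y.min() - mu) / sigma                     # Step 2: EI balances
        ei = (y.min() - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = cand[np.argmax(ei)]
        X, y = np.vstack([X, x_next]), np.append(y, objective(x_next))
    return X[np.argmin(y)], y.min()
```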

Gradient descent for hyperparameters

Consider a simple example.

Take the linear regression loss:

\[L(w) = \frac{1}{2} \sum_{i=1}^{n} (w^T x_i - y_i)^2 \]

Compute its gradient:

\[\nabla_w L(w) = \sum_{i=1}^{n} (w^T x_i - y_i)\,x_i \]

Then fix the algorithm and take the gradient with respect to the hyperparameters.

Algorithm: gradient descent, with only two update steps:

\[w_1 = w_0 - \eta \nabla_w L(w_0)\\ w_2 = w_1 - \eta \nabla_w L(w_1) \]

where $ \eta $ is the learning rate.

The final objective $ f(w_0, \eta) $ (two hyperparameters: the initial weights and the learning rate) is:

\[f(w_0, \eta) = L(w_2) = \frac{1}{2} \sum_{i=1}^{n} (w_2^T x_i - y_i)^2 \]

What is $ \nabla_\eta f(w_0, \eta) $? Just the chain rule:

\[\begin{align*} \nabla_\eta f(w_0, \eta) &= \nabla_\eta L(w_2) \\ &= \nabla_{w_2} L(w_2) \cdot \nabla_{\eta}\big(w_2(w_0, \eta)\big)\\ &= \sum_{i=1}^{n} (w_2^T x_i - y_i)\,x_i \cdot \nabla_\eta w_2 \end{align*} \]

Since \(w_2 = w_1 - \eta \nabla_w L(w_1)\), we can also compute \(\nabla_\eta w_2\):

\[\nabla_\eta w_2 = \nabla_\eta w_1 - \nabla_w L(w_1) - \eta \nabla_\eta (\nabla_w L(w_1)) \]

$ \nabla_\eta w_1 = \ldots $ and $ \nabla_\eta (\nabla_w L(w_1)) = \ldots $ can both be computed analogously; see the autograd sketch below.
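A minimal autograd sketch of this two-step example; the toy data are hypothetical, and create_graph=True keeps the unrolled updates differentiable with respect to \(\eta\) and \(w_0\).

```python
import torch

X = torch.randn(50, 3)                      # hypothetical toy data
y = X @ torch.tensor([1.0, -2.0, 0.5])

def L(w):                                   # L(w) = 1/2 sum_i (w^T x_i - y_i)^2
    return 0.5 * ((X @ w - y) ** 2).sum()

w0 = torch.zeros(3, requires_grad=True)       # hyperparameter: initialization
eta = torch.tensor(0.01, requires_grad=True)  # hyperparameter: learning rate

g0 = torch.autograd.grad(L(w0), w0, create_graph=True)[0]
w1 = w0 - eta * g0                          # first GD step, kept differentiable
g1 = torch.autograd.grad(L(w1), w1, create_graph=True)[0]
w2 = w1 - eta * g1                          # second GD step

f = L(w2)                                   # f(w0, eta) = L(w2)
grad_eta, grad_w0 = torch.autograd.grad(f, [eta, w0])  # the hypergradients
```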

Now extend this problem further.

For multi-step gradient descent, however, we inevitably face a serious memory problem.

In general, how do we compute $ \nabla_\eta (\nabla_w L(w_i)) $ for $ i=1, \ldots, T $?

Naïve idea: need to store $ w_1, \ldots, w_T $ in memory!

Impossible for neural nets: every $ w_i $ has millions of parameters, and $ T $ could be thousands.

Can we store $ w_i $ in a more efficient way?

First, we introduce the momentum variable \(v\).

SGD with momentum

\(v_{t+1} = \gamma v_{t} - (1-\gamma)\nabla_w L(w_{t})\)

\(w_{t+1} = w_{t} + \eta v_{t+1}\)

Information from the gradient update history.

The current gradient might be noisy. So using a gradient average is more robust

Intuitively, \(v_t\) stores “compressed” information of \(w_1,⋯, w_T\)!

Because we only need \(w_t\) and \(v_t\) to roll the recursion forward, we only have to store one \(w\) and one \(v\) and keep updating them.

Limitations

Does not work for discrete hyperparameters; continuous ones are fine.

Best arm identification

\(n\) arms; playing arm \(i\) yields a reward, a bounded random variable with expectation \(v_i\).

Every time we pick an arm, play it, get an independent sample of its reward

With a fixed budget, how can we find the one with the largest \(v_i\)?

Successive Halving (SH) Algorithm

Idea: each round, eliminate the worse half.

Input: budget \(B\) (the total number of pulls allowed); there are \(n\) arms, and each pull of an arm returns a draw from that arm's reward distribution.

Per-round budget \(B' = B/\log_2(n)\) (each round halves the set, so there are \(\log_2(n)\) rounds; every round gets the same budget no matter how many arms survive, and it is not obvious whether this split is the right one, but it is what the algorithm uses).

for r in range\((0, \log_2(n))\):

  • pull each surviving arm \(B'/|S_r|\) times (\(S_r\) is the set of arms left this round, so the round budget is again split evenly), giving every arm an empirical average;

  • let \(S_{r+1}\) be the half of \(S_r\) with the larger empirical averages.

Keep filtering until only one arm remains.
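A minimal sketch of SH; pull(i) is assumed to return one stochastic reward of arm i.

```python
import numpy as np

def successive_halving(pull, n, B):
    """Split the per-round budget evenly and keep the better half each round."""
    S = list(range(n))
    rounds = max(1, int(np.ceil(np.log2(n))))
    B_round = B // rounds                        # same budget every round
    while len(S) > 1:
        pulls = max(1, B_round // len(S))        # split evenly over survivors
        means = [np.mean([pull(i) for _ in range(pulls)]) for i in S]
        keep = np.argsort(means)[::-1][: len(S) // 2]  # larger empirical mean
        S = [S[j] for j in keep]
    return S[0]

# e.g., Bernoulli arms:
# successive_halving(lambda i: np.random.binomial(1, p[i]), len(p), 10_000)
```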

Guarantee for this algorithm

Assume \(v_1\ge v_2 \ge \dots \ge v_n\), and let \(\Delta_i = v_1 - v_i\).

Theorem: With probability \(1-\delta\), the algorithm finds the best arm with \(B = O\Big(H_2 \log n \log\big(\frac{\log n}{\delta}\big)\Big)\) arm pulls. Here \(H_2 = \mathop{\max}_{i\ge 2} \frac{i}{\Delta_i^2}\).

Proof:

Concentration inequality: if arm 1 has not been eliminated before round \(r\), then for any arm \(i\in S_r\), Hoeffding's bound gives

\[\begin{align*} \Pr(\bar{v}_1 < \bar{v}_i) &= \Pr\Big((\bar{v}_i -\bar{v}_1) - E[\bar{v}_i - \bar{v}_1] \ge \Delta_i\Big)\\ &\le \exp\Big(-\frac{1}{2} t \Delta_i^2\Big) \\ &= \exp\Big(-\frac{1}{2} \frac{B\Delta_i^2}{\log_2(n)|S_r|}\Big) \end{align*} \]

Here \(\bar{v}_i\) and \(\bar{v}_1\) are the empirical averages under this sampling, and \(t\) is the number of pulls per arm, equal to \(\frac{B}{\log_2(n)|S_r|}\).

Define the constant \(n_c = \frac{n}{2^{r+2}}\). Then in round \(r\) we have \(n / 2^r = 4n_c\) arms; denote the bottom \(3/4\) of them as \(S_r'\), so \(|S_r'|= 3n_c\).

Let \(N_r\) be the number of arms in \(S_r'\) with empirical mean larger than arm 1's (bad arms that are truly worse but happen to look good in this round's samples).

\[\begin{align*} E[N_r] &\le \sum_{i\in S_r'} \exp\Big(-\frac{1}{2} \frac{B\Delta_i^2}{\log_2(n)|S_r|}\Big) \\ &\le |S_r'| \exp\Big(-\frac{B\Delta_i^2}{8\log_2(n)n_c}\Big) \end{align*} \]

Trouble: the \(\Delta_i\) term cannot simply be bounded away.

Step back and think about what to do with \(\Delta_i\): if we bound it below by 0, the resulting bound is vacuous.

So instead, first set aside the good terms: designate the top \(1/4\), i.e. the \(n_c\) best arms, as good.

Then every bad arm in \(S_r'\) satisfies \(\Delta_i \ge \Delta_{n_c}\), and the \(1/3\) in the next display is natural: if arm 1 is eliminated, at least \(2n_c\) arms beat it, of which at least \(n_c = \frac{1}{3}|S_r'|\) must come from \(S_r'\).

By Markov inequality:

\[\begin{align*} \Pr(\text{arm 1 is eliminated in round } r) &\le \Pr(\text{at least }n_c \text{ bad arms beat arm 1})\\ &= \Pr\Big(N_r \ge \frac{1}{3}|S_r'|\Big) \\ &\le 3\exp\Big(-\frac{B}{8\log_2(n)}\frac{\Delta_{n_c}^2}{n_c}\Big) \end{align*} \]

In other words, with a high probability, not so many bad arms will have an empirical mean larger than arm 1

Finally, by union bound, the probability of arm 1 getting removed in any round is at most

\[3 \sum_{r=1}^{\log(n)}\exp(-\frac{B}{8\log_2(n)}\frac{\Delta_{n_c}^2}{n_c}) \le 3 \sum_{r=1}^{\log(n)}\exp(-\frac{B}{8H_2\log_2(n)})= 3\log(n)\exp(-\frac{B}{8H_2\log_2(n)}) \]

Explanation of the first inequality:

\[H_2 = \mathop{\max}_{i\ge 2} \frac{i}{\Delta_i^2} \quad\Longrightarrow\quad \frac{1}{H_2} \le \frac{\Delta_{n_c}^2}{n_c} ~~\forall n_c \]

\[\begin{align*} 3\log(n)\exp\Big(-\frac{B}{8H_2\log_2(n)}\Big) &\le \delta\\ -\frac{B}{8H_2\log_2(n)} &\le \log\Big(\frac{\delta}{3\log(n)}\Big)\\ B &\ge 8H_2\log_2(n)\log\Big(\frac{3\log(n)}{\delta}\Big) \end{align*} \]

So a total budget of \(B = O\Big(H_2\log(n)\log\big(\frac{\log(n)}{\delta}\big)\Big)\) pulls suffices.

SH in Hyperparameter Tuning

Every hyperparameter configuration is an arm; we want to minimize its loss.

Instead of drawing samples from a random variable, we are in a nonstochastic setting and pay for more accurate observation:

  • For all \(i \in [n]\), \(k \ge 1\), let \(\ell_{i,k} \in R\) be a sequence for arm \(i\) and assume \(v_i = \lim_{\tau \rightarrow{}\infty} \ell_{i,\tau}\) exists

Change the algorithm accordingly: no averaging over samples; we only care about the last observed value.

Guarantee for SH in hyperparameter tuning

Denote some notation first:

Let \(\gamma_i(t)\) be the smallest non-increasing function of \(t\) such that \(|\ell_{i,t}−v_i|≤\gamma_i(t)\) for all \(t\).

Let \(\gamma_i^{−1}(\alpha)=\min\{t∈N:\gamma_i(t)\le\alpha\}\)

Once we first enter the \(\alpha\)-close region around \(v_i\), we never leave it.

If \(k_i \ge \gamma_i^{−1}(\frac{v_i -v_1}{2})\) and \(k_1 \ge \gamma_1^{−1}(\frac{v_i -v_1}{2})\), then arm \(i\) and arm 1 are separated.

Thm: \(\bar{\gamma}(t) = \mathop{\max}_{i}\gamma_i(t)\) and \(B \ge 2\log n\Big(n + \sum_{i = 2, ..., n}\bar{\gamma}^{-1}(\frac{v_i -v_1}{2})\Big)\), then SH returns the best arm.

Proof:

Note that the per-round budget satisfies \(B^{\prime} = B/\log_2 n \ge 2\Big(n + \sum_{i=2}^{n} \bar{\gamma}^{-1}(\frac{v_i - v_1}{2})\Big)\).

Each arm receives \(B^{\prime}/|S_r|\) pulls, which is larger than \(\bar{\gamma}^{-1}\Big(\frac{v_{\Big\lfloor \frac{|S_r|}{2} \Big\rfloor+1} - v_1}{2}\Big)\).

Why is that?

First, the smaller \(i\) is, the larger \(\bar{\gamma}^{-1}\big(\frac{v_i -v_1}{2}\big)\) is: the smaller the gap, the harder the arm is to distinguish from arm 1, so the more samples are needed. So shrink each of the roughly \(\frac{|S_r|}{2}\) largest terms down to the value at index \(\big\lfloor \frac{|S_r|}{2} \big\rfloor+1\), and simply drop the remaining terms:

\[\begin{align*} 2\Big(n + \sum_{i=2}^{n} \bar{\gamma}^{-1}\Big(\frac{v_i - v_1}{2}\Big)\Big) &\ge 2 \sum_{i=2}^{n} \bar{\gamma}^{-1}\Big(\frac{v_i - v_1}{2}\Big) \\ &\ge 2\, \frac{|S_r|}{2}\,\bar{\gamma}^{-1}\Big(\frac{v_{\Big\lfloor \frac{|S_r|}{2} \Big\rfloor+1} - v_1}{2}\Big)\\ \Longrightarrow\quad \frac{B'}{|S_r|} \ge \bar{\gamma}^{-1}\Big(\frac{v_{\Big\lfloor \frac{|S_r|}{2} \Big\rfloor+1} - v_1}{2}\Big) &\ge \bar{\gamma}^{-1}\Big(\frac{v_i - v_1}{2}\Big) \quad \forall ~i \ge \Big\lfloor \frac{|S_r|}{2} \Big\rfloor+1 \end{align*} \]

So in round \(r\), arm \(\Big\lfloor\frac{|S_r|}{2}\Big\rfloor+1\) and arm \(1\) are separated, so each round we can safely discard half the arms.

Therefore, we will identify arm 1 in \(S_{\log_2(n)}\).

Neural architecture search (NAS)

Given a specific task, find the best network structure for it.

One approach is reinforcement learning.

Random search also works.

ProxylessNAS

Conventional NAS algorithms need lots of GPU hours.

So they can only be used on proxy tasks:

train on a small dataset, learn with only a few blocks, train for only a few epochs; anything more is too costly.

We want to improve on this so NAS can also be applied to large tasks. Hence ProxylessNAS.

For every layer (edge), consider all possibilities.

N components \(\{o_i\}\): Conv (with different filter size), Identity, pool

Train the structure (the probability of each layer choice being selected) and the weights (the weights and biases inside the network) together!

Given input \(x\), the mixed layer’s output is based on all of its components:

One-shot (Bender et al., 2018) Output: \(\sum_{i=1}^{N} o_i(x)\)

Performance is not so good.

DARTS (Liu et al., 2018): with weights \(\alpha_i\). Output: \(\sum_{i=1}^{N} p_i o_i(x)=\sum_{i=1}^{N} \frac{e^{\alpha_i}}{\sum_{j} e^{\alpha_j}} o_i(x)\)

Not memory efficient, because it stores all \(N\) possible paths, while the final model contains only one path.

Binarize the path: define a binary gate variable (a one-hot vector).

\(g = \begin{cases} [1, 0, \dots, 0] & \text{w.p. } p_1 \\ \vdots \\ [0, 0, \dots, 1] & \text{w.p. } p_N \end{cases}\)

Output relies on only one sampled path:
\(\sum_{i=1}^{N} g_i o_i(x) = \begin{cases} o_1(x) & \text{w.p. } p_1 \\ \vdots \\ o_N(x) & \text{w.p. } p_N \end{cases}\)

Saves a huge amount of memory, while the expectation stays the same, which is what makes the sampling operation so clever.

How to train?

Alternating training of the network structure and the weights:
When training the weights, freeze \(\alpha_i\) and sample structures.
When training \(\alpha_i\), freeze the weights.

How to learn \(\alpha_i\)?
\(\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial p_j} \frac{\partial p_j}{\partial \alpha_i} \approx \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \frac{\partial p_j}{\partial \alpha_i}\)

The chain rule drops the non-differentiable factor \(\frac{\partial g_j}{\partial p_j}\), approximating \(\frac{\partial L}{\partial p_j} \approx \frac{\partial L}{\partial g_j}\).

Here \(g_j\) is the binarized value.

\(\sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \frac{\partial p_j}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} p_j (\delta_{ij} - p_i)\)

Where \(\delta_{ij} = 1\) if \(i = j\) and \(0\) otherwise.
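A sketch of one structure-gradient evaluation under the binarized gate. For clarity it evaluates every op to obtain all \(\partial L/\partial g_j\) (the paper additionally samples paths to save memory), and ops/loss_fn are assumed given.

```python
import torch
import torch.nn.functional as F

def binarized_structure_grad(ops, alpha, x, y, loss_fn):
    """Sample a one-hot gate g ~ p = softmax(alpha), get dL/dg_j, then map
    it to dL/dalpha_i via dp_j/dalpha_i = p_j (delta_ij - p_i)."""
    p = F.softmax(alpha, dim=0).detach()
    j = torch.multinomial(p, 1).item()            # pick path j w.p. p_j
    g = torch.zeros_like(p)
    g[j] = 1.0
    g.requires_grad_(True)
    out = sum(g[i] * op(x) for i, op in enumerate(ops))  # equals o_j(x) in value
    dg = torch.autograd.grad(loss_fn(out, y), g)[0]      # dL/dg_j for all j
    n = len(ops)
    eye = torch.eye(n)
    return torch.stack([(dg * p * (eye[i] - p[i])).sum() for i in range(n)])
```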

Chapter 6_3: Differential Privacy

Notation

Denote a database \(x\) as a collection of records from \(X\), where \(X\) is a discrete set. We represent databases by their histograms: \(x \in \mathbb{N}^{|X|}\). Each entry \(x_i\) represents the number of elements in the database of type \(i \in X\), with every entry \(x_i \geq 0\).

The \(\ell_1\) distance between two databases is defined as: \(\|x - y\|_1 = \sum_{i=1}^{|X|} |x_i - y_i|\)

The \(\ell_1\) norm \(\|x\|_1\) measures the size of a database, and \(\|x - y\|_1\) measures how many records differ between \(x\) and \(y\).

Definition

A randomized algorithm \(M\) with domain \(\mathbb{N}^{|X|}\) is \((\epsilon, \delta)\)-differentially private if for all \(S \subseteq \text{Range}(M)\) and for all \(x, y \in \mathbb{N}^{|X|}\) such that \(\|x - y\|_1 \leq 1\):

\[\Pr[M(x) \in S] \leq \exp(\epsilon) \Pr[M(y) \in S] + \delta \]

In other words, if two databases differ by 1 element, the probability that \(M\) generates any outcome \(S\) is similar for them!

Example 1: coin toss

Q: Do you have $1M in your pocket? If you do, you may refuse to tell the truth.

How about the following protocol:

  • You toss a coin
  • If it's heads
    • toss another coin, and report yes if it's heads and no otherwise
  • If it's tails, report the truth

If a fraction \(p\) of people indeed have $1M in their pocket, the survey yields an expected yes-rate of \(3p/4 +(1−p)/4=1/4+p/2\).

So we can still recover \(p\). At the same time, we protected the privacy.

Randomized response is \((\ln 3, 0)\)-DP, relaxing the neighboring condition to \(\|x−y\|_1=2\) (in the Wikipedia convention, \(x, y\) differ by one element, i.e., one person's answer changes two histogram entries).

If a person really has the $1M, they say no with probability 1/4 and yes with probability 3/4.

If a person does not have the $1M, they say yes with probability 1/4 and no with probability 3/4.

So in both directions the output probabilities differ by a factor of 3, which is \(e^{\ln 3}\).
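A simulation sketch of the protocol, recovering \(p\) from the yes-rate; the population size and the true \(p\) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n = 0.3, 100_000

truths = rng.random(n) < p_true        # who actually has the $1M
first = rng.random(n) < 0.5            # first coin: heads -> answer randomly
second = rng.random(n) < 0.5           # second coin: random yes/no
answers = np.where(first, second, truths)

yes_rate = answers.mean()              # approx 1/4 + p/2
p_hat = 2 * (yes_rate - 0.25)          # invert the formula to recover p
print(p_hat)                           # approx 0.3
```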

Example 2: Laplace mechanism

First, define the \(\ell_1\) sensitivity \(\Delta f\).

We look at the case where the query returns \(k\) real numbers, i.e., the range of the function is \(\subset \mathbb{R}^k\).

The \(\ell_1\) sensitivity of a function \(f: \mathbb{N}^{|X|} \to \mathbb{R}^k\) is

\[\Delta f = \max_{x,y \in \mathbb{N}^{|X|}; \|x-y\|_1=1} \|f(x) - f(y)\|_1 \]

It measures how a single change in the database will change \(f\).

The Laplace distribution with scale \(b\) has density \(\text{Lap}(b)(x) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)\) and variance \(\sigma^2 = 2b^2\).

Given any function \(f: \mathbb{N}^{|X|} \to \mathbb{R}^k\), the Laplace mechanism is defined as

\[M_L(x, f, \epsilon) = f(x) + (Y_1, \dots, Y_k) \]

where \(Y_i\) are independent and identically distributed random variables drawn from \(\text{Lap}(\Delta f/\epsilon)\).

The scale of noise is calibrated to sensitivity and \(\epsilon\).
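A minimal sketch of the mechanism; the counts and privacy budget are illustrative.

```python
import numpy as np

def laplace_mechanism(f_x, sensitivity, epsilon, rng=np.random.default_rng(0)):
    """Add i.i.d. Lap(sensitivity/epsilon) noise to each coordinate of f(x)."""
    return f_x + rng.laplace(scale=sensitivity / epsilon, size=np.shape(f_x))

counts = np.array([120, 55, 301])                 # a counting query, Delta f = 1
private = laplace_mechanism(counts, sensitivity=1.0, epsilon=1.0)  # (1, 0)-DP
```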

Theorem: The Laplace mechanism preserves \((\epsilon, 0)\)-differential privacy.

Proof: Let \(p_x\) be the probability density function of \(M_L(x, f, \epsilon)\), \(p_y\) be the probability density function of \(M_L(y, f, \epsilon)\). \(\|x-y\|_1 \leq 1\). We compare the two at some \(z \in \mathbb{R}^k\).

\[\begin{align*} \frac{p_x(z)}{p_y(z)} &= \frac{\prod_{i=1}^k\exp\left(-\frac{\epsilon |f(x)_i - z_i|}{\Delta f}\right)}{\prod_{i=1}^k\exp\left(-\frac{\epsilon |f(y)_i - z_i|}{\Delta f}\right)} \\ &= \prod_{i=1}^k \exp\left(\frac{\epsilon (|f(y)_i - z_i| - |f(x)_i - z_i|)}{\Delta f}\right) \\ &\leq \prod_{i=1}^k \exp\left(\frac{\epsilon |f(x)_i - f(y)_i|}{\Delta f}\right) ~~\text{(triangle inequality)}\\ &= \exp\left(\frac{\epsilon \cdot \|f(x) - f(y)\|_1}{\Delta f}\right) ~~\text{(product of exponentials = exponential of the sum)}\\ &\leq \exp(\epsilon) ~~\text{(definition of sensitivity)} \end{align*} \]

Now let the counting query be the function \(f\) and see what the Laplace mechanism gives us.

Clearly, the sensitivity of a counting query is 1.

If \(Y \sim \text{Lap}(b)\), then \(\Pr[|Y| \geq t \cdot b] = \exp(-t)\) (a simple integration).

Theorem: Let \(f: \mathbb{N}^{|X|} \to \mathbb{R}^k\), and let \(z= M_L(x, f, \epsilon)\). Then for all \(\delta \in (0, 1]\):

\[\Pr\Big[\|f(x) - z\|_{\infty} \geq \ln(k/\delta) \cdot (\Delta f/\epsilon)\Big] \leq \delta \]

Proof:

\[\begin{align*} \text{LHS} &=\Pr\Big[\max_{i \in [k]} |Y_i| \geq \ln(k/\delta) \cdot (\Delta f/\epsilon)\Big] \quad\text{where } Y_i \sim \text{Lap}\Big(\frac{\Delta f}{\epsilon}\Big)\\ &\leq k \cdot \Pr\Big[|Y_i| \geq \ln(k/\delta) \cdot (\Delta f/\epsilon)\Big] = k \cdot (\delta/k) = \delta \end{align*} \]

Suppose we want to count, among 1 billion people, the occurrences of 10,000 potential names.

We can simultaneously estimate the frequencies of all \(k=10{,}000\) names with \((1,0)\)-DP, i.e., \(\epsilon = 1\).

With probability 95% (\(\delta = 0.05\)), no estimate will be off by more than \(\ln(10000/0.05)≈12.2\).

This is a low error for 1 billion people!

An important property: differential privacy is immune to post-processing.

Let \(M: \mathbb{N}^{|X|} \to \mathbb{R}\) be a randomized algorithm that is \((\epsilon, \delta)\)-differentially private. Let \(f: \mathbb{R} \to \mathbb{R}'\) be an arbitrary mapping (possibly random). Then \(f \circ M: \mathbb{N}^{|X|} \to \mathbb{R}'\) is \((\epsilon, \delta)\)-differentially private.

Proof

Suffices to prove this for a deterministic function \(f\).

  1. Decompose \(f\) into a convex combination of deterministic functions.

  2. A Convex combination of differentially private mechanisms is differentially private (by definition).

Fix \(x, y\) such that \(\|x - y\|_1 \leq 1\), fix any event \(S \subseteq \mathbb{R}'\). Let \(T = \{r \in \mathbb{R} : f(r) \in S\}\).

\[\begin{align*} \Pr[f(M(x)) \in S] &= \Pr[M(x) \in T] \\ &\leq \exp(\epsilon) \Pr[M(y) \in T] + \delta \\ &= \exp(\epsilon) \Pr[f(M(y)) \in S] + \delta \end{align*} \]

What can we promise using the DP mechanism?

Suppose individual \(i\) has some preferences over a set of future events, denoted as \(A\).

We use \(u_i: A \to \mathbb{R}_{\geq 0}\) to represent the utility. \(i\) has utility \(u_i(a)\) if event \(a \in A\) becomes true.

Assume \(f: \text{Range}(M) \to \Delta(A)\) is any function that determines the distribution of future events based on the output of \(M\). Consider the composition \(f(M(x))\), where \(x\) is some voting result.

For example, \(M\) is a counting query, and \(f\) is weighted random sampling.

If \(M\) is differentially private, we know that individual \(i\)'s expected utility will not be harmed by more than an \(\exp(\epsilon) \approx (1 + \epsilon)\) factor if anyone changes his/her vote.

Proof:

\[\begin{align*} E_{a \sim f(M(x))}[u_i(a)] &= \sum_{a \in A} u_i(a) \cdot \Pr_{f(M(x))}[a]\\ &\leq \sum_{a \in A} u_i(a) \cdot \exp(\epsilon) \Pr_{f(M(y))}[a] \\ &= \exp(\epsilon) E_{a \sim f(M(y))}[u_i(a)] \end{align*} \]

Similarly: \(E_{a \sim f(M(x))}[u_i(a)] \geq \exp(-\epsilon) E_{a \sim f(M(y))}[u_i(a)]\)

This holds independently of the individual \(i\)’s utility function and simultaneously for multiple individuals who may have completely different utility functions.

Chapter 6_4: Learning-augmented algorithms

Problems of classical algorithms in the big data era

Worst case guarantees

  • Well-defined task, data format

  • Running time/performance guarantees in the worst case

Pros:

  • Data oblivious: perfectly solved all cases

Cons:

  • Data oblivious: not using the real-world data distribution

For big companies, data distribution rarely changes. Can we design better algorithms if we know this pattern? Certainly!

What is the bottom line?

  • Distributions are similar, but not exactly the same

  • Distributions can sometimes be very different (e.g., the 11.11 shopping festival or Chinese New Year)

  • Better performance for common daily pattern, reasonably good performance for special cases (may be worse than classical).

Three different kinds of combinations

  • Completely replace the classical algorithm

    • Based on training data, end-to-end training
    • Not necessarily the optimal strategy, because ML may fail to capture the structural properties
  • Replace one gadget of the classical algorithm

    • Use ML to replace this gadget to get better performance
    • Nice coupling of classical algorithm and ML
  • Give oracle advice for some decisions of the classical algorithm

    • Simplest combination, does not affect how the algorithm works
    • E.g., search orders

Among systems researchers, this direction was indeed popular for a while.

Example: Index all integers from 900 to 800M

We can use a B-Tree to do \(O(\log n)\) lookup/insertion.

Or data_array[lookup_key-900] to do \(O(1)\) lookup, because we know the structure.

What is a B-Tree? A B-Tree maps a key to a range in which the needed data lies; searching inside this range gives us the data.

What can we do?

Replace the B-tree with a model

  • Find an item: key \(\rightarrow{}\) position estimate, then binary search in [pos-err_min, pos+err_max]
    • err_min and err_max are known from the training process.

Because you only care about the data in your database!

There is no generalization problem here (though this may not be true in other learning-augmented algorithm settings).
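A minimal sketch: a least-squares linear model predicts the position, and a bounded binary search corrects the residual error. The linear model stands in for whatever learned model is actually used.

```python
import numpy as np

class LearnedIndex:
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys))
        pos = np.arange(len(self.keys))
        self.a, self.b = np.polyfit(self.keys, pos, 1)   # "train" key -> pos
        err = pos - (self.a * self.keys + self.b)
        self.err_min = int(np.floor(err.min()))          # known after training
        self.err_max = int(np.ceil(err.max()))

    def lookup(self, key):
        pred = int(self.a * key + self.b)                # position estimate
        lo = max(0, pred + self.err_min)
        hi = min(len(self.keys), pred + self.err_max + 1)
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))  # bounded search
        return i if i < len(self.keys) and self.keys[i] == key else -1
```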

Potential advantages of learned B-Tree model

  • Smaller indexes \(\rightarrow{}\) less storage
  • faster lookup
  • More parallelism \(\rightarrow{}\) sequential if-statements are exchanged for multiplications
  • Hardware accelerators \(\rightarrow{}\) lower power, better $/compute

However, evaluating the learned model can be very slow.

We can mix it with some B-Tree nodes, depending on which one works better.

When searching in the min-max range, can we do better?

Exponential search is faster: the asymptotic complexity is the same, but given the prior distribution of real-world data (small prediction errors), this method is genuinely better in practice.
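A sketch of exponential search from a predicted position: double the step until the key is bracketed, then binary search inside the bracket, so the cost grows with the logarithm of the prediction error.

```python
import bisect

def exponential_search(keys, key, pred):
    """keys: sorted list; pred: predicted index of key."""
    n = len(keys)
    lo = max(0, min(pred, n - 1))
    hi = lo + 1
    step = 1
    while lo > 0 and keys[lo] > key:       # expand left until keys[lo] <= key
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < n and keys[hi - 1] < key:   # expand right until keys[hi-1] >= key
        hi = min(n, hi + step)
        step *= 2
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else -1
```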

Chapter 6_5: Interpretability

Why do we need interpretability?

A single output of an ML algorithm is not enough, especially when the decision is important. Why shall I trust you? Explain to me!

What is good interpretability?

Good interpretability seems to be:

  • Human readable
  • Short enough for human beings
  • Can be verified

LIME: Local Interpretable Model-agnostic Explanations

\[\xi(x) = \text{argmin}_{g} L(f,g,\Pi_x) + \Omega(g) \]

Explanation of the formula:

We want a good model \(g\) (here, linear) that approximates \(f\) locally while remaining simple.

\(f\): the complicated neural network

\(g\): a linear model

\(\Omega(g)\): the complexity of \(g\)

  • E.g., how many non-zero coefficients are in \(g\)?

\(\Pi_x\): a sampling mechanism near \(x\)

  • Sample more points near \(x\), because we only care about neighborhood

\(L(f,g,\Pi_x)\): the loss of linear model \(g\) on the sampled points based on \(\Pi_x\), where the target is \(f\).

For example:

\(g\) is sparse, with at most \(k\) non-zero coefficients.

  • Hard constraint, can be transformed into a soft regularization term similar to Lasso.

\(Z\) is the input domain.\(\Pi_x(z) = \exp\left(-\frac{D(x,z)^2}{\sigma^2}\right)\), where \(D\) is some distance metric.

\[L(f,g,\Pi_x) = \sum_{z \in Z} \Pi_x(z) (f(z) - g(z))^2 \]

For natural language processing (NLP) tasks, the features of \(x\) are words (a perturbation replaces a word of \(x\) with another word to get \(z\)).

For image tasks, the features of \(x\) are superpixels obtained through segmentation (a perturbation swaps out some superpixels for others).
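A minimal LIME-style sketch for a tabular input; the Gaussian perturbations and the ridge surrogate (standing in for the sparse \(g\)) are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(f, x, n_samples=500, sigma=1.0, seed=0):
    """Sample near x, weight samples by Pi_x, fit a local linear surrogate g."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))  # samples near x
    D = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(D ** 2) / sigma ** 2)                 # Pi_x(z)
    y = np.array([f(z) for z in Z])                          # black-box outputs
    g = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return g.coef_                       # local attribution for each feature
```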

Limitation:

Information is not always local.

How to improve this?

Attribution and baseline are what we use.

Compare the difference between baseline and current input x. Why do we believe x is a tiger? Not because of the local information of x. But because of how the baseline (black image) changes to x, also changes our prediction.

By thinking in this way, we can solve the locality issue of Lime.

Integrated gradients

Some fundamental Axioms for attribution:

Sensitivity:

  • If an input and a baseline differ in only one feature yet receive different predictions, then the differing feature should be given a non-zero attribution.
  • If the function implemented by the deep network does not depend (mathematically) on some variable, then the attribution to that variable is always zero.

Implementation Invariance:

  • No matter how you implement the algorithm you should get the same outcome.

Completeness:

  • The attributions of all variables add up to the difference between the output of \(F\) at the input \(x\) and the baseline \(x_0\).

Linearity:

  • If we linearly compose two deep networks modeled by the functions \(f_1\) and \(f_2\) to form a third network that models the function \(af_1 + bf_2\), then the attribution for the third network should also preserve this linearity.

Symmetry preserving:

  • E.g., \(f = \text{Sigmoid}(x_1 + x_2 + \dots)\), then \(x_1\) and \(x_2\) should have the same attribution because they are symmetric.

What is the algorithm that satisfies all axioms?

\[\text{IntegratedGrads}_i(x) = (x_i - x_i') \times \int_{\alpha=0}^1 \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} \,d\alpha \]

where \(x′\) is the baseline.
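A Riemann-sum sketch of this integral; F is assumed to be a differentiable scalar-output PyTorch function.

```python
import torch

def integrated_gradients(F, x, baseline, steps=50):
    """Approximate the path integral along the straight line baseline -> x."""
    total = torch.zeros_like(x)
    for a in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        total += torch.autograd.grad(F(xi).sum(), xi)[0]  # dF/dx at this point
    return (x - baseline) * total / steps   # sums approx to F(x) - F(baseline)
```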

Completeness is satisfied:

\[\sum_{i=1}^n \text{IntegratedGrads}_i(x) = F(x) - F(x') \]

Linearity and symmetry preservation are also satisfied. Intuition:

This is the only path (direct straight-line path between x and x′) to preserve symmetry.

Shapley value algorithm: SHAP

The Shapley value algorithm is similar to IG but for discrete variables.

Assume we have \(x_1, x_2, \ldots, x_n\) input variables, and the output is the function \(f\).

We want to know each \(x_i\)'s contribution, but \(f\) is not a linear function, so neither \(f(\{x_i\})\) nor \(f(U) - f(U\setminus\{x_i\})\) fairly reflects \(x_i\)'s contribution.

\(f(\text{null}) = 0\), i.e., when no variables are present, we get 0.

\[\phi_i(f) = \sum_{S \subseteq [n]\backslash\{i\}} \frac{|S|!(n-|S|-1)!}{n!}(f(S\cup\{i\})-f(S)) \]

Enumerate all possible base sets \(S\), and measure the difference between including \(i\) and excluding \(i\).

Taking this weighted average guarantees \(\sum_i \phi_i(f) = f([n])\).
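An exact-enumeration sketch of this formula; f is assumed to map a set of "present" variables to a real number, which is feasible only for small n (2^n evaluations).

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n):
    """phi_i = sum over S not containing i of |S|!(n-|S|-1)!/n! (f(S+i)-f(S))."""
    players = set(range(n))
    phi = [0.0] * n
    for i in range(n):
        for k in range(n):
            for S in combinations(players - {i}, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (f(set(S) | {i}) - f(set(S)))
    return phi   # sums to f(players) - f(empty set)
```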

SHAP satisfies efficiency (the same as IG's "completeness"), symmetry, linearity, etc.

However, attribution may not always lead to a good interpretation either.

There is still a long way to go.