Error Bound Analysis in Machine Learning

Error Bound Theorems in ML

Basic Concepts

\[\varepsilon(h) = \text{Generalization Error} = E_{(x,y)\sim D}[1\{h(x) \neq y\}] \]

\[\hat{\varepsilon}(h)=\text{Empirical Error}=\frac{1}{m}\sum_{i=1}^m 1\{h(x^{(i)})\neq y^{(i)}\} \]

\(\hat{\varepsilon}(h)\) is computed from the finite training set, whereas \(\varepsilon(h)\) is an expectation over the full distribution \(D\).
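As a concrete illustration, the sketch below draws a finite sample from a toy distribution and compares the empirical error of a fixed hypothesis with its generalization error. The hypothesis, distribution, and helper names are assumptions made up for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    """A fixed hypothesis: predict 1 when x > 0."""
    return (x > 0).astype(int)

def sample(m):
    """Toy distribution D: x ~ N(0, 1), and the label of x > 0 is flipped with prob 0.1."""
    x = rng.normal(size=m)
    y = ((x > 0).astype(int) ^ (rng.random(m) < 0.1)).astype(int)
    return x, y

# Empirical error on a finite sample of m points.
x, y = sample(1000)
emp_err = np.mean(h(x) != y)

# Generalization error: the label noise makes epsilon(h) = 0.1 exactly for this toy D.
print(f"empirical error = {emp_err:.3f}, generalization error = 0.100")
```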

\[\begin{aligned} \varepsilon(g) &= \text{Irreducible Error} \\ \varepsilon(h^*)-\varepsilon(g) &= \text{Approximation Error} \\ \varepsilon(\hat{h})-\varepsilon(h^*) &= \text{Estimation Error} \end{aligned} \]

where \(g\) is the best possible hypothesis overall (the Bayes-optimal predictor), \(h^*\) is the best hypothesis within your class \(\mathcal{H}\), and \(\hat{h}\) is the hypothesis learned from your limited training data. So

\[\varepsilon(\hat{h})=\text{Estimation Error}+\text{Approximation Error}+\text{Irreducible Error} \]

The estimation error can be further decomposed into estimation variance and estimation bias.

Empirical Risk Minimization

Empirical risk minimization (ERM) is a simple learning algorithm: pick the hypothesis in \(\mathcal{H}\) that minimizes the training error,

\[\hat{h}_{ERM}= \arg\,\min_{h\in\mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}1\{h(x^{(i)})\neq y^{(i)}\} \]
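A minimal sketch of ERM over a finite class, assuming a toy class of 1-D threshold classifiers and a synthetic training set (the class, data, and names are illustrative, not part of the original notes):

```python
import numpy as np

def erm(hypotheses, x, y):
    """Empirical risk minimization over a finite hypothesis class:
    return the hypothesis with the lowest training (0-1) error."""
    errors = [np.mean(h(x) != y) for h in hypotheses]
    return hypotheses[int(np.argmin(errors))], min(errors)

# Finite class H: threshold classifiers h_t(x) = 1{x >= t} for a grid of t.
thresholds = np.linspace(-2, 2, 41)
H = [lambda x, t=t: (x >= t).astype(int) for t in thresholds]

# Toy training set (distribution chosen only for illustration).
rng = np.random.default_rng(1)
x_train = rng.normal(size=200)
y_train = (x_train >= 0.3).astype(int)

h_hat, train_err = erm(H, x_train, y_train)
print(f"training error of ERM hypothesis: {train_err:.3f}")
```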

Uniform Convergence

To analyze ERM, think about how these two pairs of quantities relate:

  • \(\hat{\varepsilon}(h)\) and \(\varepsilon(h)\)
  • \(\varepsilon(\hat{h})\) and \(\varepsilon(h^*)\)

We use two lemmas:

  • Union Bound

    \(P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k)\)

  • Hoeffding Inequality: let \(Z_1,\dots,Z_m\) be i.i.d. Bernoulli(\(\phi\)) random variables and \(\hat{\phi}=\frac{1}{m}\sum_{i=1}^m Z_i\). Then for any \(\gamma > 0\) (a quick Monte Carlo check of this bound follows the list),

    \(P(|\phi -\hat{\phi}| > \gamma) \le 2 \exp(-2{\gamma}^2 m)\)
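The sketch below estimates the tail probability by simulation and compares it with the Hoeffding bound; the values of \(\phi\), \(m\), and \(\gamma\) are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of Hoeffding's inequality for Bernoulli(phi) samples.
rng = np.random.default_rng(2)
phi, m, gamma, trials = 0.3, 100, 0.1, 20000

# Estimate P(|phi_hat - phi| > gamma) by Monte Carlo.
phi_hat = rng.binomial(m, phi, size=trials) / m
tail = np.mean(np.abs(phi_hat - phi) > gamma)

bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"empirical tail = {tail:.4f}  <=  Hoeffding bound = {bound:.4f}")
```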

Bound Analysis for Finite Hypothesis Space

Let \(|\mathcal{H}|=k\), and let any \(m\) (the number of training examples) and \(\delta\) (the failure probability we are willing to tolerate) be fixed. Then with probability at least \(1-\delta\), we have that

\[\varepsilon(\hat{h}) \le (\min_{h\in \mathcal{H}}\varepsilon(h))+2\sqrt{\frac{1}{2m}\log \frac{2k}{\delta}} \]

Let \(\gamma\) denote the square-root term, \(\gamma = \sqrt{\frac{1}{2m}\log \frac{2k}{\delta}}\).

Then the generalization error of the \(\hat{h}\) we get from the training set is at most \(2\gamma\) higher than that of the best hypothesis \(h^*\) in the class.

If the hypothesis space \(\mathcal{H}\) gets larger, the first term decreases (lower bias), while the second term increases (higher variance).

Equivalently, the number of training examples \(m\) required to reach a given \(\gamma\) and \(\delta\) grows only logarithmically with the size \(k\) of the hypothesis space: \(m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta}\).
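A small helper sketch for this finite-class bound; it evaluates the square-root term and the implied sample complexity for illustrative values of \(k\), \(\delta\), and \(\gamma\) (the function names are made up for this example).

```python
import numpy as np

def gamma_bound(m, k, delta):
    """The sqrt term in the finite-class bound: gamma = sqrt(log(2k/delta) / (2m))."""
    return np.sqrt(np.log(2 * k / delta) / (2 * m))

def m_required(gamma, k, delta):
    """Samples needed so the sqrt term is at most gamma: m >= log(2k/delta) / (2 gamma^2)."""
    return int(np.ceil(np.log(2 * k / delta) / (2 * gamma**2)))

# Illustrative values: making k a thousand times larger adds only a modest amount to m.
for k in (10, 10_000, 10_000_000):
    print(k, m_required(gamma=0.05, k=k, delta=0.05))
```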

Bound Analysis for Infinite Hypothesis Space

Vapnik-Chervonenkis dimension

\(\mathrm{VC}(\mathcal{H})\) is the size of the largest set that can be shattered by \(\mathcal{H}\). Write \(d = \mathrm{VC}(\mathcal{H})\).

"shatter" means for labels(like the answers for X in \(\mathcal{X}\)), there exists some \(h \in \mathcal{H}\) can satisfy.

\[\varepsilon(\hat{h}) \le \varepsilon(h^*) + O(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}) \]

\[m = O_{\gamma,\delta}(d) \]

In other words, the number of training examples \(m\) grows roughly linearly with the VC dimension \(d\), which here plays the role of the "size" of \(\mathcal{H}\).
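The constants hidden by the \(O(\cdot)\) are unspecified, but we can still evaluate the term inside it to see this rough linear scaling of \(m\) with \(d\); a small sketch with illustrative values:

```python
import numpy as np

def vc_term(d, m, delta):
    """The quantity inside the O(.) of the VC bound (up to unknown constants)."""
    return np.sqrt((d / m) * np.log(m / d) + (1 / m) * np.log(1 / delta))

# Keeping m proportional to d keeps the term roughly fixed as d grows.
for d in (10, 100, 1000):
    for m in (10 * d, 100 * d):
        print(f"d={d:5d}  m={m:7d}  term={vc_term(d, m, 0.05):.3f}")
```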
