PRML 5: Kernel Methods

　　A kernel function computes the inner product of two feature vectors in some (usually high-dimensional) feature space without constructing the feature mapping explicitly, so that a classification problem that is not linearly separable in the input space may become linearly separable in the feature space. This kernel trick applies to many models that depend on the data only through inner products of feature vectors, such as the SVM, which we have introduced in previous articles.

　　To test the validity of a kernel function, we use Mercer's theorem: a function $k:\mathbb{R}^m\times\mathbb{R}^m\rightarrow\mathbb{R}$ is a Mercer kernel iff for every finite set $\{\vec{x}_1,\vec{x}_2,...,\vec{x}_n\}$, the corresponding kernel matrix $K$ with entries $K_{ij}=k(\vec{x}_i,\vec{x}_j)$ is symmetric positive semi-definite.
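
　　As a rough illustration of this criterion, the sketch below builds the kernel matrix for a handful of points and evaluates the quadratic form $\vec{z}^TK\vec{z}$ for random vectors $\vec{z}$. The function names, the sample points, and the counterexample $k(\vec{x},\vec{y})=-||\vec{x}-\vec{y}||^2$ are assumptions made for this example. A non-negative result is only a necessary spot check on one finite set, not a proof that a kernel is valid.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def neg_sq_distance(x, y):
    """Counterexample (not a Mercer kernel): k(x, y) = -||x - y||^2."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))

def gram_matrix(kernel, points):
    """Kernel matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))

def min_quadratic_form(K, trials=500, seed=0):
    """Smallest z^T K z found over random Gaussian vectors z (a spot check only)."""
    rng = random.Random(seed)
    n = len(K)
    worst = float("inf")
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        q = sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))
        worst = min(worst, q)
    return worst

# Hypothetical sample points in R^2.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K_good = gram_matrix(gaussian_kernel, points)
K_bad = gram_matrix(neg_sq_distance, points)
```

　　On these points the Gaussian kernel matrix never yields a negative quadratic form, while the counterexample does, so it cannot be a Mercer kernel.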

　　One widely used kernel is the Gaussian kernel $k(\vec{x}_m,\vec{x}_n)=\exp\{-\frac{1}{2\sigma^2}||\vec{x}_m-\vec{x}_n||^2\}$, whose feature space is infinite-dimensional. Another is the polynomial kernel $k(\vec{x}_m,\vec{x}_n)=(\vec{x}_m^T\vec{x}_n+c)^M$ with $c>0$. In practice, we can construct new kernels from simple valid ones using closure properties: for example, sums, products, and positive scalings of valid kernels are again valid kernels.
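
　　To make the "inner product in feature space" reading concrete, here is a minimal sketch for the polynomial kernel with $M=2$ and two-dimensional inputs. The explicit feature map `phi` is an assumption worked out for this special case only; the kernel gives the same value without ever forming the feature vectors.

```python
import math

def poly_kernel(x, y, c=1.0, M=2):
    """Polynomial kernel (x^T y + c)^M."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** M

def phi(x, c=1.0):
    """Explicit feature map for M = 2 and two-dimensional x (illustrative)."""
    x1, x2 = x
    return [
        x1 * x1,
        x2 * x2,
        math.sqrt(2.0) * x1 * x2,
        math.sqrt(2.0 * c) * x1,
        math.sqrt(2.0 * c) * x2,
        c,
    ]

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, y)                             # kernel evaluation
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))    # inner product of features
```

　　Both `lhs` and `rhs` agree up to rounding, which is the identity $(\vec{x}^T\vec{y}+c)^2=\phi(\vec{x})^T\phi(\vec{y})$ for this feature map.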

　　We can also use a generative model to define kernel functions, such as:

　　(1) $k(\vec{x}_m,\vec{x}_n)=\int p(\vec{x}_m\text{ | }\vec{z})\cdot p(\vec{x}_n\text{ | }\vec{z})\cdot p(\vec{z})\cdot d\vec{z}$, where $\vec{z}$ is a latent variable;

　　(2) $k(\vec{x}_m,\vec{x}_n)=g(\vec{\theta},\vec{x}_m)^TF^{-1}g(\vec{\theta},\vec{x}_n)$, where $g(\vec{\theta},\vec{x})=\nabla_{\vec{\theta}}\ln{p(\vec{x}\text{ | }\vec{\theta})}$ is the Fisher score,

　　　and $F=E_{\vec{x}}[g(\vec{\theta},\vec{x})g(\vec{\theta},\vec{x})^T]$ is the Fisher information matrix, in practice approximated by the sample average $F\approx\frac{1}{N}\sum_{n=1}^N g(\vec{\theta},\vec{x}_n)g(\vec{\theta},\vec{x}_n)^T$.
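
　　Both constructions can be sketched on a toy Bernoulli model. Everything specific here is an assumption chosen for illustration: a binary observation $x\in\{0,1\}$, a discrete latent variable (so the integral in (1) becomes a sum), a scalar parameter $\theta$ (so $F$ is a scalar), and the sample-average estimate of $F$.

```python
def latent_kernel(x, y, prior, params):
    """k(x, y) = sum_z p(x|z) p(y|z) p(z), with discrete z and Bernoulli p(x|z)."""
    def bern(v, mu):
        return mu if v == 1 else 1.0 - mu
    return sum(pz * bern(x, mu) * bern(y, mu) for pz, mu in zip(prior, params))

def fisher_score(x, theta):
    """d/dtheta ln p(x|theta) for p(x|theta) = theta^x (1 - theta)^(1 - x)."""
    return x / theta - (1 - x) / (1.0 - theta)

def fisher_kernel(x, y, theta, data):
    """Fisher kernel with F estimated as the sample average of squared scores."""
    F = sum(fisher_score(v, theta) ** 2 for v in data) / len(data)
    return fisher_score(x, theta) * fisher_score(y, theta) / F

# Hypothetical toy setup.
prior = [0.5, 0.5]      # p(z) for z in {0, 1}
params = [0.2, 0.9]     # p(x = 1 | z)
data = [1, 0, 1, 1]     # observations used to estimate F
theta = 0.75            # Bernoulli parameter (here the sample mean of data)

k_latent = latent_kernel(1, 0, prior, params)
k_fisher = fisher_kernel(1, 0, theta, data)
```

　　Note that the Fisher kernel can be negative for observations whose scores point in opposite directions, which is consistent with it being an inner product.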

　　A Gaussian process is a probabilistic discriminative model, which assumes that the values of $y(\vec{x})$ evaluated at any finite set of points $\{\vec{x}_1,\vec{x}_2,...,\vec{x}_N\}$ are jointly Gaussian distributed. The kernel matrix serves as the covariance matrix of this joint Gaussian.
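
　　One way to see what this assumption means is to draw a sample of function values from the zero-mean Gaussian whose covariance is the kernel matrix. The sketch below assumes scalar inputs, a Gaussian kernel, and a small diagonal jitter added for numerical stability; the helper names are hypothetical. It uses the standard identity that $L\vec{z}$ has covariance $LL^T$ when $\vec{z}$ is standard normal.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_gp_prior(xs, seed=0, jitter=1e-9):
    """Draw y ~ Gauss(0, K) at the points xs via y = L z with z ~ Gauss(0, I)."""
    rng = random.Random(seed)
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j]) + (jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical input locations
y = sample_gp_prior(xs)          # one draw of function values at xs
```

　　Nearby inputs have highly correlated values under this kernel, so each draw looks like a smooth function evaluated on the grid.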

　　Gaussian Process for Regression:

　　Typically, we choose $k(\vec{x}_m,\vec{x}_n)=\theta_0 \exp\{-\frac{\theta_1}{2}||\vec{x}_n-\vec{x}_m||^2\}+\theta_2+\theta_3 \vec{x}_m^T\vec{x}_n$, and assume that:

　　(1) prior distribution　　$p(\vec{y}_N)=Gauss(\vec{y}_N\text{ | }\vec{0},K_N)$;
　　(2) likelihood　　　　$p(\vec{t}_N\text{ | }\vec{y}_N)=Gauss(\vec{t}_N\text{ | }\vec{y}_N,\beta^{-1}I_N)$.

　　Then, we have $p(\vec{t}_N)=\int p(\vec{t}_N\text{ | }\vec{y}_N)\cdot p(\vec{y}_N)\cdot d\vec{y}_N=Gauss(\vec{t}_N\text{ | }\vec{0},K_N+\beta^{-1}I_N)$. Viewed as a function of the kernel hyperparameters $\vec{\theta}$, $p(\vec{t}_N\text{ | }\vec{\theta})$ is the marginal likelihood, and we can learn $\vec{\theta}$ by maximizing its logarithm.
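
　　The log marginal likelihood is $\ln p(\vec{t}_N)=-\frac{1}{2}\ln|C_N|-\frac{1}{2}\vec{t}_N^TC_N^{-1}\vec{t}_N-\frac{N}{2}\ln(2\pi)$ with $C_N=K_N+\beta^{-1}I_N$. A minimal sketch of evaluating it follows, assuming scalar inputs and made-up data. The crude grid search over $\theta_1$ is a stand-in chosen for brevity; a gradient-based optimizer would be the more usual choice.

```python
import math

def kernel(x, y, theta):
    """theta0 * exp(-theta1/2 * (x - y)^2) + theta2 + theta3 * x * y, scalar inputs."""
    t0, t1, t2, t3 = theta
    return t0 * math.exp(-0.5 * t1 * (x - y) ** 2) + t2 + t3 * x * y

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def log_marginal_likelihood(X, t, theta, beta):
    n = len(X)
    C = [[kernel(X[i], X[j], theta) + (1.0 / beta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(C)
    # Forward substitution L z = t, so that t^T C^{-1} t = z^T z.
    z = []
    for i in range(n):
        z.append((t[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))   # ln |C|
    quad = sum(v * v for v in z)
    return -0.5 * log_det - 0.5 * quad - 0.5 * n * math.log(2.0 * math.pi)

# Hypothetical data, roughly sin(x) with small perturbations.
X = [0.0, 0.5, 1.0, 1.5, 2.0]
t = [0.0, 0.48, 0.84, 1.0, 0.91]
candidates = [0.1, 1.0, 10.0]    # candidate values for theta1
best_theta1 = max(
    candidates,
    key=lambda th1: log_marginal_likelihood(X, t, (1.0, th1, 0.0, 0.0), beta=25.0),
)
```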

　　Also, $p(\vec{t}_{N+1})=Gauss(\vec{t}_{N+1}\text{ | }\vec{0},K_{N+1}+\beta^{-1}I_{N+1})$. Denote $\vec{k}=[k(\vec{x}_1,\vec{x}_{N+1}),k(\vec{x}_2,\vec{x}_{N+1}),...,k(\vec{x}_N,\vec{x}_{N+1})]^T$; conditioning this joint Gaussian on $\vec{t}_N$ gives the predictive distribution $p(t_{N+1}\text{ | }\vec{t}_N) = Gauss(t_{N+1}\text{ | }\vec{k}^T(K_N+\beta^{-1}I_N)^{-1}\vec{t}_N,\ k(\vec{x}_{N+1},\vec{x}_{N+1})+\beta^{-1}-\vec{k}^T(K_N+\beta^{-1}I_N)^{-1}\vec{k})$.
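
　　The predictive mean $\vec{k}^T(K_N+\beta^{-1}I_N)^{-1}\vec{t}_N$ and variance $k(\vec{x}_{N+1},\vec{x}_{N+1})+\beta^{-1}-\vec{k}^T(K_N+\beta^{-1}I_N)^{-1}\vec{k}$ can be computed with two linear solves. The sketch below assumes scalar inputs, hypothetical training data, and a small hand-written Gaussian elimination routine in place of a linear algebra library.

```python
import math

def kernel(x, y, theta):
    """theta0 * exp(-theta1/2 * (x - y)^2) + theta2 + theta3 * x * y, scalar inputs."""
    t0, t1, t2, t3 = theta
    return t0 * math.exp(-0.5 * t1 * (x - y) ** 2) + t2 + t3 * x * y

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_predict(X, t, x_new, theta, beta):
    """Mean and variance of p(t_{N+1} | t_N) at the new input x_new."""
    n = len(X)
    C = [[kernel(X[i], X[j], theta) + (1.0 / beta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k = [kernel(x, x_new, theta) for x in X]
    alpha = solve(C, t)     # (K_N + I/beta)^{-1} t_N
    v = solve(C, k)         # (K_N + I/beta)^{-1} k
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = kernel(x_new, x_new, theta) + 1.0 / beta - sum(ki * vi for ki, vi in zip(k, v))
    return mean, var

# Hypothetical training set and hyperparameters.
X = [0.0, 1.0, 2.0]
t = [0.0, 0.8, 0.9]
theta = (1.0, 4.0, 0.0, 0.0)
mean_near, var_near = gp_predict(X, t, 1.0, theta, beta=1e6)   # at a training input
mean_far, var_far = gp_predict(X, t, 10.0, theta, beta=1e6)    # far from the data
```

　　With very low noise the prediction at a training input reproduces its target with almost no variance, while far from the data the mean reverts to zero and the variance returns to roughly the prior value $\theta_0+\beta^{-1}$.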

　　Gaussian Process for Classification:

　　We make an assumption that $p(t_n=1\text{ | }a_n)=\sigma(a_n)$, where $\sigma$ is the logistic sigmoid and $a(\vec{x})$ has a Gaussian process prior, and take the following steps:

　　(1) Calculate $p(\vec{a}_N\text{ | }\vec{t}_N)$ by Laplace approximation;

　　(2) Given the Gaussian process priors $p(\vec{a}_N)$ and $p(\vec{a}_{N+1})$, $p(a_{N+1}\text{ | }\vec{a}_N)$ is a conditional Gaussian;

　　(3) $p(a_{N+1}\text{ | }\vec{t}_N)=\int p(a_{N+1}\text{ | }\vec{a}_N)\cdot p(\vec{a}_N\text{ | }\vec{t}_N)\cdot d\vec{a}_N$;

　　(4) $p(t_{N+1}=1\text{ | }\vec{t}_N)=\int \sigma(a_{N+1})\cdot p(a_{N+1}\text{ | }\vec{t}_N)\cdot da_{N+1}$.
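
　　The four steps can be sketched end to end as follows. This is a simplified illustration under several assumptions: scalar inputs, a Gaussian kernel with fixed hyperparameters, a small diagonal term `nu` added to the covariance for numerical stability, a fixed number of Newton iterations without a convergence test, and the usual probit-based approximation $\int\sigma(a)Gauss(a\text{ | }\mu,s^2)da\approx\sigma(\mu/\sqrt{1+\pi s^2/8})$ for the last integral, which has no closed form.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def kernel(x, y):
    return math.exp(-0.5 * (x - y) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_classify(X, t, x_new, nu=1e-6, iters=50):
    """Approximate p(t_{N+1} = 1 | t_N) with the Laplace approximation."""
    n = len(X)
    C = [[kernel(X[i], X[j]) + (nu if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    # Step (1): Newton updates a <- C (I + W C)^{-1} (t - sigma + W a)
    # to find the mode of p(a_N | t_N), with W = diag(sigma (1 - sigma)).
    a = [0.0] * n
    for _ in range(iters):
        s = [sigmoid(v) for v in a]
        w = [si * (1.0 - si) for si in s]
        rhs = [t[i] - s[i] + w[i] * a[i] for i in range(n)]
        A = [[(1.0 if i == j else 0.0) + w[i] * C[i][j] for j in range(n)]
             for i in range(n)]
        u = solve(A, rhs)
        a = [sum(C[i][j] * u[j] for j in range(n)) for i in range(n)]
    s = [sigmoid(v) for v in a]
    w = [si * (1.0 - si) for si in s]
    # Steps (2)-(3): Gaussian approximation to p(a_{N+1} | t_N).
    k = [kernel(x, x_new) for x in X]
    mean = sum(k[i] * (t[i] - s[i]) for i in range(n))
    B = [[C[i][j] + (1.0 / w[i] if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    var = kernel(x_new, x_new) + nu - sum(ki * vi for ki, vi in zip(k, solve(B, k)))
    # Step (4): probit-based approximation of the sigmoid-Gaussian integral.
    kappa = 1.0 / math.sqrt(1.0 + math.pi * max(var, 0.0) / 8.0)
    return sigmoid(kappa * mean)

# Hypothetical one-dimensional data: class 0 on the left, class 1 on the right.
X = [-2.0, -1.0, 1.0, 2.0]
t = [0, 0, 1, 1]
p_pos = gp_classify(X, t, 1.5)
p_neg = gp_classify(X, t, -1.5)
```

　　Because the toy data are mirror-symmetric around the origin, the two predicted probabilities should sum to one, which makes a convenient sanity check.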

References:

　　1. Bishop, Christopher M. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006

posted on 2015-06-16 09:47 DevinZ
