Bayesian linear regression
Let \(S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m\) be a training set of i.i.d. examples from some unknown distribution. The standard probabilistic interpretation of linear regression states that
\[
y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, \qquad i = 1, \dots, m,
\]
where the \(\varepsilon^{(i)}\) are i.i.d. “noise” variables with independent \(\mathcal N(0, \sigma^2)\) distributions. It follows that \(y^{(i)} - \theta^T x^{(i)} \sim \mathcal N(0, \sigma^2)\), or equivalently,
\[
P(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).
\]
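As a concrete illustration of this generative model, the following sketch (not part of the original notes; the dimension \(n\), sample size \(m\), and noise level \(\sigma\) are arbitrary choices) samples a synthetic training set:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

m, n = 50, 3                       # number of examples, input dimension (arbitrary)
sigma = 0.5                        # noise standard deviation (arbitrary)
theta_true = rng.normal(size=n)    # a fixed "true" parameter vector

# Design matrix X: each row is one input x^{(i)}.
X = rng.normal(size=(m, n))

# y^{(i)} = theta^T x^{(i)} + eps^{(i)},  with eps^{(i)} ~ N(0, sigma^2) i.i.d.
y = X @ theta_true + sigma * rng.normal(size=m)
\end{verbatim}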
In Bayesian linear regression, we assume that a prior distribution over parameters is also given; a typical choice, for instance, is \(\theta \sim \mathcal N(0, \tau^2 I)\). Using Bayes’s rule, we obtain the parameter posterior,
\begin{equation}
p(\theta \mid S) = \frac{p(\theta)\, p(S \mid \theta)}{\int_{\theta'} p(\theta')\, p(S \mid \theta')\, d\theta'}.
\label{postd}
\end{equation}
Assuming the same noise model on testing points as on training points, the “output” of Bayesian linear regression on a new test point \(x_*\) is not just a single guess “\(y_*\)”, but rather an entire probability distribution over possible outputs, known as the posterior predictive distribution:
\begin{equation}
p(y_* \mid x_*, S) = \int_{\theta} p(y_* \mid x_*, \theta)\, p(\theta \mid S)\, d\theta.
\label{ppostd}
\end{equation}
For many types of models, the integrals in (\ref{ppostd}) and (\ref{postd}) are difficult to compute, and hence we often resort to approximations, such as maximum a posteriori (MAP) estimation (see also the notes on Regularization and Model Selection).
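To make the MAP alternative concrete: for this particular model, the MAP estimate actually has a closed form, since maximizing the log-posterior under the Gaussian prior above reduces to least squares with an \(\ell_2\) penalty of weight \(\sigma^2/\tau^2\), i.e., ridge regression. A minimal sketch (the function name is illustrative):

\begin{verbatim}
import numpy as np

def map_estimate(X, y, sigma, tau):
    """MAP estimate of theta under the prior theta ~ N(0, tau^2 I).

    Maximizing the log-posterior reduces to ridge regression:
        theta_MAP = (X^T X + (sigma^2 / tau^2) I)^{-1} X^T y
    """
    n = X.shape[1]
    lam = (sigma / tau) ** 2
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
\end{verbatim}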
In the case of Bayesian linear regression, however, the integrals actually are tractable! In particular, for Bayesian linear regression, one can show (see Section 2.1.1, “The standard linear model,” at http://www.gaussianprocess.org/gpml/) that
\[
\theta \mid S \sim \mathcal N\!\left( \frac{1}{\sigma^2} A^{-1} X^T \vec{y},\; A^{-1} \right)
\]
\[
y_* \mid x_*, S \sim \mathcal N\!\left( \frac{1}{\sigma^2} x_*^T A^{-1} X^T \vec{y},\; x_*^T A^{-1} x_* + \sigma^2 \right)
\]
where \(A = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I\), \(X\) is the design matrix whose rows are the training inputs \(x^{(i)}\), and \(\vec{y}\) is the vector of training outputs. The derivation of these formulas is somewhat involved. Nonetheless, from these equations we get at least a flavor of what Bayesian models are all about: the posterior distribution over the test output \(y_*\) for a test input \(x_*\) is a Gaussian distribution; this distribution reflects the uncertainty in our predictions \(y_* = \theta^T x_* + \varepsilon_*\) arising from both the randomness in \(\varepsilon_*\) and the uncertainty in our choice of parameters \(\theta\). In contrast, classical probabilistic linear regression models estimate parameters \(\theta\) directly from the training data but provide no estimate of how reliable these learned parameters may be.
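These closed-form expressions translate directly into a few lines of linear algebra. Below is a minimal numpy sketch (not from the original notes; names are illustrative) computing the posterior over \(\theta\) and the predictive mean and variance at a test point:

\begin{verbatim}
import numpy as np

def bayesian_linear_regression(X, y, x_star, sigma, tau):
    """Closed-form Bayesian linear regression.

    Posterior:   theta | S    ~ N( (1/sigma^2) A^{-1} X^T y,  A^{-1} )
    Predictive:  y_* | x_*, S ~ N( (1/sigma^2) x_*^T A^{-1} X^T y,
                                   x_*^T A^{-1} x_* + sigma^2 )
    with A = (1/sigma^2) X^T X + (1/tau^2) I.
    """
    n = X.shape[1]
    A = X.T @ X / sigma**2 + np.eye(n) / tau**2
    A_inv = np.linalg.inv(A)   # fine for small n; prefer solve() at scale

    theta_mean = A_inv @ X.T @ y / sigma**2          # posterior mean of theta
    pred_mean = x_star @ theta_mean                  # predictive mean at x_*
    pred_var = x_star @ A_inv @ x_star + sigma**2    # predictive variance
    return theta_mean, A_inv, pred_mean, pred_var
\end{verbatim}

Note how the predictive variance \(x_*^T A^{-1} x_* + \sigma^2\) decomposes into two terms, mirroring the two sources of uncertainty described above: the first reflects our remaining uncertainty about \(\theta\), and the second the irreducible noise \(\varepsilon_*\).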