Bayesian linear regression
Let \(S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m\) be a training set of i.i.d. examples from some unknown distribution. The standard probabilistic interpretation of linear regression states that
\[
y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, \qquad i = 1, \dots, m,
\]
where the \(\varepsilon^{(i)}\) are i.i.d. “noise” variables with independent \(\mathcal N(0, \sigma^2)\) distributions. It follows that \(y^{(i)} - \theta^T x^{(i)} \sim \mathcal N(0, \sigma^2)\), or equivalently,
\[
P(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).
\]
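As a concrete illustration of this generative model, the following sketch (not part of the original notes; the dimension \(n\), sample size \(m\), and noise level \(\sigma\) are arbitrary choices) samples a synthetic training set:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

m, n = 50, 3                       # number of examples, input dimension (arbitrary)
sigma = 0.5                        # noise standard deviation (arbitrary)
theta_true = rng.normal(size=n)    # a fixed "true" parameter vector

# Design matrix X: each row is one input x^{(i)}.
X = rng.normal(size=(m, n))

# y^{(i)} = theta^T x^{(i)} + eps^{(i)},  with eps^{(i)} ~ N(0, sigma^2) i.i.d.
y = X @ theta_true + sigma * rng.normal(size=m)
\end{verbatim}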
In Bayesian linear regression, we assume that a prior distribution over parameters is also given; a typical choice, for instance, is \(\theta \sim \mathcal N(0, \tau^2 I)\). Using Bayes’s rule, we obtain the parameter posterior,
\begin{equation}
p(\theta \mid S) = \frac{p(\theta)\, p(S \mid \theta)}{\int_{\theta'} p(\theta')\, p(S \mid \theta')\, d\theta'}.
\label{postd}
\end{equation}
Assuming the same noise model on testing points as on training points, the “output” of Bayesian linear regression on a new test point \(x_*\) is not just a single guess “\(y_*\)”, but rather an entire probability distribution over possible outputs, known as the posterior predictive distribution:
\begin{equation}
p(y_* \mid x_*, S) = \int_{\theta} p(y_* \mid x_*, \theta)\, p(\theta \mid S)\, d\theta.
\label{ppostd}
\end{equation}
For many types of models, the integrals in (\ref{ppostd}) and (\ref{postd}) are difficult to compute, and hence we often resort to approximations, such as maximum a posteriori (MAP) estimation (see also the notes on Regularization and Model Selection).
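To make the MAP alternative concrete: for this particular model, the MAP estimate actually has a closed form, since maximizing the log-posterior under the Gaussian prior above reduces to least squares with an \(\ell_2\) penalty of weight \(\sigma^2/\tau^2\), i.e., ridge regression. A minimal sketch (the function name is illustrative):

\begin{verbatim}
import numpy as np

def map_estimate(X, y, sigma, tau):
    """MAP estimate of theta under the prior theta ~ N(0, tau^2 I).

    Maximizing the log-posterior reduces to ridge regression:
        theta_MAP = (X^T X + (sigma^2 / tau^2) I)^{-1} X^T y
    """
    n = X.shape[1]
    lam = (sigma / tau) ** 2
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
\end{verbatim}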
In the case of Bayesian linear regression, however, the integrals actually are tractable! In particular, for Bayesian linear regression, one can show (see Section 2.1.1, “The standard linear model,” at http://www.gaussianprocess.org/gpml/) that
\[
\theta \mid S \sim \mathcal N\!\left( \frac{1}{\sigma^2} A^{-1} X^T \vec{y},\; A^{-1} \right)
\]
\[
y_* \mid x_*, S \sim \mathcal N\!\left( \frac{1}{\sigma^2} x_*^T A^{-1} X^T \vec{y},\; x_*^T A^{-1} x_* + \sigma^2 \right)
\]
where \(A = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I\), \(X\) is the design matrix whose rows are the training inputs \(x^{(i)}\), and \(\vec{y}\) is the vector of training outputs. The derivation of these formulas is somewhat involved. Nonetheless, from these equations we get at least a flavor of what Bayesian models are all about: the posterior distribution over the test output \(y_*\) for a test input \(x_*\) is a Gaussian distribution; this distribution reflects the uncertainty in our predictions \(y_* = \theta^T x_* + \varepsilon_*\) arising from both the randomness in \(\varepsilon_*\) and the uncertainty in our choice of parameters \(\theta\). In contrast, classical probabilistic linear regression models estimate parameters \(\theta\) directly from the training data but provide no estimate of how reliable these learned parameters may be.
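These closed-form expressions translate directly into a few lines of linear algebra. Below is a minimal numpy sketch (not from the original notes; names are illustrative) computing the posterior over \(\theta\) and the predictive mean and variance at a test point:

\begin{verbatim}
import numpy as np

def bayesian_linear_regression(X, y, x_star, sigma, tau):
    """Closed-form Bayesian linear regression.

    Posterior:   theta | S    ~ N( (1/sigma^2) A^{-1} X^T y,  A^{-1} )
    Predictive:  y_* | x_*, S ~ N( (1/sigma^2) x_*^T A^{-1} X^T y,
                                   x_*^T A^{-1} x_* + sigma^2 )
    with A = (1/sigma^2) X^T X + (1/tau^2) I.
    """
    n = X.shape[1]
    A = X.T @ X / sigma**2 + np.eye(n) / tau**2
    A_inv = np.linalg.inv(A)   # fine for small n; prefer solve() at scale

    theta_mean = A_inv @ X.T @ y / sigma**2          # posterior mean of theta
    pred_mean = x_star @ theta_mean                  # predictive mean at x_*
    pred_var = x_star @ A_inv @ x_star + sigma**2    # predictive variance
    return theta_mean, A_inv, pred_mean, pred_var
\end{verbatim}

Note how the predictive variance \(x_*^T A^{-1} x_* + \sigma^2\) decomposes into two terms, mirroring the two sources of uncertainty described above: the first reflects our remaining uncertainty about \(\theta\), and the second the irreducible noise \(\varepsilon_*\).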