Generalized Linear Models

1. Overview

Two components of a linear model

  • Random component: the response variable \(Y|X\) is continuous and normally distributed with mean \(\mu = \mu(X) = \mathbb{E}(Y|X)\)

  • Link: relates the random component to the covariates

\[X= \left(X^{(1)}, X^{(2)}, \cdots, X^{(p)} \right)^{\top} \quad : \quad \mu(X) = X^\top \beta \]

A generalized linear model (GLM) generalizes normal linear regression models in the following directions.

  • Random component:

\[Y \sim \text{some exponential family distribution} \]

  • Link: relates the random component to the covariates:

\[g \left(\mu(X) \right) = X^{\top} \beta \]

where \(g\) is called the link function and \(\mu = \mathbb{E}(Y|X)\).

2. Exponential Family Distribution

2.1 Definition

Exponential Family: A family of distributions \(\{P_{\theta} : \theta \in \Theta \}\), \(\Theta \subseteq \mathbb{R}^k\), is said to be a \(k\)-parameter exponential family on \(\mathbb{R}^q\) if there exist real-valued functions:

  • \(\eta_1, \eta_2, \cdots, \eta_k\) and \(B\) of \(\theta\),

  • \(T_1, T_2, \cdots, T_k\), and \(h\) of \(x \in \mathbb{R}^q\) such that the density
    function (PMF or PDF) of \(P_\theta\) can be written as

\[p_\theta(x)=\exp \left[\sum_{i=1}^k \eta_i(\theta) T_i(x)-B(\theta)\right] h(x) \]

Example

Normal distribution

Consider \(X \sim \mathcal{N} (\mu, \sigma^2), \theta=(\mu, \sigma^2)\). The density is

\[p_\theta(x)=\exp \left(\frac{\mu}{\sigma^2} x-\frac{1}{2 \sigma^2} x^2-\frac{\mu^2}{2 \sigma^2}\right) \frac{1}{\sigma \sqrt{2 \pi}}, \]

which forms a two-parameter exponential family with

\[\begin{aligned} &\eta_1 = \frac{\mu}{\sigma^2}, && \eta_2 = -\frac{1}{2 \sigma^2} \\ &T_1(x) = x, && T_2(x) = x^2 \\ &B(\theta)=\frac{\mu^2}{2 \sigma^2} + \log \left(\sigma \sqrt{2 \pi} \right), && h(x)=1 \end{aligned} \]

  • When \(\sigma^2\) is known, it becomes a one-parameter exponential family on \(\mathbb{R}\) :

\[\eta=\frac{\mu}{\sigma^2}, \quad T(x)=x, \quad B(\theta)=\frac{\mu^2}{2 \sigma^2}, \quad h(x)=\frac{e^{-\frac{x^2}{2 \sigma^2}}}{\sigma \sqrt{2 \pi}} \]
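As a quick numeric sanity check (an illustrative sketch of my own, not part of the notes), the two-parameter exponential-family form above reproduces the usual normal density:

```python
import math

def normal_pdf(x, mu, sigma2):
    # standard N(mu, sigma^2) density
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def expfam_pdf(x, mu, sigma2):
    # exponential-family form: exp[eta_1 T_1(x) + eta_2 T_2(x) - B(theta)] h(x)
    eta1 = mu / sigma2
    eta2 = -1.0 / (2 * sigma2)
    B = mu ** 2 / (2 * sigma2) + math.log(math.sqrt(2 * math.pi * sigma2))
    return math.exp(eta1 * x + eta2 * x ** 2 - B)  # h(x) = 1
```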

Discrete distributions

  • Bernoulli distribution

  • Poisson distribution

Continuous distributions

  • Gamma distributions

  • Inverse Gamma distributions

  • Inverse Gaussian distributions

  • Others: Chi-square, Beta, Binomial, Negative binomial distributions.

3. One-parameter canonical exponential family

3.1 Definition

Canonical exponential family for \(k=1\), \(y\in \mathbb{R}\)

\[f_\theta(y)=\exp \left( \frac{y \theta-b(\theta)}{\phi}+c(y, \phi) \right) \]

for some known functions \(b(\cdot)\) and \(c(\cdot, \cdot)\)

  • If \(\phi\) is known, this is a one-parameter exponential family with \(\theta\) being the canonical parameter.

  • If \(\phi\) is unknown, this may or may not be a two-parameter exponential family; \(\phi\) is called the dispersion parameter.

  • In this class, we always assume that \(\phi\) is known.

Example: Normal distribution

Consider the following Normal density function with known variance \(\sigma^2\)

\[\begin{aligned} f_\theta(y) & =\frac{1}{\sigma \sqrt{2 \pi}} \exp \left( {-\frac{(y-\mu)^2}{2 \sigma^2}} \right) \\ & =\exp \left\{\frac{y \mu-\frac{1}{2} \mu^2}{\sigma^2}-\frac{1}{2}\left(\frac{y^2}{\sigma^2}+\log \left(2 \pi \sigma^2\right)\right)\right\} \end{aligned} \]

Therefore \(\theta=\mu, \phi=\sigma^2, b(\theta) = \frac{\theta^2}{2}\), and

\[c(y, \phi) = -\frac{1}{2}\left(\frac{y^2}{\phi}+\log (2 \pi \phi)\right) \]

Example: Poisson distribution

\[f(y | \mu)=\frac{\mu^y}{y !} \exp (-\mu ) = \exp \left[ {y \log \mu-\mu-\log (y !)} \right] \]
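A quick numeric check (a sketch of my own): writing \(\theta = \log \mu\), \(b(\theta) = e^\theta\), \(\phi = 1\), and \(c(y, \phi) = -\log(y!)\) recovers the Poisson pmf exactly:

```python
import math

def poisson_pmf(y, mu):
    # standard Poisson pmf: mu^y e^{-mu} / y!
    return mu ** y * math.exp(-mu) / math.factorial(y)

def canonical_pmf(y, mu):
    # canonical exponential-family form: exp[(y*theta - b(theta))/phi + c(y, phi)]
    theta = math.log(mu)                  # canonical parameter
    b = math.exp(theta)                   # b(theta) = e^theta = mu
    c = -math.log(math.factorial(y))      # c(y, phi) with phi = 1
    return math.exp(y * theta - b + c)
```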

Other distributions

\[\begin{array}{c|ccc} \hline & \text { Normal } & \text { Poisson } & \text { Bernoulli } \\ \hline \text { Notation } & \mathcal{N}\left(\mu, \sigma^2\right) & \mathcal{P}(\mu) & \mathcal{B}(p) \\ \text { Range of } y & (-\infty, \infty) & \{0,1,2,\ldots\} & \{0,1\} \\ \phi & \sigma^2 & 1 & 1 \\ b(\theta) & \frac{\theta^2}{2} & e^\theta & \log \left(1+e^\theta\right) \\ c(y, \phi) & -\frac{1}{2}\left(\frac{y^2}{\phi}+\log (2 \pi \phi)\right) & -\log (y!) & 0 \\ \hline \end{array} \]

3.2 Likelihood

Let \(\ell(\theta)=\log f_\theta(Y)\) denote the log-likelihood function.

The mean \(\mathbb{E}(Y)\) and the variance \(\operatorname{var}(Y)\) can be derived from the following identities

  • First identity

\[\mathbb{E}\left(\frac{\partial \ell}{\partial \theta}\right)=0 \]

  • Second identity

\[\mathbb{E}\left(\frac{\partial^2 \ell}{\partial \theta^2}\right)+\mathbb{E}\left[\left(\frac{\partial \ell}{\partial \theta}\right)^2\right]=0 \]

Both identities are obtained by differentiating \(\int f_\theta(y)\, d y \equiv 1\) with respect to \(\theta\) (once for the first, twice for the second).

3.2.1 Expected value

Note that

\[\ell(\theta)=\frac{Y \theta-b(\theta)}{\phi}+c(Y, \phi) \]

Therefore

\[\frac{\partial \ell}{\partial \theta}=\frac{Y-b^{\prime}(\theta)}{\phi} \]

It yields

\[0=\mathbb{E}\left(\frac{\partial \ell}{\partial \theta}\right) =\frac{\mathbb{E}(Y)- b^{\prime} (\theta) }{\phi} \]

which leads to

\[\mathbb{E}(Y)=\mu=b^{\prime}(\theta) \]

3.2.2 Variance

On the other hand, we have

\[\frac{\partial^2 \ell}{\partial \theta^2}+\left(\frac{\partial \ell}{\partial \theta}\right)^2=-\frac{b^{\prime \prime}(\theta)}{\phi}+\left(\frac{Y-b^{\prime}(\theta)}{\phi}\right)^2 \]

and from the previous result,

\[\frac{Y-b^{\prime}(\theta)}{\phi}=\frac{Y-\mathbb{E}(Y)}{\phi} \]

Together, with the second identity, this yields

\[0=-\frac{b^{\prime \prime}(\theta)}{\phi}+\frac{\operatorname{var}(Y)}{\phi^2} \]

which leads to

\[\operatorname{var}(Y)=b^{\prime \prime}(\theta)\, \phi \]
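These two formulas can be sanity-checked numerically. A sketch for the Bernoulli case (\(\phi = 1\), \(b(\theta)=\log(1+e^\theta)\)), approximating the derivatives of \(b\) by central differences (function names are my own):

```python
import math

def b(theta):
    # Bernoulli cumulant function: b(theta) = log(1 + e^theta)
    return math.log(1 + math.exp(theta))

def deriv(f, x, h=1e-4):
    # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

theta = 0.4
p = 1 / (1 + math.exp(-theta))              # mean of Bernoulli with canonical parameter theta
mean = deriv(b, theta)                      # approximates E(Y) = b'(theta) = p
var = deriv(lambda t: deriv(b, t), theta)   # approximates var(Y) = b''(theta) * phi, phi = 1
```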

3.3 Link Function

3.3.1 Definition

\(\beta\) is the parameter of interest, and needs to appear somehow in the likelihood function to use maximum likelihood.

A link function \(g\) relates the linear predictor \(X^{\top} \beta\) to the mean parameter \(\mu\),

\[X^{\top} \beta=g(\mu) \]

\(g\) is required to be monotone increasing and differentiable, hence invertible:

\[\mu=g^{-1}\left(X^{\top} \beta\right) \]

The function \(g\) that links the mean \(\mu\) to the canonical parameter \(\theta\) is called Canonical Link:

\[g(\mu)=\theta \]

Since \(\mu=b^{\prime}(\theta)\), the canonical link is given by

\[g(\mu)=\left(b^{\prime}\right)^{-1}(\mu) \]

If \(\phi>0\), the canonical link function is strictly increasing: since \(b^{\prime \prime}(\theta)=\operatorname{var}(Y)/\phi>0\) for non-degenerate \(Y\), \(b^{\prime}\) is strictly increasing, and so is its inverse \((b^{\prime})^{-1}\).

Example: the Bernoulli distribution

Here \(b(\theta)=\log \left(1+e^{\theta}\right)\), so \(\mu=b^{\prime}(\theta)=\frac{e^{\theta}}{1+e^{\theta}}\); inverting yields the logit link

\[g(\mu)=\log \frac{\mu}{1-\mu} \]

Other examples

\[\begin{array}{c|cc} \hline & b(\theta) & g(\mu) \\ \hline \text { Normal } & \theta^2 / 2 & \mu \\ \text { Poisson } & \exp (\theta) & \log \mu \\ \text { Bernoulli } & \log \left(1+e^\theta\right) & \log \frac{\mu}{1-\mu} \\ \text { Gamma } & -\log (-\theta) & -\frac{1}{\mu} \\ \hline \end{array} \]
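As a quick check of the Bernoulli row (an illustrative sketch, not part of the notes): the canonical link \(g=(b')^{-1}\) is exactly the logit, i.e. \(b'(g(\mu)) = \mu\):

```python
import math

def b_prime(theta):
    # mean function for Bernoulli: mu = b'(theta) = e^theta / (1 + e^theta)
    return 1 / (1 + math.exp(-theta))

def logit(mu):
    # canonical link g(mu) from the table
    return math.log(mu / (1 - mu))

# g inverts b': composing them returns mu unchanged
```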

3.4 Model and notation

Let \((X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}, i=1, \ldots, n\) be independent random pairs such that the conditional distribution of \(Y_i\) given \(X_i=x_i\) has density in the canonical exponential family:

\[f_{\theta_i} (y_i)=\exp \left[ \frac{y_i \theta_i-b\left(\theta_i\right)}{\phi}+c\left(y_i, \phi\right)\right] \]

and

\[\mathbf{Y}=\left(Y_1, \ldots, Y_n\right)^{\top}, \quad \mathbf{X}=\left(X_1^{\top}, \ldots, X_n^{\top}\right)^{\top} \]

Here the mean \(\mu_i\) is related to the canonical parameter \(\theta_i\) via

\[\mu_i=b^{\prime}\left(\theta_i\right) \]

and the transformed mean \(g(\mu_i)\) depends linearly on the covariates through the link function \(g\):

\[g\left(\mu_i\right)=X_i^{\top} \beta . \]

Given a link function \(g\), note the following relationship between \(\beta\) and \(\theta\) :

\[\begin{aligned} \theta_i & =\left(b^{\prime}\right)^{-1}\left(\mu_i\right) \\ & =\left(b^{\prime}\right)^{-1}\left(g^{-1}\left(X_i^{\top} \beta\right)\right) \equiv h\left(X_i^{\top} \beta\right) \end{aligned} \]

where function \(h\) is defined as

\[h=\left(b^{\prime}\right)^{-1} \circ g^{-1}=\left(g \circ b^{\prime}\right)^{-1} \]

Remark: if \(g\) is the canonical link function, then \(h\) is the identity.

The log-likelihood is given by

\[\begin{aligned} \ell_n(\beta ; \mathbf{Y}, \mathbf{X}) & =\sum_i \frac{Y_i \theta_i-b\left(\theta_i\right)}{\phi} \\ & =\sum_i \frac{Y_i h\left(X_i^{\top} \beta\right)-b\left(h\left(X_i^{\top} \beta\right)\right)}{\phi} \end{aligned} \]

up to a constant term.

Note that when we use the canonical link function, we obtain the simpler expression

\[\ell_n(\beta, \phi ; \mathbf{Y}, \mathbf{X})=\sum_i \frac{Y_i X_i^{\top} \beta-b\left(X_i^{\top} \beta\right)}{\phi} \]
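For the Bernoulli family (logistic regression, \(b(\theta)=\log(1+e^\theta)\), \(\phi=1\)) this expression can be coded directly; a minimal numpy sketch (function and variable names are my own):

```python
import numpy as np

def bernoulli_loglik(beta, X, y):
    # canonical-link log-likelihood: sum_i [ y_i x_i'beta - log(1 + e^{x_i'beta}) ], phi = 1
    eta = X @ beta                       # linear predictor X beta
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))
```

At \(\beta = 0\) every \(\theta_i = 0\), so the log-likelihood is \(-n \log 2\) regardless of the data, which makes a convenient spot check.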

When the canonical link is used and \(\phi>0\), the log-likelihood \(\ell_n(\beta)\) is strictly concave in \(\beta\) (provided \(\mathbf{X}\) has full column rank).

As a consequence the maximum likelihood estimator is unique.

On the other hand, if another parameterization is used, the likelihood function may not be strictly concave leading to several local maxima.

4. Optimization Methods

  • Newton-Raphson Method

  • Fisher-scoring Method

  • Iteratively Re-weighted Least Squares

4.1 Newton-Raphson Method

Newton-Raphson maximizes the log-likelihood by iterating

\[\theta^{(k+1)}=\theta^{(k)}-H_{\ell_n}\left(\theta^{(k)}\right)^{-1} \nabla \ell_n\left(\theta^{(k)}\right) \]

where \(H_{\ell_n}\) denotes the Hessian of the log-likelihood.
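Applied to logistic regression (Bernoulli with canonical link), the Newton-Raphson update \(\beta \leftarrow \beta - H^{-1} \nabla \ell_n\) takes a simple form, since \(\nabla \ell_n = \mathbf{X}^{\top}(\mathbf{Y}-p)\) and \(H = -\mathbf{X}^{\top} W \mathbf{X}\) with \(W=\operatorname{diag}\left(p_i(1-p_i)\right)\). A minimal numpy sketch (names are my own):

```python
import numpy as np

def newton_raphson_logistic(X, y, n_iter=25):
    """Newton-Raphson for logistic regression (canonical link) -- a minimal sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))       # fitted means mu_i = g^{-1}(x_i'beta)
        grad = X.T @ (y - p)                  # score X'(y - p)
        H = -(X.T * (p * (1 - p))) @ X        # Hessian -X'WX, W = diag(p(1-p))
        beta = beta - np.linalg.solve(H, grad)
    return beta
```

At the maximum the score vanishes, so \(\mathbf{X}^{\top}(\mathbf{Y}-p)\) evaluated at the returned \(\beta\) should be numerically zero.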

4.2 Fisher-scoring method

Newton-Raphson is a generic optimization method: it works for a deterministic objective and does not have to involve random data.

Sometimes, calculation of the Hessian matrix is quite complicated (we will see an example)

Goal: exploit directly the fact that maximizing the likelihood amounts to minimizing a KL divergence, since

\[\mathrm{KL} = -\,\mathbb{E} [\text{log-likelihood}] + \text{const} \]

Idea: replace the Hessian with its expected value. Recall that

\[\mathbb{E}_{\theta} ( H_{\ell_n} (\theta) ) = - I (\theta) \]

where \(I(\theta)\) is the Fisher information matrix.

The Fisher Information matrix is positive definite, and can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update:

\[\theta^{(k+1)}=\theta^{(k)}+I\left(\theta^{(k)}\right)^{-1} \nabla \ell_n\left(\theta^{(k)}\right) \]

It has essentially the same convergence properties as Newton-Raphson, but \(I\) is often easier to compute than \(H_{\ell_n}\). (For a GLM with the canonical link, the Hessian does not depend on \(\mathbf{Y}\), so the two methods coincide.)
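As a toy illustration (setup and names are mine): for an i.i.d. Poisson sample the score is \(\sum_i y_i / \mu - n\) and the Fisher information is \(I(\mu) = n / \mu\), so the Fisher-scoring update lands on the MLE \(\bar{y}\) in a single step:

```python
import numpy as np

def fisher_scoring_poisson_mean(y, mu0=1.0, n_iter=5):
    """Fisher scoring for the mean of an i.i.d. Poisson sample (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu = mu0
    for _ in range(n_iter):
        score = y.sum() / mu - n           # d log-likelihood / d mu
        info = n / mu                      # Fisher information I(mu)
        mu = mu + score / info             # mu + I(mu)^{-1} * score
    return mu
```

Indeed, \(\mu + \frac{\mu}{n}\left(\frac{\sum_i y_i}{\mu} - n\right) = \bar{y}\), so every starting point \(\mu_0 > 0\) converges in one iteration.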

References

Wikipedia, 2023, "Generalized linear model"

MIT Lecture Notes, "Chapter 10: Generalized Linear Models (GLMs)" in Statistics for Applications

posted @ 2023-05-14 15:30 veager