Generalized Linear Models
1. Overview
Two components of a linear model
- Random component: the response variable \(Y|X\) is continuous and normally distributed with mean \(\mu = \mu(X) = \mathbb{E}(Y|X)\)
- Link: between the random component and the covariates: \(\mu(X) = X^{\top}\beta\)
A generalized linear model (GLM) generalizes normal linear regression models in the following directions.
- Random component: the response variable \(Y|X\) has a distribution in the exponential family
- Link: between the random component and the covariates: \(g(\mu(X)) = X^{\top}\beta\),
where \(g\) is called the link function and \(\mu = \mathbb{E}(Y|X)\)
2. Exponential Family Distribution
2.1 Definition
Exponential family: A family of distributions \(\{P_{\theta} : \theta \in \Theta \}\), \(\Theta \subset \mathbb{R}^k\), is said to be a \(k\)-parameter exponential family on \(\mathbb{R}^q\) if there exist real-valued functions:
- \(\eta_1, \eta_2, \cdots, \eta_k\) and \(B\) of \(\theta\),
- \(T_1, T_2, \cdots, T_k\), and \(h\) of \(x \in \mathbb{R}^q\)
such that the density function (PMF or PDF) of \(P_\theta\) can be written as
\[
f_\theta(x) = h(x)\exp\left( \sum_{i=1}^{k} \eta_i(\theta)\, T_i(x) - B(\theta) \right).
\]
Example: Normal distribution
Consider \(X \sim \mathcal{N}(\mu, \sigma^2)\), \(\theta = (\mu, \sigma^2)\). The density is
\[
f_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
= \frac{1}{\sqrt{2\pi}} \exp\left( \frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma \right),
\]
which forms a two-parameter exponential family with
\[
\eta_1 = \frac{\mu}{\sigma^2},\quad \eta_2 = -\frac{1}{2\sigma^2},\quad T_1(x) = x,\quad T_2(x) = x^2,\quad B(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma,\quad h(x) = \frac{1}{\sqrt{2\pi}}.
\]
- When \(\sigma^2\) is known, it becomes a one-parameter exponential family on \(\mathbb{R}\) with
\[
\eta = \frac{\mu}{\sigma^2},\quad T(x) = x,\quad B(\theta) = \frac{\mu^2}{2\sigma^2},\quad h(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}.
\]
Discrete distributions
- Bernoulli distribution
- Poisson distribution
Continuous distributions
- Gamma distributions
- Inverse Gamma distributions
- Inverse Gaussian distributions
- Others: Chi-square, Beta, Binomial, Negative binomial distributions.
3. One-parameter canonical exponential family
3.1 Definition
Canonical exponential family for \(k=1\), \(y \in \mathbb{R}\):
\[
f_\theta(y) = \exp\left( \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right)
\]
for some known functions \(b(\cdot)\) and \(c(\cdot, \cdot)\).
- If \(\phi\) is known, this is a one-parameter exponential family with \(\theta\) being the canonical parameter.
- If \(\phi\) is unknown, this may or may not be a two-parameter exponential family; \(\phi\) is called the dispersion parameter.
- In this class, we always assume that \(\phi\) is known.
Example: Normal distribution
Consider the following Normal density function with known variance \(\sigma^2\):
\[
f_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y-\mu)^2}{2\sigma^2} \right)
= \exp\left( \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right).
\]
Therefore \(\theta = \mu\), \(\phi = \sigma^2\), \(b(\theta) = \frac{\theta^2}{2}\), and
\[
c(y, \phi) = -\frac{y^2}{2\phi} - \frac{1}{2}\log(2\pi\phi).
\]
Example: Poisson distribution
Consider \(Y \sim \text{Poisson}(\lambda)\) with PMF
\[
f(y) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp\left( y \log\lambda - \lambda - \log y! \right), \quad y = 0, 1, 2, \ldots,
\]
so \(\theta = \log\lambda\), \(\phi = 1\), \(b(\theta) = e^{\theta}\), and \(c(y, \phi) = -\log y!\).
Other distributions
3.2 Likelihood
Let \(\ell(\theta)=\log f_\theta(Y)\) denote the log-likelihood function.
The mean \(\mathbb{E}(Y)\) and the variance \(\operatorname{var}(Y)\) can be derived from the following identities
- First identity: \(\mathbb{E}\left( \frac{\partial \ell}{\partial \theta} \right) = 0\)
- Second identity: \(\mathbb{E}\left( \frac{\partial^2 \ell}{\partial \theta^2} \right) + \mathbb{E}\left( \left(\frac{\partial \ell}{\partial \theta}\right)^2 \right) = 0\)
Both are obtained by differentiating \(\int f_\theta(y)\, dy \equiv 1\) with respect to \(\theta\) (once and twice, respectively).
3.2.1 Expected value
Note that
\[
\ell(\theta) = \frac{Y\theta - b(\theta)}{\phi} + c(Y, \phi).
\]
Therefore
\[
\frac{\partial \ell}{\partial \theta} = \frac{Y - b^{\prime}(\theta)}{\phi}.
\]
It yields
\[
0 = \mathbb{E}\left( \frac{\partial \ell}{\partial \theta} \right) = \frac{\mathbb{E}(Y) - b^{\prime}(\theta)}{\phi},
\]
which leads to
\[
\mathbb{E}(Y) = b^{\prime}(\theta).
\]
3.2.2 Variance
On the other hand, we have
\[
\frac{\partial^2 \ell}{\partial \theta^2} = -\frac{b^{\prime\prime}(\theta)}{\phi},
\]
and from the previous result,
\[
\mathbb{E}\left( \left(\frac{\partial \ell}{\partial \theta}\right)^2 \right) = \mathbb{E}\left( \frac{(Y - b^{\prime}(\theta))^2}{\phi^2} \right) = \frac{\operatorname{var}(Y)}{\phi^2}.
\]
Together with the second identity, this yields
\[
-\frac{b^{\prime\prime}(\theta)}{\phi} + \frac{\operatorname{var}(Y)}{\phi^2} = 0,
\]
which leads to
\[
\operatorname{var}(Y) = \phi\, b^{\prime\prime}(\theta).
\]
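As a quick numeric sanity check (not part of the notes), the two moment formulas can be verified by simulation for the Poisson case, where \(b(\theta)=e^\theta\) and \(\phi=1\), so both the mean and the variance should equal \(e^\theta\):

```python
import numpy as np

# Check E(Y) = b'(theta) and var(Y) = phi * b''(theta) for Poisson:
# b(theta) = e^theta, phi = 1, so both moments equal e^theta.
rng = np.random.default_rng(0)
theta = 0.7                         # canonical parameter (arbitrary choice)
lam = np.exp(theta)                 # b'(theta): the Poisson rate
y = rng.poisson(lam, size=200_000)

assert abs(y.mean() - lam) < 0.05   # sample mean ~ b'(theta)
assert abs(y.var() - lam) < 0.05    # sample variance ~ phi * b''(theta)
```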
3.3 Link function
3.3.1 Definition
\(\beta\) is the parameter of interest; it must appear in the likelihood function so that maximum likelihood estimation can be used.
A link function \(g\) relates the linear predictor \(X^{\top} \beta\) to the mean parameter \(\mu\):
\[
g(\mu) = X^{\top} \beta.
\]
\(g\) is required to be monotone increasing and differentiable.
Examples of link functions:
- Identity link \(g(\mu) = \mu\) (Normal linear regression)
- Log link \(g(\mu) = \log \mu\) (Poisson regression)
- Logit link \(g(\mu) = \log \frac{\mu}{1-\mu}\) (Bernoulli/logistic regression)
3.3.2 Canonical Link
The function \(g\) that links the mean \(\mu\) to the canonical parameter \(\theta\) is called the canonical link:
\[
g(\mu) = \theta.
\]
Since \(\mu = b^{\prime}(\theta)\), the canonical link is given by
\[
g(\mu) = (b^{\prime})^{-1}(\mu).
\]
If \(\phi > 0\), the canonical link function is strictly increasing. (Why? Because \(b^{\prime\prime}(\theta) = \operatorname{var}(Y)/\phi > 0\), so \(b^{\prime}\) is strictly increasing, and hence so is its inverse.)
Example: the Bernoulli distribution
For \(Y \sim \text{Bernoulli}(p)\), \(b(\theta) = \log(1 + e^{\theta})\), so \(\mu = b^{\prime}(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}\), and the canonical link is the logit function
\[
g(\mu) = \log \frac{\mu}{1 - \mu}.
\]
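The Bernoulli computation can be sketched numerically: \(b^{\prime}\) is the sigmoid, the canonical link \((b^{\prime})^{-1}\) is the logit, and composing the two recovers \(\theta\):

```python
import numpy as np

# For the Bernoulli family, b(theta) = log(1 + e^theta), so mu = b'(theta)
# is the sigmoid, and the canonical link g = (b')^{-1} is the logit.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(mu):
    return np.log(mu / (1.0 - mu))

theta = np.linspace(-3.0, 3.0, 7)
mu = sigmoid(theta)                    # mean parameter mu = b'(theta)
assert np.allclose(logit(mu), theta)   # canonical link inverts b'
```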
Other examples
3.4 Model and notation
Let \((X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}\), \(i = 1, \ldots, n\), be independent random pairs such that the conditional distribution of \(Y_i\) given \(X_i = x_i\) has density in the canonical exponential family:
\[
f_{\theta_i}(y_i) = \exp\left( \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right),
\]
and
\[
\mu_i = \mathbb{E}(Y_i \mid X_i = x_i).
\]
Here the mean \(\mu_i\) is related to the canonical parameter \(\theta_i\) via
\[
\mu_i = b^{\prime}(\theta_i),
\]
and \(\mu_i\) depends linearly on the covariates through a link function \(g\):
\[
g(\mu_i) = x_i^{\top} \beta.
\]
Given a link function \(g\), note the following relationship between \(\beta\) and \(\theta_i\):
\[
\theta_i = h(x_i^{\top} \beta),
\]
where the function \(h\) is defined as
\[
h = (b^{\prime})^{-1} \circ g^{-1}.
\]
Remark: if \(g\) is the canonical link function, then \(h\) is the identity.
The log-likelihood is given by
\[
\ell_n(\beta) = \sum_{i=1}^{n} \frac{Y_i \theta_i - b(\theta_i)}{\phi} = \sum_{i=1}^{n} \frac{Y_i\, h(X_i^{\top}\beta) - b\left(h(X_i^{\top}\beta)\right)}{\phi}
\]
up to a constant term.
Note that when we use the canonical link function, we obtain the simpler expression
\[
\ell_n(\beta) = \sum_{i=1}^{n} \frac{Y_i X_i^{\top}\beta - b(X_i^{\top}\beta)}{\phi}.
\]
Under the canonical link, the log-likelihood \(\ell_n(\beta)\) is strictly concave when \(\phi > 0\) (provided the design matrix has full column rank).
As a consequence the maximum likelihood estimator is unique.
On the other hand, if another parameterization is used, the likelihood function may not be strictly concave, which can lead to several local maxima.
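The concavity claim can be illustrated numerically for the Bernoulli case with made-up data: under the canonical link the Hessian is \(-\frac{1}{\phi}\sum_i b^{\prime\prime}(x_i^{\top}\beta)\, x_i x_i^{\top}\), which is negative semidefinite at every \(\beta\):

```python
import numpy as np

# Illustration (random data): for a Bernoulli GLM with the canonical link,
# the Hessian H = -sum_i b''(x_i^T beta) x_i x_i^T is negative semidefinite,
# so the log-likelihood is concave in beta.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta = np.array([0.5, -1.0, 2.0])     # arbitrary evaluation point

p = 1.0 / (1.0 + np.exp(-X @ beta))   # mu_i = b'(theta_i)
w = p * (1.0 - p)                     # b''(theta_i) for Bernoulli
H = -(X * w[:, None]).T @ X           # Hessian of the log-likelihood (phi = 1)

assert np.all(np.linalg.eigvalsh(H) <= 1e-10)   # all eigenvalues <= 0
```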
4. Optimization Methods
- Newton-Raphson Method
- Fisher-scoring Method
- Iteratively Re-weighted Least Squares
4.1 Newton-Raphson Method
To maximize \(\ell_n(\beta)\), Newton-Raphson iterates
\[
\beta^{(k+1)} = \beta^{(k)} - H_{\ell_n}\left(\beta^{(k)}\right)^{-1} \nabla \ell_n\left(\beta^{(k)}\right),
\]
where \(\nabla \ell_n\) is the gradient and \(H_{\ell_n}\) the Hessian of the log-likelihood.
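A minimal Newton-Raphson sketch for a Bernoulli GLM with the canonical (logit) link; the data, sample size, and starting point below are made up for illustration:

```python
import numpy as np

# Newton-Raphson for logistic regression (canonical-link Bernoulli GLM).
# Simulated data for illustration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(3)                               # starting point
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))         # mu_i = b'(theta_i)
    grad = X.T @ (y - mu)                        # gradient of l_n (phi = 1)
    H = -(X * (mu * (1.0 - mu))[:, None]).T @ X  # Hessian of l_n
    step = np.linalg.solve(H, grad)
    beta = beta - step                           # beta_{k+1} = beta_k - H^{-1} grad
    if np.linalg.norm(step) < 1e-10:
        break
```

Because the canonical-link log-likelihood is strictly concave here, the iterations converge to the unique MLE.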
4.2 Fisher-scoring method
Newton-Raphson is a generic deterministic optimization method; it does not exploit the fact that our objective involves random data.
Sometimes, calculation of the Hessian matrix is quite complicated (we will see an example)
Goal: use directly the fact that we are minimizing the KL divergence
Idea: replace the Hessian with its expected value. Recall that
\[
I(\beta) = -\mathbb{E}\left[ H_{\ell_n}(\beta) \right] = \mathbb{E}\left[ \nabla \ell_n(\beta)\, \nabla \ell_n(\beta)^{\top} \right]
\]
is the Fisher information matrix.
The Fisher information matrix is positive definite and can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update:
\[
\beta^{(k+1)} = \beta^{(k)} + I\left(\beta^{(k)}\right)^{-1} \nabla \ell_n\left(\beta^{(k)}\right).
\]
It has essentially the same convergence properties as Newton-Raphson, but it is often easier to compute \(I\) than \(H_{\ell_n}\)
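A Fisher-scoring sketch for a Poisson GLM with the canonical log link, again on made-up data. Under the canonical link \(I(\beta) = -H_{\ell_n}(\beta)\), so this update coincides with Newton-Raphson:

```python
import numpy as np

# Fisher scoring for Poisson regression (canonical log link).
# Simulated data for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
beta_true = np.array([0.3, -0.6])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)              # mu_i = b'(theta_i) = e^{theta_i}
    score = X.T @ (y - mu)             # gradient (score) of l_n, phi = 1
    info = (X * mu[:, None]).T @ X     # Fisher information I(beta) = E[-H]
    step = np.linalg.solve(info, score)
    beta = beta + step                 # beta_{k+1} = beta_k + I^{-1} score
    if np.linalg.norm(step) < 1e-10:
        break
```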
