
Chapter 3 Probabilistic Models: an introduction to some probability models

Previously, we introduced the Bayesian approach to machine learning.

Basically, there are four steps:

  • specify a probability model of the form \(p(y|x,\theta)=p(y|f(x;\theta))\)
  • specify a prior distribution \(p(\theta)\)
  • compute the posterior distribution over the unknown parameters, \(p(\theta|y)\)
  • make predictions using \(p(y_{new}|x,y)\)

How do we choose a proper model?

  • It depends on our beliefs about the data.

  • We could enumerate all possible, reasonable models, then pick the "best" one.

Let us review some distributions.

  • Discrete data: Bernoulli, binomial, categorical, multinomial, Poisson, negative binomial, etc.
  • Continuous data: Gaussian (univariate, multivariate), Student's t distribution, Cauchy distribution, gamma distribution, beta distribution, etc.

Discrete

Bernoulli: models binary events (a two-sided die rolled once)

\[\operatorname{Ber}(y \mid \theta) \triangleq \theta^{y}(1-\theta)^{1-y}=\left\{\begin{array}{ll} 1-\theta & \text { if } y=0 \\ \theta & \text { if } y=1 \end{array}\right. \]

where \(0\le \theta\le1\) is the probability that \(y=1\).

  • The Bernoulli distribution is a special case of the binomial distribution.
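
A minimal sketch in Python of the pmf above (the function name `bernoulli_pmf` is my own, for illustration):

```python
def bernoulli_pmf(y: int, theta: float) -> float:
    """Ber(y | theta) = theta^y * (1 - theta)^(1 - y), for y in {0, 1}."""
    assert y in (0, 1) and 0.0 <= theta <= 1.0
    return theta ** y * (1.0 - theta) ** (1 - y)

print(bernoulli_pmf(1, 0.3))  # 0.3 = P(y = 1)
print(bernoulli_pmf(0, 0.3))  # 0.7 = P(y = 0)
```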

Binomial: a two-sided die rolled N times

Suppose we observe a set of \(N\) Bernoulli trials, and let \(S=\sum_{n=1}^{N}\mathbb{I}(y_n=1)\) denote the number of successes.

The distribution of \(S\) is given by the binomial distribution, \(\operatorname{Bin}(s \mid N, \theta) \triangleq\left(\begin{array}{c}N \\ s\end{array}\right) \theta^{s}(1-\theta)^{N-s}\), where \(\left(\begin{array}{c}N \\ k\end{array}\right) \triangleq \frac{N !}{(N-k) ! k !}\). The Bernoulli distribution is the special case with \(N=1\).
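
As a quick check, `scipy.stats.binom` implements this pmf; a small sketch (the parameter values are made up):

```python
from scipy.stats import binom

N, theta = 10, 0.3
print(binom.pmf(3, N, theta))  # P(S = 3) out of N = 10 trials
print(binom.pmf(1, 1, theta))  # with N = 1 this recovers Ber(1 | theta) = 0.3
```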

Sigmoid (logistic) function

When we want to predict a binary variable \(y\in \{0,1\}\) given some inputs \(\mathbf{x} \in \mathcal{X}\), we need to use a conditional probability distribution of the form:

\[p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid f(\mathbf{x} ; \boldsymbol{\theta})) \]

Here \(f(x;\theta)\) plays the role of the Bernoulli parameter, i.e., the probability of the event \(y=1\), so it must lie between 0 and 1. We therefore need to transform \(f\) to satisfy this constraint.

To avoid the requirement that \(0 \leq f(\mathbf{x} ; \boldsymbol{\theta}) \leq 1,\) we can let \(f\) be an unconstrained function, and use the following model:

\[p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid \sigma(f(\mathbf{x} ; \boldsymbol{\theta}))) \]

Here \(\sigma()\) is the sigmoid or logistic function, defined as follows:

\[\sigma(a) \triangleq \frac{1}{1+e^{-a}} \]
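
A direct NumPy translation of this definition (numerical stability for very large \(|a|\) is deferred to the log-sum-exp trick below):

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(0.0))                         # 0.5
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # maps R into (0, 1)
```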

Binary logistic regression

\[p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Ber}\left(y \mid \sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)\right) \]

where \(f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{w}^{\top} \mathbf{x}+b\). (Note: the source text omits the \(+b\) term, presumably because the bias can be absorbed into \(\mathbf{w}\) by appending a constant 1 to \(\mathbf{x}\).)

In other words,

\[p(y=1 \mid \mathbf{x} ; \boldsymbol{\theta})=\sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)=\frac{1}{1+e^{-\left(\mathbf{w}^{\top} \mathbf{x}+b\right)}} \]

This is called logistic regression.

Logistic regression thus amounts to a Bernoulli model, except that the Bernoulli parameter is a function of the covariates \(X\) and the model parameters \(\theta\) rather than a fixed constant, so the model as a whole is not itself a Bernoulli distribution.
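
A minimal prediction sketch for this model; the weights `w` and bias `b` below are made-up values standing in for fitted parameters:

```python
import numpy as np

w = np.array([2.0, -1.0])  # hypothetical fitted weights
b = 0.5                    # hypothetical fitted bias

def predict_proba(x, w, b):
    """p(y = 1 | x; theta) = sigmoid(w^T x + b)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([1.0, 3.0])
print(predict_proba(x, w, b))  # probability that y = 1 for this input
```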

Categorical distribution: a C-sided die rolled once

The categorical distribution generalizes the Bernoulli from binary outcomes to \(C>2\) values, \(y\in \{1,2,\ldots,C\}\): the outcome now has \(C\) possibilities rather than 2.

The categorical distribution is a discrete probability distribution with one parameter per class:

\[\operatorname{Cat}(y \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}(y=c)} \]

In other words, \(p(y=c \mid \boldsymbol{\theta})=\theta_{c} .\)

Note that the parameters are constrained so that \(0 \leq \theta_{c} \leq 1\) and \(\sum_{c=1}^{C} \theta_{c}=1\); thus there are only \(C-1\) independent parameters.

Alternatively, we can use a one-hot encoding: when \(C=3\), the three classes are encoded as \((1,0,0),(0,1,0),(0,0,1)\).

The distribution can then be written as:

\[\operatorname{Cat}(\mathbf{y} \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{y_{c}} \]

The categorical distribution is a special case of the multinomial distribution.

(Distributions nested inside distributions, matryoshka-style.)
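
A small sketch of sampling from a categorical distribution and forming the one-hot encoding (the probability values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.3])  # class probabilities, sum to 1

y = rng.choice(len(theta), p=theta)  # roll the C-sided die once
y_onehot = np.eye(len(theta))[y]     # e.g. class 1 -> (0, 1, 0)

# With the one-hot form, Cat(y | theta) = prod_c theta_c^{y_c} = theta[y].
print(y, y_onehot, theta[y])
```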

Multinomial distribution: a C-sided die rolled N times

Suppose we observe \(N\) categorical trials, \(y_{n} \sim \operatorname{Cat}(\cdot \mid \boldsymbol{\theta}),\) for \(n=1: N .\) Concretely, think of rolling a \(C\)-sided die \(N\) times.

Let us define \(\mathbf{s}\) to be a vector that counts the number of times each face shows up, i.e., \(s_{c} \triangleq \sum_{n=1}^{N} \mathbb{I}\left(y_{n}=c\right)\).

The distribution of \(\mathbf{s}\) is given by the multinomial distribution:

\[\operatorname{Mu}(\mathbf{s} \mid N, \boldsymbol{\theta}) \triangleq\left(\begin{array}{c} N \\ s_{1} \ldots s_{C} \end{array}\right) \prod_{c=1}^{C} \theta_{c}^{s_{c}} \]

where \(\theta_{c}\) is the probability that side \(c\) shows up, and

\[\left(\begin{array}{c} N \\ s_{1} \ldots s_{C} \end{array}\right) \triangleq \frac{N !}{s_{1} ! s_{2} ! \cdots s_{C} !} \]

Note that \(N=\sum_{c=1}^{C} s_{c}\).
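
`scipy.stats.multinomial` implements this pmf directly; a quick sketch with made-up counts:

```python
from scipy.stats import multinomial

N = 10
theta = [0.2, 0.5, 0.3]
# Probability of seeing counts s = (2, 5, 3) in N = 10 rolls of a 3-sided die.
print(multinomial.pmf([2, 5, 3], n=N, p=theta))
```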

Softmax function

This is a generalization of the sigmoid function.

Consider \(p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Cat}(y \mid f(\mathbf{x} ; \boldsymbol{\theta}))\). We require that \(0 \leq f_{c}(\mathbf{x} ; \boldsymbol{\theta}) \leq 1\) and \(\sum_{c=1}^{C} f_{c}(\mathbf{x} ; \boldsymbol{\theta})=1\).

To avoid the requirement that \(f\) directly predict a probability vector, it is common to pass the output of \(f\) into the softmax function, also called the multinomial logit, which is defined as follows:

\[\mathcal{S}(\mathbf{a}) \triangleq\left[\frac{e^{a_{1}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}, \cdots, \frac{e^{a_{C}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\right] \]

This maps \(\mathbb{R}^{C}\) to \([0,1]^{C},\) and satisfies the constraints that \(0 \leq \mathcal{S}(\mathbf{a})_{c} \leq 1\) and \(\sum_{c=1}^{C} \mathcal{S}(\mathbf{a})_{c}=1\).
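
A naive NumPy implementation of this definition (the numerically stable version is the subject of the log-sum-exp trick below):

```python
import numpy as np

def softmax(a):
    """S(a)_c = exp(a_c) / sum_c' exp(a_c'); naive, can overflow for large a."""
    e = np.exp(a)
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # entries in (0, 1), summing to 1
```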

Multiclass logistic regression

Let \(f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{W} \mathbf{x}+\mathbf{b}\). Then

\(p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Cat}(y \mid \mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b}))\), where \(\mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b})\) is the vector of per-class probabilities.

In particular, \(p(y=c \mid \mathbf{x} ; \boldsymbol{\theta})=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\) is the probability that \(y=c\), where \(a_c\) is the \(c\)-th entry of \(\mathbf{a}=\mathbf{W} \mathbf{x}+\mathbf{b}\).
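
Putting the pieces together, a prediction sketch with made-up parameters for \(C=3\) classes and 2 features:

```python
import numpy as np

W = np.array([[ 1.0, -1.0],   # hypothetical per-class weight rows
              [ 0.5,  0.5],
              [-1.0,  2.0]])
b = np.array([0.0, 0.1, -0.1])

x = np.array([1.0, 2.0])
a = W @ x + b                    # logits, a_c = w_c^T x + b_c
p = np.exp(a) / np.exp(a).sum()  # softmax -> Cat(y | S(Wx + b))
print(p)  # p[c] = p(y = c | x; theta)
```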

Log-sum-exp trick

Consider \(\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\). If we compute the numerator and denominator directly, then for large or small values of \(a_c\) the exponentials overflow to Inf or underflow to 0 in floating-point arithmetic, so we need to rewrite the computation in a numerically safe form.

Using the identity \(\log \sum_{c=1}^{C} \exp \left(a_{c}\right)=m+\log \sum_{c=1}^{C} \exp \left(a_{c}-m\right)\), and taking \(m=\max_c a_c\) over \(c=1,2,\ldots,C\),

we have \(p_c=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}=\frac{e^{a_{c}-m}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}}=\exp\left(\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\right)\), and can compute the two terms inside the \(\exp\) separately.

\(\log p_c=\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}=(a_c-m)-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\). (This is the key identity.)
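
A sketch of the trick in NumPy; with inputs around 1000, the naive softmax would overflow, but the shifted version works:

```python
import numpy as np

def log_softmax(a):
    """log p_c = (a_c - m) - log sum_c' exp(a_c' - m), with m = max_c a_c."""
    m = a.max()
    return (a - m) - np.log(np.exp(a - m).sum())

a = np.array([1000.0, 1001.0, 1002.0])  # np.exp(a) alone would overflow to inf
print(log_softmax(a))
print(np.exp(log_softmax(a)))           # stable softmax; sums to 1
```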

Continuous

Gaussian distribution

The pdf of the Gaussian is given by
\(\mathcal{N}\left(y \mid \mu, \sigma^{2}\right) \triangleq \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2 \sigma^{2}}(y-\mu)^{2}}\)

(This distribution is familiar enough that we keep the introduction brief.)


  • Why is the Gaussian distribution so widely used?

    • It has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.

    • The central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or "noise".

    • The Gaussian distribution makes the least number of assumptions (has maximum entropy) subject to the constraint of having a specified mean and variance; in other words, when the first moment exists and the second moment is fixed and finite, the maximum-entropy distribution is Gaussian. This makes it a good default choice in many cases.

    • It has a simple mathematical form, which results in methods that are easy to implement but often highly effective.

Beta distribution: often used to model probabilities

The beta distribution has support over the interval [0,1] and is defined as follows:

\[\operatorname{Beta}(x \mid a, b)=\frac{1}{B(a, b)} x^{a-1}(1-x)^{b-1} \]

where \(B(a, b)\) is the beta function, defined by

\[B(a, b) \triangleq \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \]

where \(\Gamma(a)\) is the Gamma function defined by

\[\Gamma(a) \triangleq \int_{0}^{\infty} x^{a-1} e^{-x} d x \]
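
A quick numerical check of these definitions against scipy (the point values a, b, x are arbitrary):

```python
from scipy.special import beta as beta_fn, gamma as gamma_fn
from scipy.stats import beta

a, b, x = 2.0, 3.0, 0.4
# Beta pdf from the formula vs. scipy's implementation.
pdf_manual = x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)
print(pdf_manual, beta.pdf(x, a, b))
# B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
print(beta_fn(a, b), gamma_fn(a) * gamma_fn(b) / gamma_fn(a + b))
```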

Gamma distribution

The gamma distribution is a flexible distribution for positive real-valued random variables, \(x>0\). It is defined in terms of two parameters, called the shape \(a>0\) and the rate \(b>0\):

\[\mathrm{Ga}(x \mid \text { shape }=a, \text { rate }=b) \triangleq \frac{b^{a}}{\Gamma(a)} x^{a-1} e^{-x b} \]

Note: the gamma distribution has several different parameterizations (e.g., shape/rate vs. shape/scale).
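
For example, `scipy.stats.gamma` uses a shape/scale parameterization, so the rate \(b\) above corresponds to `scale = 1/b`; a quick consistency check:

```python
import math
from scipy.special import gamma as gamma_fn
from scipy.stats import gamma

a, b, x = 3.0, 2.0, 1.5  # shape a, rate b, arbitrary evaluation point x
pdf_manual = b ** a / gamma_fn(a) * x ** (a - 1) * math.exp(-b * x)
print(pdf_manual, gamma.pdf(x, a, scale=1.0 / b))  # should agree
```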

Multivariate Gaussian (normal) distribution

Multivariate Gaussian (normal) distribution is defined as:

\[\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \mathbf{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D / 2}|\mathbf{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top} \mathbf{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right] \]

where \(\boldsymbol{\mu}=\mathbb{E}[\mathbf{y}] \in \mathbb{R}^{D}\) is the mean vector, and \(\boldsymbol{\Sigma}=\operatorname{Cov}[\mathbf{y}]\) is the \(D \times D\) covariance matrix,
defined as follows:

\[\begin{aligned} \operatorname{Cov}[\mathbf{y}] & \triangleq \mathbb{E}\left[(\mathbf{y}-\mathbb{E}[\mathbf{y}])(\mathbf{y}-\mathbb{E}[\mathbf{y}])^{\top}\right] \\ &=\left(\begin{array}{cccc} \mathbb{V}\left[Y_{1}\right] & \operatorname{Cov}\left[Y_{1}, Y_{2}\right] & \cdots & \operatorname{Cov}\left[Y_{1}, Y_{D}\right] \\ \operatorname{Cov}\left[Y_{2}, Y_{1}\right] & \mathbb{V}\left[Y_{2}\right] & \cdots & \operatorname{Cov}\left[Y_{2}, Y_{D}\right] \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}\left[Y_{D}, Y_{1}\right] & \operatorname{Cov}\left[Y_{D}, Y_{2}\right] & \cdots & \mathbb{V}\left[Y_{D}\right] \end{array}\right) \end{aligned} \]

where

\[\operatorname{Cov}\left[Y_{i}, Y_{j}\right] \triangleq \mathbb{E}\left[\left(Y_{i}-\mathbb{E}\left[Y_{i}\right]\right)\left(Y_{j}-\mathbb{E}\left[Y_{j}\right]\right)\right]=\mathbb{E}\left[Y_{i} Y_{j}\right]-\mathbb{E}\left[Y_{i}\right] \mathbb{E}\left[Y_{j}\right] \]

and \(\mathbb{V}\left[Y_{i}\right]=\operatorname{Cov}\left[Y_{i}, Y_{i}\right]\).

  • Important property: the marginal and conditional distributions of a multivariate Gaussian are still Gaussian.
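
A sketch checking the density formula above against `scipy.stats.multivariate_normal` (the mean and covariance values are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
y = np.array([0.5, 0.5])

D = len(mu)
diff = y - mu
quad = diff @ np.linalg.solve(Sigma, diff)  # (y - mu)^T Sigma^{-1} (y - mu)
pdf_manual = np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(pdf_manual, multivariate_normal.pdf(y, mean=mu, cov=Sigma))
```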

Mixture model

We create a mixture model by taking a convex combination of simple distributions.

This has the form

\[p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p_{k}(\mathbf{y}) \]

where \(p_{k}\) is the \(k\)'th mixture component, and \(\pi_{k}\) are the mixture weights, which satisfy \(0 \leq \pi_{k} \leq 1\) and \(\sum_{k=1}^{K} \pi_{k}=1\).

We introduce the discrete latent variable \(z \in\{1, \ldots, K\},\) which specifies which distribution to use for generating the output \(\mathbf{y}\). The latent variable \(z\) indicates which component each sample belongs to, which makes the model easier to interpret and to perform inference in.

The prior on this latent variable is \(p(z=k)=\pi_{k},\) and the conditional is \(p(\mathbf{y} \mid z=k)=p_{k}(\mathbf{y})=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)\).

That is, we define the following joint model:

\[\begin{aligned} p(z \mid \boldsymbol{\theta}) &=\operatorname{Cat}(z \mid \boldsymbol{\pi}) \\ p(\mathbf{y} \mid z=k, \boldsymbol{\theta}) &=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right) \end{aligned} \]

The "generative story" for the data is that we first generate \(z\) (label), and then we generate the observations \(\mathbf{y}\) using the parameters chosen according to the value of \(z\).

首先生成z,根据z再去生成y

\[p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} p(z=k \mid \boldsymbol{\theta}) p(\mathbf{y} \mid z=k, \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right) \]
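
A sketch of this generative story for a two-component Gaussian mixture (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.3, 0.7])              # mixture weights
mus, sigmas = [-2.0, 3.0], [1.0, 0.5]  # per-component Gaussian parameters

z = rng.choice(len(pi), p=pi)          # first generate the label z ~ Cat(pi)
y = rng.normal(mus[z], sigmas[z])      # then generate y from component z
print(z, y)
```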

We can create different kinds of mixture model by varying the base distributions \(p_{k}\).

Gaussian mixture model (GMM)

\[p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

GMMs are often used for clustering.

Note: here \(\mathbf{y}\) denotes the features, not the label/response variable (it plays the role of the covariates in a regression model).

Data: \(\mathbf{y}\) (features).

Objective: infer the parameters \((\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\) for \(k=1,2,\ldots,K\) (3K groups of parameters), and then use the fitted model to make inferences about new data.
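
A sketch of fitting such a model with scikit-learn's `GaussianMixture` on synthetic two-cluster data (all values made up); it estimates exactly the \((\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\) listed above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D features drawn from two well-separated clusters.
Y = np.vstack([rng.normal([-2.0, -2.0], 0.5, size=(100, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(Y)
print(gmm.weights_)  # estimated pi_k
print(gmm.means_)    # estimated mu_k
# gmm.covariances_ holds Sigma_k; gmm.predict(Y) gives cluster labels.
```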

posted on 2021-03-04 14:40 by 子渔渔渔🐟