[Predictive Analytics, Spring 2021] Chapter 3: Probabilistic Models (an introduction to some common probabilistic models)
Previously, we introduced the Bayesian approach to machine learning.
Basically, there are four steps:
- Specify a probability model of the form \(p(y|x,\theta)=p(y|f(x;\theta))\).
- Specify a prior distribution \(p(\theta)\).
- Compute the posterior distribution over the unknown parameters, \(p(\theta|y)\).
- Make predictions using \(p(y_{new}|x,y)\).
How to choose a proper model?
- It depends on our beliefs about the data.
- We could enumerate all possible, reasonable models and then pick the "best" one.
Let us review some distributions.
- Discrete data: Bernoulli, Binomial, categorical, multinomial, Poisson, negative Binomial, etc.
- Continuous data: Gaussian (univariate, multivariate), Student's t, Cauchy, gamma, beta, etc.
Discrete
Bernoulli: models binary events (a two-sided die rolled once)
\(\operatorname{Ber}(y \mid \theta) \triangleq \theta^{y}(1-\theta)^{1-y}\)
where \(0\le \theta\le1\) is the probability that \(y=1\).
- The Bernoulli distribution is a special case of the Binomial distribution.
Binomial (a two-sided die rolled N times)
Suppose we observe a set of \(N\) Bernoulli trials, and let \(S=\sum_{n=1}^{N}\mathbb{I}(y_n=1)\) denote the number of successes.
The distribution of \(S\) is given by the Binomial distribution, \(\operatorname{Bin}(s \mid N, \theta) \triangleq \binom{N}{s} \theta^{s}(1-\theta)^{N-s}\), where \(\binom{N}{k} \triangleq \frac{N!}{(N-k)!\,k!}\). The Bernoulli is the special case of the Binomial with \(N=1\).
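As a quick numerical sanity check, here is a minimal sketch (assuming scipy is available; the numbers are illustrative) verifying that the Binomial with \(N=1\) reduces to the Bernoulli:

```python
# Minimal check that Bin(y | N=1, theta) = Ber(y | theta).
from scipy.stats import bernoulli, binom

theta = 0.3
for y in (0, 1):
    assert abs(binom.pmf(y, 1, theta) - bernoulli.pmf(y, theta)) < 1e-12

# Bin(s | N=10, theta): probability of s successes in N=10 trials.
print([round(binom.pmf(s, 10, theta), 4) for s in range(11)])
```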
Sigmoid (logistic) function
When we want to predict a binary variable \(y\in \{0,1\}\) given some inputs \(\mathbf{x} \in \mathcal{X}\), we need to use a conditional probability distribution of the form
\(p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \operatorname{Ber}(y \mid f(\mathbf{x}; \boldsymbol{\theta}))\)
Here \(f(x;\theta)\) plays the role of the Bernoulli parameter, i.e., the probability of the event \(y=1\), so it must lie between 0 and 1; hence we need to transform \(f\) to satisfy this condition.
To avoid the requirement that \(0 \leq f(\mathbf{x} ; \boldsymbol{\theta}) \leq 1\), we can let \(f\) be an unconstrained function, and use the following model:
\(p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \operatorname{Ber}(y \mid \sigma(f(\mathbf{x}; \boldsymbol{\theta})))\)
Here \(\sigma(\cdot)\) is the sigmoid or logistic function, defined as follows:
\(\sigma(a) \triangleq \frac{1}{1+e^{-a}}\)
Binary logistic regression
where \(f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{w}^{\top} \mathbf{x}+b\). (Note: why is there no \(+b\) in the original text?)
In other words,
\(p(y \mid \mathbf{x}; \boldsymbol{\theta}) = \operatorname{Ber}(y \mid \sigma(\mathbf{w}^{\top}\mathbf{x}+b))\)
This is called logistic regression.
Logistic regression behaves like a Bernoulli distribution, but the Bernoulli parameter \(p\) is built from the covariates \(X\) and the model parameters \(\theta\), so the model as a whole is not itself a Bernoulli distribution.
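As a small illustrative sketch (the weights \(\mathbf{w}\), bias \(b\), and input \(\mathbf{x}\) below are made up, not learned), computing the predicted Bernoulli parameter looks like this:

```python
import numpy as np

def sigmoid(a):
    """Logistic function: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameters; in practice w and b are learned from data.
w = np.array([0.5, -1.2])
b = 0.1
x = np.array([2.0, 0.5])

p_y1 = sigmoid(w @ x + b)  # p(y=1 | x, theta), the Bernoulli parameter
print(p_y1, 1.0 - p_y1)    # probabilities of y=1 and y=0
```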
Categorical distribution (a C-sided die rolled once)
The categorical distribution generalizes the Bernoulli to \(C>2\) values, \(y\in \{1,2,\dots,C\}\): the outcome has \(C\) possible values rather than 2.
The categorical distribution is a discrete probability distribution with one parameter per class:
\(\operatorname{Cat}(y \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}(y=c)}\)
In other words, \(p(y=c \mid \boldsymbol{\theta})=\theta_{c}\).
Note that the parameters are constrained so that \(0 \leq \theta_{c} \leq 1\) and \(\sum_{c=1}^{C} \theta_{c}=1\); thus there are only \(C-1\) independent parameters.
Alternatively, we can use a one-hot encoding: when \(C=3\), the three classes are encoded as \((1,0,0),(0,1,0),(0,0,1)\).
The distribution can then be written as:
\(\operatorname{Cat}(\mathbf{y} \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{y_{c}}\)
The categorical distribution is a special case of the multinomial distribution.
(Distributions nested inside distributions, like matryoshka dolls.)
Multinomial distribution (a C-sided die rolled N times)
Suppose we observe \(N\) categorical trials, \(y_{n} \sim \operatorname{Cat}(\cdot \mid \boldsymbol{\theta})\), for \(n=1:N\). Concretely, think of rolling a \(C\)-sided die \(N\) times.
Let us define \(\mathbf{s}\) to be a vector that counts the number of times each face shows up, i.e., \(s_{c} \triangleq \sum_{n=1}^{N} \mathbb{I}\left(y_{n}=c\right)\).
The distribution of \(\mathbf{s}\) is given by the multinomial distribution:
\(\mathcal{M}(\mathbf{s} \mid N, \boldsymbol{\theta}) \triangleq \binom{N}{s_{1} \ldots s_{C}} \prod_{c=1}^{C} \theta_{c}^{s_{c}}\)
where \(\theta_{c}\) is the probability that side \(c\) shows up, \(N=\sum_{c=1}^{C} s_{c}\), and \(\binom{N}{s_{1} \ldots s_{C}} \triangleq \frac{N!}{s_{1}!\,s_{2}! \cdots s_{C}!}\) is the multinomial coefficient.
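A minimal sampling sketch with numpy (the probabilities \(\boldsymbol{\theta}\) below are illustrative): each draw returns the count vector \(\mathbf{s}\), whose entries sum to \(N\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.3])  # probabilities of the C=3 faces
N = 100

# s counts how many times each face shows up over N rolls; sum(s) == N.
s = rng.multinomial(N, theta)
print(s, s.sum())
```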
Softmax function
A generalization of the sigmoid function.
Consider \(p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Cat}(y \mid f(\mathbf{x} ; \boldsymbol{\theta}))\). We require that \(0 \leq f_{c}(\mathbf{x} ; \boldsymbol{\theta}) \leq 1\) and \(\sum_{c=1}^{C} f_{c}(\mathbf{x} ; \boldsymbol{\theta})=1\).
To avoid the requirement that \(f\) directly predict a probability vector, it is common to pass the output of \(f\) into the softmax function, also called the multinomial logit, defined as follows:
\(\mathcal{S}(\mathbf{a}) \triangleq \left[\frac{e^{a_{1}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}, \ldots, \frac{e^{a_{C}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\right]\)
This maps \(\mathbb{R}^{C}\) to \([0,1]^{C}\), and satisfies the constraints that \(0 \leq \mathcal{S}(\mathbf{a})_{c} \leq 1\) and \(\sum_{c=1}^{C} \mathcal{S}(\mathbf{a})_{c}=1\).
Multiclass logistic regression
\(f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{W} \mathbf{x}+\mathbf{b}\),
\(p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Cat}(y \mid \mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b}))\), where \(\mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b})\) is the vector of per-class probabilities.
\(p(y=c \mid \mathbf{x} ; \boldsymbol{\theta})=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\) is the probability that \(y=c\), where \(\mathbf{a}=\mathbf{W}\mathbf{x}+\mathbf{b}\).
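A minimal numpy sketch of this forward computation (with made-up \(\mathbf{W}\), \(\mathbf{b}\), and \(\mathbf{x}\)); the max-subtraction in `softmax` anticipates the log-sum-exp trick discussed in the next section:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)          # shift by the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Hypothetical parameters: W is C x D, b has length C; learned in practice.
W = np.array([[0.2, -0.5],
              [1.0,  0.3],
              [-0.7, 0.8]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([1.5, -2.0])

p = softmax(W @ x + b)  # vector of p(y=c | x, theta), c = 1..C
print(p, p.sum())       # entries lie in [0,1] and sum to 1
```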
Log-sum-exp trick
Consider \(\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\). If we compute the numerator and denominator directly, then for large or very negative \(a_c\) the exponentials overflow to Inf or underflow to 0 (floating-point precision), so we first need to transform the values into a range the machine can compute with.
Using the identity \(\log \sum_{c=1}^{C} \exp \left(a_{c}\right)=m+\log \sum_{c=1}^{C} \exp \left(a_{c}-m\right)\), let \(m=\max_c a_c\), \(c=1,2,\dots,C\).
Then \(p_c=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}=\frac{e^{a_{c}-m}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}}=\exp\left(\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\right)\), and we compute the two terms inside the exponential separately.
\(\log p_c=\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}=(a_c-m)-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\) (key point)
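A minimal numpy implementation of the trick (a sketch; production code would typically call `scipy.special.logsumexp`):

```python
import numpy as np

def log_softmax(a):
    """Stable log p_c = (a_c - m) - log sum_c' exp(a_c' - m), m = max_c a_c."""
    m = np.max(a)
    shifted = a - m
    return shifted - np.log(np.sum(np.exp(shifted)))

a = np.array([1000.0, 1001.0, 1002.0])  # naive exp(a) overflows to inf
print(np.exp(log_softmax(a)))           # well-behaved probabilities
```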
Continuous
Gaussian distribution
The pdf of the Gaussian is given by
\(\mathcal{N}\left(y \mid \mu, \sigma^{2}\right) \triangleq \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2 \sigma^{2}}(y-\mu)^{2}}\)
(This one is familiar enough that we keep the introduction brief.)
Why is the Gaussian distribution so widely used?
- It has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
- The central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or "noise".
- The Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance; this makes it a good default choice in many cases. (When the first moment exists and the second moment is finite, the maximum-entropy family is the Gaussian family.)
- It has a simple mathematical form, which results in methods that are easy to implement but often highly effective.
Beta distribution (often used to model probabilities)
The beta distribution has support over the interval \([0,1]\) and is defined as follows:
\(\operatorname{Beta}(x \mid a, b)=\frac{1}{B(a, b)} x^{a-1}(1-x)^{b-1}\)
where \(B(a, b)\) is the beta function, defined by
\(B(a, b) \triangleq \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}\)
where \(\Gamma(a)\) is the Gamma function, defined by
\(\Gamma(a) \triangleq \int_{0}^{\infty} x^{a-1} e^{-x}\, dx\)
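As a quick check (assuming scipy is available), the identity \(B(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)\) can be verified numerically:

```python
from scipy.special import beta as B, gamma as Gamma

a, b = 2.0, 3.0
# B(a, b) should equal Gamma(a) * Gamma(b) / Gamma(a + b).
print(B(a, b), Gamma(a) * Gamma(b) / Gamma(a + b))  # both 1/12 here
```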
Gamma distribution
The gamma distribution is a flexible distribution for positive real-valued rv's, \(x>0\). It is defined in terms of two parameters, called the shape \(a>0\) and the rate \(b>0\):
\(\operatorname{Ga}(x \mid a, b) \triangleq \frac{b^{a}}{\Gamma(a)} x^{a-1} e^{-x b}\)
Note: the gamma distribution comes in several different parameterizations (e.g., shape/rate vs. shape/scale).
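For example (assuming scipy is available), `scipy.stats.gamma` uses the shape/scale parameterization, so the rate-\(b\) density above corresponds to `scale = 1/b`:

```python
from math import exp, gamma as G
from scipy.stats import gamma

a, b = 3.0, 2.0   # shape a, rate b, as in the density above
x = 1.5

# scipy parameterizes by shape and *scale*; scale = 1 / rate.
print(gamma.pdf(x, a, scale=1.0 / b))

# Direct evaluation of b^a / Gamma(a) * x^(a-1) * exp(-b*x) agrees.
print(b ** a / G(a) * x ** (a - 1) * exp(-b * x))
```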
Multivariate Gaussian (normal) distribution
The multivariate Gaussian (normal) distribution is defined as:
\(\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right]\)
where \(\boldsymbol{\mu}=\mathbb{E}[\mathbf{y}] \in \mathbb{R}^{D}\) is the mean vector, and \(\boldsymbol{\Sigma}=\operatorname{Cov}[\mathbf{y}]\) is the \(D \times D\) covariance matrix, defined as follows:
\(\operatorname{Cov}[\mathbf{y}] \triangleq \mathbb{E}\left[(\mathbf{y}-\mathbb{E}[\mathbf{y}])(\mathbf{y}-\mathbb{E}[\mathbf{y}])^{\top}\right]\)
where \(\operatorname{Cov}\left[Y_{i}, Y_{j}\right] \triangleq \mathbb{E}\left[\left(Y_{i}-\mathbb{E}\left[Y_{i}\right]\right)\left(Y_{j}-\mathbb{E}\left[Y_{j}\right]\right)\right]\) and \(\mathbb{V}\left[Y_{i}\right]=\operatorname{Cov}\left[Y_{i}, Y_{i}\right]\).
- Important property: the marginal and conditional distributions of a multivariate Gaussian are still Gaussian.
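For reference, this standard result can be stated explicitly: partition \(\mathbf{y}=(\mathbf{y}_1,\mathbf{y}_2)\) with the corresponding blocks of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\); then
\[
p(\mathbf{y}_1)=\mathcal{N}(\mathbf{y}_1 \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}), \qquad
p(\mathbf{y}_1 \mid \mathbf{y}_2)=\mathcal{N}\left(\mathbf{y}_1 \mid \boldsymbol{\mu}_1+\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}(\mathbf{y}_2-\boldsymbol{\mu}_2),\; \boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}\right)
\]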
Mixture model
We create a mixture model by taking a convex combination of simple distributions. This has the form
\(p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p_{k}(\mathbf{y})\)
where \(p_{k}\) is the \(k\)'th mixture component, and the \(\pi_{k}\) are the mixture weights, which satisfy \(0 \leq \pi_{k} \leq 1\) and \(\sum_{k=1}^{K} \pi_{k}=1\).
We introduce the discrete latent variable \(z \in\{1, \ldots, K\}\), which specifies which distribution to use for generating the output \(\mathbf{y}\); this eases interpretation and inference of the model.
The prior on this latent variable is \(p(z=k)=\pi_{k}\), and the conditional is \(p(\mathbf{y} \mid z=k)=p_{k}(\mathbf{y})=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)\).
That is, we define the following joint model:
\(p(z \mid \boldsymbol{\theta})=\operatorname{Cat}(z \mid \boldsymbol{\pi})\)
\(p(\mathbf{y} \mid z=k, \boldsymbol{\theta})=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)\)
The "generative story" for the data is that we first generate \(z\) (label), and then we generate the observations \(\mathbf{y}\) using the parameters chosen according to the value of \(z\).
首先生成z,根据z再去生成y
We can create different kinds of mixture models by varying the base distributions \(p_{k}\).
Gaussian mixture model (GMM)
A GMM is often used for clustering:
\(p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k}\, \mathcal{N}\left(\mathbf{y} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)\)
Note: \(\mathbf{y}\) here denotes the features, not the label or response variable (it plays the role of the covariates in a regression model).
Data: \(\mathbf{y}\) (features).
Objective: infer the parameters \((\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\), \(k=1,2,\dots,K\), i.e., \(3K\) parameter groups; estimate the parameters, then perform inference on new data.
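A minimal sketch of the generative story for a GMM (all parameters below are made up for illustration): first draw the latent label \(z \sim \operatorname{Cat}(\boldsymbol{\pi})\), then draw \(\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GMM parameters: K=3 components in D=2 dimensions.
pi = np.array([0.5, 0.3, 0.2])                         # mixture weights
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])  # component means
Sigmas = np.stack([np.eye(2)] * 3)                     # component covariances

def sample_gmm(n):
    """Sample n points via the generative story: z first, then y | z."""
    z = rng.choice(len(pi), size=n, p=pi)              # latent labels
    y = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return y, z

y, z = sample_gmm(500)
print(y.shape, np.bincount(z))  # (500, 2); counts roughly proportional to pi
```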