
Bati's eHome of Tech


Define:

MLE: Maximum likelihood estimation

LSE: Least-squares estimation

SSE: Sum of squared errors

_______________________________

On Monday, Prof. Liu lectured on PRML ch. 1 and raised a question: can MLE be equated with minimizing SSE? An interesting question, so I looked up some references.

Wikipedia says: "The least-squares estimator optimizes a certain criterion (namely it minimizes the sum of the square of the residuals)." This explains the relationship between minimizing SSE and LSE: the least-squares estimator LSE optimizes one particular criterion, namely minimizing the residual sum of squares, i.e., SSE.
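In symbols (my own notation, not taken from the Wikipedia article): for data $(x_i, y_i)$, $i = 1, \dots, n$, and a model $f(x; \theta)$,

\[
\hat{\theta}_{\mathrm{LSE}} = \arg\min_{\theta} \mathrm{SSE}(\theta),
\qquad
\mathrm{SSE}(\theta) = \sum_{i=1}^{n} \bigl( y_i - f(x_i; \theta) \bigr)^2 .
\]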

"Least squares corresponds to the maximum likelihood criterion if the experimental errors have a normal distribution and can also be derived as a method of moments estimator." This means that if the experimental errors follow a Gaussian distribution, then least-squares estimation LSE corresponds to maximum likelihood estimation MLE. I don't yet know what the method of moments is; I'll look into it later, and anyone who does know is welcome to enlighten me.
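It is easy to see why the Gaussian case gives this correspondence (a standard derivation; I assume i.i.d. errors $y_i = f(x_i; \theta) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$):

\[
\log L(\theta) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\!\Bigl( -\frac{(y_i - f(x_i;\theta))^2}{2\sigma^2} \Bigr)
= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{\mathrm{SSE}(\theta)}{2\sigma^2},
\]

so for fixed $\sigma$, maximizing $\log L(\theta)$ over $\theta$ is exactly minimizing $\mathrm{SSE}(\theta)$.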

Tutorial on maximum likelihood estimation (Myung, 2003) contains the following passage:

As in MLE, finding the parameter values that minimize SSE generally requires use of a non-linear optimization algorithm. Minimization of LSE is also subject to the local minima problem, especially when the model is non-linear with respect to its parameters. The choice between the two methods of estimation can have non-trivial consequences. In general, LSE estimates tend to differ from MLE estimates, especially for data that are not normally distributed such as proportion correct and response time. An implication is that one might possibly arrive at different conclusions about the same data set depending upon which method of estimation is employed in analyzing the data. When this occurs, MLE should be preferred to LSE, unless the probability density function is unknown or difficult to obtain in an easily computable form, for instance, for the diffusion model of recognition memory (Ratcliff, 1978). There is a situation, however, in which the two methods intersect. This is when observations are independent of one another and are normally distributed with a constant variance. In this case, maximization of the log-likelihood is equivalent to minimization of SSE, and therefore, the same parameter values are obtained under either MLE or LSE.

Roughly: in general, the estimates from minimizing SSE and from MLE differ, especially when the data are not normally distributed (e.g., proportion correct and response time). When this happens, MLE should be preferred to minimizing SSE, unless the pdf is unknown or hard to obtain in a computable form. The two methods do intersect, however: when the observations are mutually independent and normally distributed with constant variance, MLE is equivalent to minimizing SSE.
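To convince myself, here is a minimal numerical sketch (my own toy example, not from the tutorial; the linear model and variable names are assumptions for illustration): with i.i.d. Gaussian noise, the MLE obtained by maximizing the log-likelihood should match the least-squares solution.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])       # design matrix
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)  # Gaussian noise

# LSE: the closed-form minimizer of the sum of squared residuals.
w_lse, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE: minimize the negative Gaussian log-likelihood over (w, log sigma).
def neg_log_lik(params):
    w, log_sigma = params[:2], params[2]
    resid = y - X @ w
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

w_mle = minimize(neg_log_lik, x0=np.zeros(3)).x[:2]
print(w_lse, w_mle)  # the two estimates agree up to optimizer tolerance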

Following this line of thought, does it follow that whenever the data are dependent or not normally distributed, minimizing SSE and MLE must differ? In chapter 2 of Learning with Kernels, the authors also explore this question. They view MLE and minimizing SSE (empirical risk) as essentially the same thing, the only difference being the choice of density function versus loss function. From this viewpoint, even if the training data are not mutually independent (say, they have a Markov property) or are not normally distributed, which means the density function is more complicated, one should still be able to find a corresponding loss function, and that loss function plays the role that SSE plays in the Gaussian case!
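As I understand it, the mechanism behind this view is that every density induces a loss through the negative log-likelihood, $\ell(y, f(x)) = -\log p(y \mid f(x))$; SSE is simply the special case of a Gaussian density. Two examples (a sketch, dropping additive constants):

\[
p(y \mid f) \propto e^{-(y-f)^2 / (2\sigma^2)} \;\Rightarrow\; \ell(y, f) \propto (y - f)^2
\quad \text{(squared loss, i.e., SSE)},
\]
\[
p(y \mid f) \propto e^{-|y-f| / b} \;\Rightarrow\; \ell(y, f) \propto |y - f|
\quad \text{(absolute loss)}.
\]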

The above is just my own understanding; if there are shortcomings, please point them out!

posted on 2009-05-20 11:33  Bati