NLP - Post category - COS

Kullback–Leibler divergence (relative entropy) [Repost]

Summary: Kullback–Leibler divergence. From Wikipedia, the free encyclopedia (Redirected from Relative entropy). In probability theory and information theory, the Kullback–Leibler divergence[1][2][3] (also information divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference betwe Read more

posted @ 2011-12-08 14:48 COS Views(1857) Comments(0) Recommended(0)
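
The summary above is cut off before any formula appears. As a minimal sketch, assuming two discrete distributions given as equal-length lists of probabilities, the standard definition D_KL(P‖Q) = Σ p_i log(p_i / q_i) can be computed as below. The function name `kl_divergence` and the example values are illustrative and do not come from the original post. Evaluating it in both directions illustrates the non-symmetry the summary mentions.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions p and q.

    Assumes p and q have the same length and that q[i] > 0 wherever p[i] > 0.
    Terms with p[i] == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # divergence of Q from P
print(kl_divergence(q, p))  # a different value: the measure is not symmetric
```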

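Discriminative models, generative models, and the naive Bayes method [Repost]

Summary: When reposting, please credit the source: http://www.cnblogs.com/jerrylead 1 Discriminative and generative models. The regression models covered in the previous report are discriminative models: they compute the probability of an outcome from the feature values. Formally, this means solving for the conditional probability once the parameters are fixed. Put plainly, it means predicting the probability of an outcome given the features. For example, to decide whether an animal is a goat or a sheep, the discriminative approach first learns a model from historical data, then extracts the animal's features and predicts the probability that it is a goat and the probability that it is a sheep. Alternatively, we can first learn a goat model from the features of goats and a sheep model from the features of sheep. We then extract the animal's features, evaluate their probability under the goat model and under the sheep model, and pick whichever is larger. Read more

posted @ 2011-11-23 10:43 COS Views(356) Comments(0) Recommended(0)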

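Maximum likelihood estimation [Repost]

Summary: Maximum likelihood estimation provides a way to estimate model parameters from observed data, that is: "the model is fixed, the parameters are unknown". As a simple example, suppose we want to study the heights of the national population. We first assume that height follows a normal distribution whose mean and variance are unknown. We lack the manpower and resources to measure everyone in the country, but we can sample a subset of people and then use maximum likelihood estimation to obtain the mean and variance of the assumed normal distribution. Sampling for maximum likelihood estimation must satisfy one very important assumption: all samples are independent and identically distributed. We now describe maximum likelihood estimation in detail. First, assume the samples are independent and identically distributed, θ is the model parameter, and f is the model we use, following the i.i.d. assumption above. The probability that model f with parameter θ produces these samples can be expressed as [...] Returning to "the model is fixed, the parameters are unknown" above Read more

posted @ 2011-11-16 15:38 COS Views(397) Comments(0) Recommended(1)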
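
To make the height example in the summary concrete, here is a minimal sketch, assuming i.i.d. samples from a normal distribution. For that model the maximum likelihood estimates have a well-known closed form: the sample mean, and the average squared deviation from it (dividing by n, not n - 1). The function name `normal_mle` and the height values are hypothetical and do not come from the original post.

```python
def normal_mle(samples):
    """Closed-form maximum likelihood estimates (mean, variance) for a normal model.

    Assumes the samples are i.i.d. and the list is non-empty.
    """
    n = len(samples)
    mu = sum(samples) / n
    # The MLE of the variance divides by n (it is biased), not by n - 1.
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

# Hypothetical sampled heights in centimetres.
heights = [160.0, 170.0, 180.0]
mu, var = normal_mle(heights)
print(mu, var)
```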

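A simple explanation of discriminative and generative models

Summary: Discriminative models. These models focus on p(y|x) and predict y from x. Modeling does not need to consider the joint probability distribution; the only concern is how to optimize p(y|x) so that the data become separable. Discriminative models usually outperform generative models on classification tasks. However, they are usually trained with supervision and are hard to extend to the unsupervised setting. Common discriminative models include: Logistic regression, Linear discriminant analysis, Support vector machines, Boosting, Conditional random fields, Linear regression, Neural networks ... Read more

posted @ 2011-11-16 15:32 COS Views(329) Comments(1) Recommended(1)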

Likelihood principle [Repost]

Summary: Likelihood principle. From Wikipedia, the free encyclopedia. In statistics, the likelihood principle is a controversial principle of statistical inference which asserts that all of the information in a sample is contained in the likelihood function. A likelihood function arises from a conditional probability distribu Read more

posted @ 2011-11-16 15:24 COS Views(489) Comments(0) Recommended(1)

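TF-IDF [Repost]

Summary: TF-IDF. From Wikipedia, the free encyclopedia. TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results. Read more

posted @ 2011-11-16 15:20 COS Views(428) Comments(1) Recommended(1)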
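
The weighting described in the TF-IDF summary can be sketched as follows. This assumes one common variant (the summary does not say which form it uses): term frequency as the term's count divided by the document length, and inverse document frequency as log(N / df). The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`, where `corpus` is a list of token lists.

    Assumed variant: tf = count / document length, idf = log(N / document frequency).
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # the term appears nowhere in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "fast"],
]

for term in ("the", "cat", "sat"):
    print(term, tf_idf(term, corpus[0], corpus))
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "sat", which occurs in only one document, receives the highest weight of the three.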

Latent Dirichlet allocation [Repost]

Summary: Latent Dirichlet allocation. From Wikipedia, the free encyclopedia. In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected Read more

posted @ 2011-11-14 14:11 COS Views(947) Comments(0) Recommended(0)

Plate notation [Repost]

Summary: Plate notation. From Wikipedia, the free encyclopedia. Plate notation is a method of representing variables that repeat in a graphical model. Instead of drawing each repeated variable individually, a plate or rectangle is used to group variables into a subgraph that repeat together, and a number is drawn Read more

posted @ 2011-11-14 14:07 COS Views(868) Comments(0) Recommended(1)

Probabilistic latent semantic analysis [Repost]

Summary: Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. PLSA evolved from latent semantic analysis, adding a sounder probabilisti Read more

posted @ 2011-11-14 13:41 COS Views(559) Comments(0) Recommended(1)

Latent semantic indexing [Repost]

Summary: Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that Read more

posted @ 2011-11-01 16:37 COS Views(580) Comments(0) Recommended(1)

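Basics of text information retrieval [Repost]

Summary: Text information retrieval is information retrieval technology aimed at text. In the technical community, text information retrieval is often equated with information retrieval itself. Compared with video and audio retrieval, text information retrieval has developed faster and is more mature, and retrieval techniques for other modalities often rely on its support. Although web search engines are no longer limited to retrieving text, text information retrieval remains the foundation of most of them. History. Ever since human writing emerged, how to quickly find or obtain information from the large volume recorded on all kinds of storage media has been a prominent problem. This problem bears on how people can Read more

posted @ 2011-10-24 12:36 COS Views(630) Comments(0) Recommended(1)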

Normal distribution (Gaussian distribution)

Summary: Source: http://en.wikipedia.org/wiki/Normal_distribution Read more

posted @ 2011-10-09 14:46 COS Views(610) Comments(0) Recommended(0)

Exponential family

Summary: From Wikipedia, the free encyclopedia. Not to be confused with the exponential distribution. "Natural parameter" links here. For the usage of this term in differential geometry, see differential geometry of curves. In probability and statistics, an exponential family is an important cla Read more

posted @ 2011-10-09 13:48 COS Views(1679) Comments(0) Recommended(0)

Dirichlet distribution

Summary: From Wikipedia, the free encyclopedia. (Figure: several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α=(6,2,2), (3,7,5), (6,2,6), (2,3,4).) In probability and statistics, the Dirichlet distribution (a Read more

posted @ 2011-09-26 10:57 COS Views(1127) Comments(0) Recommended(0)

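Gamma function

Summary: Definition. The Gamma function, as an extension of the factorial, is a meromorphic function defined over the complex numbers, usually written Γ(x). When the argument is a positive integer, the value of the function is the factorial of the preceding integer, that is, Γ(n+1) = n!. Formula. The Gamma function is given by Γ(x) = ∫ e^(-t) t^(x-1) dt (integrating from 0 to +∞). Integration by parts gives Γ(x) = (x-1) Γ(x-1), and it is easy to compute Γ(1) = 1, so for positive integers Γ(n+1) = n!. In probability there is an important distribution called the gamma distribution: f(x)=λe^(-λx)(λx)^(x-... Read more

posted @ 2011-09-02 10:20 COS Views(11754) Comments(0) Recommended(1)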
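
The two identities quoted in the Gamma function summary, Γ(n+1) = n! and Γ(x) = (x-1) Γ(x-1), can be checked numerically. This is a small sketch using Python's standard `math.gamma`; the helper names and the test values are illustrative only.

```python
import math

def gamma_matches_factorial(n):
    """Check Gamma(n + 1) == n! for a non-negative integer n, up to floating-point error."""
    return math.isclose(math.gamma(n + 1), math.factorial(n))

def gamma_recurrence_holds(x):
    """Check Gamma(x) == (x - 1) * Gamma(x - 1) for a real x > 1."""
    return math.isclose(math.gamma(x), (x - 1) * math.gamma(x - 1))

# Illustrative values: small integers for the factorial identity,
# a non-integer argument for the recurrence.
print(all(gamma_matches_factorial(n) for n in range(0, 10)))
print(gamma_recurrence_holds(4.5))
```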

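KNN (k-nearest neighbor) algorithm [Repost]

Summary: The decision process of the KNN algorithm (k-Nearest Neighbor algorithm). In the figure on the right, which class should the green circle be assigned to: the red triangles or the blue squares? If K=3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red triangle class. If K=5, blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue square class. The K-nearest neighbor (k-Nearest Neighbor, KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea is this: if most of the k samples most similar to a given sample (that is, its nearest neighbors in feature space) belong to some class, then the sample belongs to that class too. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. This method... Read more

posted @ 2011-08-15 08:44 COS Views(512) Comments(0) Recommended(1)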
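
The majority-vote rule described in the KNN summary can be sketched as follows. The figure it refers to is not reproduced here, so the coordinates below are hypothetical, chosen only so that the outcome mirrors the summary: the query point is classified as a red triangle for K=3 and as a blue square for K=5. The function name `knn_classify` and the use of Euclidean distance are likewise assumptions.

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k):
    """Return the majority label among the k labeled points nearest to `query`.

    `labeled_points` is a list of (point, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(labeled_points, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical layout: two red triangles at distances 1.0 and 1.1 from the query,
# three blue squares at distances 1.5, 2.0 and 2.1.
training = [
    ((1.0, 0.0), "red triangle"),
    ((0.0, 1.1), "red triangle"),
    ((-1.5, 0.0), "blue square"),
    ((0.0, -2.0), "blue square"),
    ((2.1, 0.0), "blue square"),
]
green_circle = (0.0, 0.0)

print(knn_classify(green_circle, training, k=3))  # 2 of 3 neighbors are red triangles
print(knn_classify(green_circle, training, k=5))  # 3 of 5 neighbors are blue squares
```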

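Google's ten core technologies (excerpted from CSDN)

Summary: Wu Zhuhua, a CSDN blog expert who formerly worked at IBM China Research Lab on cloud computing research, wrote an article titled "Exploring the Secrets Behind Google App Engine (1): Google's Core Technologies", which analyzes Google's core technologies and overall architecture in detail. It is reposted here for everyone to study. This article mainly introduces Google's ten core technologies, which fall into four categories: 1. Distributed infrastructure: GFS, Chubby, and Protocol Buffer. 2. Distributed large-scale data processing: MapReduce and Sawzall. 3. Distributed database technology: BigTable and database sharding. 4. Data center optimization: high-temperature data centers, 12V batteries, and server consolidation. Distributed infrastructure. GFS. Because search Read more

posted @ 2010-08-13 19:49 COS Views(658) Comments(0) Recommended(1)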
