Python Natural Language Processing Study Notes (57): Summary

6.8   Summary

  • Modeling the linguistic data found in corpora can help us understand linguistic patterns, and can be used to make predictions about new language data.

  • Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input.

  • Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, and determining entailment relations, among many others.

  • When training a supervised classifier, you should split your corpus into three datasets: a training set for building the classifier model; a dev-test (development-test) set for helping select and tune the model's features; and a test set for evaluating the final model's performance.

  • When evaluating a supervised classifier, it is important that you use fresh data that was not included in the training or dev-test sets. Otherwise, your evaluation results may be unrealistically optimistic.

  • Decision trees are automatically constructed tree-structured flowcharts that are used to assign labels to input values based on their features. Although they're easy to interpret, they are not very good at handling cases where feature values interact in determining the proper label.

  • In naive Bayes classifiers, each feature independently contributes to the decision of which label should be used. This allows feature values to interact, but can be problematic when two or more features are highly correlated with one another, because their shared evidence is effectively counted more than once.

  • Maximum Entropy classifiers use a basic model that is similar to the model used by naive Bayes; however, they employ iterative optimization to find the set of feature weights that maximizes the probability of the training set.

  • Most of the models that are automatically constructed from a corpus are descriptive: they let us know which features are relevant to a given pattern or construction, but they don't give any information about causal relationships between those features and patterns.
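
The three-way split described in the summary can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name three_way_split, the 80/10/10 fractions, the fixed seed, and the toy corpus are all assumptions made for this example and do not come from the original notes.

```python
import random

def three_way_split(labeled_items, train_frac=0.8, devtest_frac=0.1, seed=0):
    """Shuffle labeled items and cut them into train / dev-test / test lists.

    The fractions and the seed are illustrative defaults, not values
    taken from the notes.
    """
    items = list(labeled_items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n_train = int(len(items) * train_frac)
    n_devtest = int(len(items) * devtest_frac)
    train = items[:n_train]
    devtest = items[n_train:n_train + n_devtest]
    test = items[n_train + n_devtest:]  # everything left over
    return train, devtest, test

# Hypothetical toy corpus of (text, label) pairs.
corpus = [("doc%d" % i, "pos" if i % 2 == 0 else "neg") for i in range(100)]
train_set, devtest_set, test_set = three_way_split(corpus)
print(len(train_set), len(devtest_set), len(test_set))
```

In this workflow the dev-test set is the one you inspect repeatedly while adjusting features, and the test set is scored only once at the end, so that the final number is not inflated by tuning.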
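
The point about correlated features in naive Bayes can be seen in a small hand-rolled sketch. This is not NLTK's implementation; the helper names (train_nb, log_scores, classify, margin), the add-one smoothing, and the six-item toy dataset are assumptions chosen for illustration. The example duplicates a feature under a second name ("last_copy") and compares how strongly each model prefers one label.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, fname) -> Counter of values
    values = defaultdict(set)              # fname -> values seen in training
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            feature_counts[(label, fname)][fval] += 1
            values[fname].add(fval)
    return label_counts, feature_counts, values

def log_scores(model, features):
    """Return log P(label) + sum of log P(feature | label) for each label."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)
        for fname, fval in features.items():
            if fname not in values:
                continue  # ignore features never seen in training
            count = feature_counts[(label, fname)][fval]
            # Add-one smoothing; each feature adds its own independent term.
            score += math.log((count + 1) / (n + len(values[fname])))
        scores[label] = score
    return scores

def classify(model, features):
    scores = log_scores(model, features)
    return max(scores, key=scores.get)

def margin(model, features, a="female", b="male"):
    """Log-score gap between two labels; larger means more confident."""
    scores = log_scores(model, features)
    return scores[a] - scores[b]

# Hypothetical toy data: (last letter of a name, label).
toy = [("a", "female"), ("a", "female"), ("k", "female"),
       ("a", "male"), ("k", "male"), ("k", "male")]
single = train_nb([({"last": c}, lab) for c, lab in toy])
doubled = train_nb([({"last": c, "last_copy": c}, lab) for c, lab in toy])

m1 = margin(single, {"last": "a"})
m2 = margin(doubled, {"last": "a", "last_copy": "a"})
print(m1, m2)
```

The duplicated feature carries no new information, yet under these assumptions the second margin is twice the first: the model treats the copy as independent evidence and becomes more confident than the data justifies.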

posted @ 2011-09-05 23:28  牛皮糖NewPtone