April 19, 2011 by Sandro Saitta

Over the last few years, I’ve read several data mining articles.

Here is a list of my top five articles in data mining.

For each article, I give the title, the authors, and part of the abstract. Feel free to suggest your own favorites.

An Introduction to Variable and Feature Selection

Isabelle Guyon and André Elisseeff

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.
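
To make the idea of variable selection concrete, here is a minimal sketch of one simple filter approach: rank each variable by the absolute value of its Pearson correlation with the target and keep the top few. This is only an illustration under my own assumptions, not the method of the article; the function names (`pearson`, `rank_features`) and the toy data are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, target):
    """Return feature indices, most to least correlated with the target."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    # Sort by descending score; break ties by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores]


# Toy data (invented): feature 0 follows the target closely,
# feature 1 is constant, feature 2 is only weakly related.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.7],
]
target = [1.1, 1.9, 3.2, 3.9]

ranking = rank_features(rows, target)
top_k = ranking[:2]  # keep the two highest-ranked variables
```

A ranking like this scores each variable on its own, so it can miss variables that are only useful in combination with others.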

Data Clustering: A Review

A.K. Jain, M.N. Murty and P.J. Flynn

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners.
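
As a small illustration of grouping patterns without labels, here is a sketch of k-means, one widely used partitional clustering method. It is a simplified version under my own assumptions: the centroids are seeded with the first `k` points (real implementations usually choose them more carefully), and the helper names and toy points are invented for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    """Assign each point to one of k clusters; return (labels, centroids)."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Toy data (invented): two well-separated groups in the plane.
points = [
    (1.0, 1.0), (5.0, 5.0),
    (1.2, 0.8), (5.2, 4.9),
    (0.9, 1.1), (4.8, 5.1),
]
labels, centroids = k_means(points, k=2)
```

The result depends on the starting centroids and on the chosen `k`, which is part of what makes clustering hard in practice.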

From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases.

Nine Laws of Data Mining

Tom Khabaza

In its current form, data mining as a field of practise came into existence in the 1990s, aided by the emergence of data mining algorithms packaged within workbenches so as to be suitable for business analysts.  Perhaps because of its origins in practice rather than in theory, relatively little attention has been paid to understanding the nature of the data mining process.  The development of the CRISP-DM methodology in the late 1990s was a substantial step towards a standardised description of the process that had already been found successful and was (and is) followed by most practising data miners.

Statistical Modeling: The Two Cultures

Leo Breiman

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics.
