

It is common for datasets to have thousands of features, but processing thousands of features during training and testing can be computationally infeasible. Moreover, many irrelevant features can lead to overfitting. So we need to select the most relevant features in order to obtain faster, better, and easier-to-understand learning models. There are many methods for feature selection, such as wrapper methods, filter methods, univariate methods, and multivariate methods. Here I want to talk about the filter method.

The filter method ranks all features by a measure of correlation with the label, and then selects the top K features to use in the model. There are several ways to measure the correlation between a feature X and the label Y: mutual information, the chi-square statistic, the Pearson correlation coefficient, the signal-to-noise ratio, and the t-test.
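To make the rank-and-select idea concrete, here is a minimal sketch in Python using scikit-learn's SelectKBest (assuming scikit-learn is installed; the iris data and K = 2 are placeholders, and any of the measures below can be swapped in as the score function):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the label independently, then keep the top K = 2.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_top = selector.fit_transform(X, y)

print("feature scores:", selector.scores_)  # one correlation score per feature
print("reduced shape:", X_top.shape)        # (150, 2)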

 

1) Mutual Information:

From basic probability we know that if X and Y are independent, then P(X,Y) = P(X)P(Y).

Measure of dependence (the mutual information):

I(X;Y) = Σ_x Σ_y P(x,y) log [ P(x,y) / (P(x)P(y)) ]

It is 0 when X and Y are independent.

It is maximal when X = Y (it then equals the entropy of X).
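Because the formula only needs empirical probabilities, MI for nominal data can be estimated directly from counts. A minimal sketch (the helper name mutual_information is my own; it uses the natural logarithm, so the result is in nats):

import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X;Y) = sum over (x, y) of P(x,y) * log(P(x,y) / (P(x)P(y))),
    # with all probabilities estimated from empirical counts.
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # P(x,y) / (P(x)P(y)) simplifies to c*n / (count_x[x] * count_y[y])
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

# A feature identical to the label gives maximal MI (the entropy of X);
# an independent feature gives (approximately) 0.
print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))  # log 2 ≈ 0.693
print(mutual_information(["a", "b", "a", "b"], [0, 0, 1, 1]))  # 0.0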

Limitations of the MI method:

- works only with nominal (categorical) features and labels

- biased toward high-arity features

- may choose redundant features

- a feature may become relevant only in the context of others, which per-feature scoring cannot detect

(A comparison between MI, chi-square, and the log-likelihood ratio appears in [Dunning, CL '93], "Accurate Methods for the Statistics of Surprise and Coincidence".)

2) Chi-Square Test of Independence

3) Pearson Correlation Coefficient

4) Signal-to-Noise Ratio

5) T-test

posted on 2010-05-10 05:45 by Zhu Qing