[ML] Feature Selectors

SparkML's feature algorithms fall into three groups: Extractors (feature extraction), Transformers (feature transformation), and Selectors (feature selection).

Ref: Three feature-selection algorithms in SparkML (VectorSlicer / RFormula / ChiSqSelector)

 

 

1. Code Examples

VectorSlicer is only a way to pick features "manually" by index; it is not a criterion for feature selection.

RFormula likewise only picks features "manually" by column; it is not a criterion for feature selection.

VectorSlicer

```python
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row

df = spark.createDataFrame([
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]))])
df.show()
# +--------------------+
# |        userFeatures|
# +--------------------+
# |(3,[0,1],[-2.0,2.3])|
# |      [-2.0,2.3,0.0]|
# +--------------------+

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])
output = slicer.transform(df)
output.select("userFeatures", "features").show()
# +--------------------+-------------+
# |        userFeatures|     features|
# +--------------------+-------------+
# |(3,[0,1],[-2.0,2.3])|(1,[0],[2.3])|
# |      [-2.0,2.3,0.0]|        [2.3]|
# +--------------------+-------------+
```

RFormula
```python
from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])
formula = RFormula(
    formula="clicked ~ country + hour",  # use these two features; country is one-hot encoded automatically
    featuresCol="features",
    labelCol="label")
output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
# +--------------+-----+
# |      features|label|
# +--------------+-----+
# |[0.0,0.0,18.0]|  1.0|
# |[0.0,1.0,12.0]|  0.0|
# |[1.0,0.0,15.0]|  0.0|
# +--------------+-----+
```

ChiSqSelector
```python
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
    ["id", "features", "clicked"])

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)

print("ChiSqSelector output with top %d features selected"
      % selector.getNumTopFeatures())
result.show()
# ChiSqSelector output with top 1 features selected
# +---+------------------+-------+----------------+
# | id|          features|clicked|selectedFeatures|
# +---+------------------+-------+----------------+
# |  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
# |  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
# |  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
# +---+------------------+-------+----------------+
```

 

 

2. Practical Notes

Ref: [Feature] Feature selection

Outline

3.1 Filter

3.1.1 Variance threshold

3.1.2 Correlation coefficient

3.1.3 Chi-squared test    # <---- ChiSqSelector

3.1.4 Mutual information

3.2 Wrapper

3.2.1 Recursive feature elimination

3.3 Embedded

3.3.1 Penalty-based feature selection

3.3.2 Tree-model-based feature selection

  

Correlation coefficient

```python
# Pairwise correlation of two columns:
fraud_pd.corr('balance', 'numTrans')

# Upper-triangular correlation matrix over all numerical columns:
n_numerical = len(numerical)
corr = []
for i in range(0, n_numerical):
    temp = [None] * i
    for j in range(i, n_numerical):
        temp.append(fraud_pd.corr(numerical[i], numerical[j]))
    corr.append(temp)

print(corr)
```

Output:

```
[[1.0,  0.00044, 0.00027],
 [None, 1.0,    -0.00028],
 [None, None,    1.0]]
```

 

 

3. Embedded

Ref: [Feature] Feature selection - Embedded topic

Question: can spark.ml do Lasso linear regression? There is no standalone Lasso estimator in 2.4.4 (the old spark.mllib has one, but its feature set is not very satisfying). However, spark.ml's `LinearRegression` covers it: setting `elasticNetParam=1.0` with a positive `regParam` gives a pure L1 penalty, i.e. the Lasso.

classification (SVMs, logistic regression)

linear regression (least squares, Lasso, ridge)

For the latter, take a sample of the data and use sklearn to draw the regularization "path" plot.

The subset is built by sampling the DataFrame with Spark SQL.

 

End.

posted @ 2019-11-08 11:49  郝壹贰叁