Spark MLlib Basics: Classification Algorithms (Logistic Regression)
Key points:
Logistic regression: a binary classification algorithm that maps a linear model's output to a probability (via the sigmoid function)
MLlib pipelines: Pipeline, PipelineStage, fit(), transform()
Model evaluation: BinaryClassificationEvaluator (for binary classification)
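To make the "probability mapping" concrete, here is a minimal pure-Python sketch of what logistic regression computes at prediction time (the weights below are made-up numbers for illustration, not a trained model):

```python
import math

def sigmoid(z):
    # Squash a raw linear score into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Linear combination w.x + b, then mapped through the sigmoid
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights/features, for illustration only
p = predict_proba([0.8, -0.5], 0.1, [1.2, 0.7])
print(p)  # a probability between 0 and 1; predict class 1 if p >= 0.5
```

MLlib's LogisticRegression learns the weights and bias from data; the prediction step is this same mapping applied to the assembled feature vector.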
Exercise:
Implement binary classification on the iris dataset with logistic regression, wrapping the whole workflow in a Pipeline:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("iris-lr").getOrCreate()

# Read the data
df = spark.read.csv("iris.csv", header=True, inferSchema=True)

# Iris has three species; drop one (e.g. virginica) so the task is
# genuinely binary -- BinaryClassificationEvaluator assumes two classes
df = df.filter(df.species != "virginica")

# Label encoding (convert the string label to a numeric index);
# pass the unfitted estimator so the pipeline fits it on the training set
label_indexer = StringIndexer(inputCol="species", outputCol="label")

# Assemble the feature columns into a single vector
vec_assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features")

# Logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# Build the pipeline
pipeline = Pipeline(stages=[label_indexer, vec_assembler, lr])

# Split into training and test sets (8:2)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Train the model
model = pipeline.fit(train_df)

# Predict
predictions = model.transform(test_df)

# Evaluate the model (AUC metric)
evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction",
    metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"Model AUC: {auc}")
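What "areaUnderROC" actually measures: the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. A minimal pure-Python sketch of that definition (not MLlib's implementation, which works on the raw-prediction column at scale):

```python
def auc_roc(labels, scores):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as half a win
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```

An AUC of 0.5 means the model ranks no better than chance; on a cleanly separable pair of iris species you should see values close to 1.0.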
Common pitfalls / takeaways:
A Pipeline packages feature engineering, model training, and prediction into a single reusable unit
Logistic regression is the most basic classifier in MLlib; know its core parameters (maxIter for the iteration limit, regParam for regularization strength)
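The fit/transform contract that makes Pipeline reuse work can be sketched in a few lines of plain Python (a toy model of MLlib's Estimator/Transformer split, not its actual classes):

```python
class Scale:
    # Toy estimator: fit() learns the max value, transform() divides by it
    def fit(self, data):
        self.max = max(data)
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class MiniPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        # Fit each stage in order, feeding it the previous stage's output
        self.fitted = []
        for stage in self.stages:
            model = stage.fit(data)
            data = model.transform(data)
            self.fitted.append(model)
        return self
    def transform(self, data):
        # Replay only the learned transforms -- no refitting on new data
        for model in self.fitted:
            data = model.transform(data)
        return data

model = MiniPipeline([Scale()]).fit([1.0, 2.0, 4.0])
print(model.transform([2.0]))  # → [0.5], scaled by the max learned in fit
```

This is why fitting the pipeline on train_df and calling transform on test_df avoids leakage: statistics are learned once during fit and only applied afterwards.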
