Spark MLlib Basics: Classification Algorithms (Logistic Regression)
Key points:
Logistic regression: a binary classification algorithm that maps a linear model's output to a probability (via the sigmoid function)
MLlib pipelines: Pipeline, PipelineStage, fit(), transform()
Model evaluation: BinaryClassificationEvaluator (for binary classification)
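To make the "probability mapping" concrete, here is a minimal pure-Python sketch of what logistic regression computes at prediction time (the weights below are made-up numbers for illustration, not a trained model):

```python
import math

def sigmoid(z):
    # Squash a raw linear score into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Linear combination w.x + b, then mapped through the sigmoid
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights/features, for illustration only
p = predict_proba([0.8, -0.5], 0.1, [1.2, 0.7])
print(p)  # a probability between 0 and 1; predict class 1 if p >= 0.5
```

MLlib's LogisticRegression learns the weights and bias from data; the prediction step is this same mapping applied to the assembled feature vector.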
Exercise:
Implement binary classification on the iris dataset with logistic regression, wrapping the whole workflow in a Pipeline:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("iris-lr").getOrCreate()

# Read the data
df = spark.read.csv("iris.csv", header=True, inferSchema=True)

# Iris has three species; drop one (e.g. virginica) so the task is
# genuinely binary -- BinaryClassificationEvaluator assumes two classes
df = df.filter(df.species != "virginica")

# Label encoding (convert the string label to a numeric index);
# pass the unfitted estimator so the pipeline fits it on the training set
label_indexer = StringIndexer(inputCol="species", outputCol="label")

# Assemble the feature columns into a single vector
vec_assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features")

# Logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# Build the pipeline
pipeline = Pipeline(stages=[label_indexer, vec_assembler, lr])

# Split into training and test sets (8:2)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Train the model
model = pipeline.fit(train_df)

# Predict
predictions = model.transform(test_df)

# Evaluate the model (AUC metric)
evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction",
    metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"Model AUC: {auc}")
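What "areaUnderROC" actually measures: the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. A minimal pure-Python sketch of that definition (not MLlib's implementation, which works on the raw-prediction column at scale):

```python
def auc_roc(labels, scores):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as half a win
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```

An AUC of 0.5 means the model ranks no better than chance; on a cleanly separable pair of iris species you should see values close to 1.0.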
Common pitfalls / takeaways:
A Pipeline packages feature engineering, model training, and prediction into a single reusable unit
Logistic regression is the most basic classifier in MLlib; know its core parameters (maxIter for the iteration limit, regParam for regularization strength)
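The fit/transform contract that makes Pipeline reuse work can be sketched in a few lines of plain Python (a toy model of MLlib's Estimator/Transformer split, not its actual classes):

```python
class Scale:
    # Toy estimator: fit() learns the max value, transform() divides by it
    def fit(self, data):
        self.max = max(data)
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class MiniPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        # Fit each stage in order, feeding it the previous stage's output
        self.fitted = []
        for stage in self.stages:
            model = stage.fit(data)
            data = model.transform(data)
            self.fitted.append(model)
        return self
    def transform(self, data):
        # Replay only the learned transforms -- no refitting on new data
        for model in self.fitted:
            data = model.transform(data)
        return data

model = MiniPipeline([Scale()]).fit([1.0, 2.0, 4.0])
print(model.transform([2.0]))  # → [0.5], scaled by the max learned in fit
```

This is why fitting the pipeline on train_df and calling transform on test_df avoids leakage: statistics are learned once during fit and only applied afterwards.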
