Spark MLlib Basics: Classification (Logistic Regression)

Key concepts:
Logistic regression: a binary classifier that maps a linear combination of features to a probability (via the sigmoid function)
MLlib pipelines: Pipeline, PipelineStage, fit(), transform()
Model evaluation: BinaryClassificationEvaluator (for binary classification)
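The "probability mapping" in the first point is the sigmoid function σ(z) = 1/(1 + e^(−z)) applied to the linear score w·x + b. A minimal pure-Python sketch (the weights, bias, and sample below are made up purely for illustration, not taken from any trained model):

```python
import math

def sigmoid(z: float) -> float:
    """Map a linear score z = w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights/bias for two features (illustration only)
w, b = [0.8, -1.2], 0.3
x = [1.5, 0.5]
score = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear part: 0.9
prob = sigmoid(score)                              # probability of class 1
label = 1 if prob >= 0.5 else 0                    # threshold at 0.5
```

This is exactly what MLlib's LogisticRegression does internally: learn w and b, then threshold the resulting probability to produce the prediction column.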
Exercise:
Implement binary classification on the Iris dataset with logistic regression, wrapping the whole workflow in a Pipeline:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("iris-logreg").getOrCreate()

# Read the data

df = spark.read.csv("iris.csv", header=True, inferSchema=True)

# Iris has three species; keep only two so the task is truly binary
# (class names assumed to be setosa/versicolor/virginica -- adjust to your file)
df = df.filter(df.species != "virginica")

# Encode the string label as a numeric index

label_indexer = StringIndexer(inputCol="species", outputCol="label").fit(df)

# Assemble the feature columns into a single vector

vec_assembler = VectorAssembler(inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"], outputCol="features")

# Logistic regression model

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# Build the pipeline: indexing -> assembling -> classifier

pipeline = Pipeline(stages=[label_indexer, vec_assembler, lr])

# 80/20 train/test split

train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Train: fit() runs every pipeline stage on the training data

model = pipeline.fit(train_df)

# Predict: transform() applies the fitted stages to the test set

predictions = model.transform(test_df)

# Evaluate with AUC (area under the ROC curve)

evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"Model AUC: {auc}")
Common pitfalls:
BinaryClassificationEvaluator assumes a binary label; with all three Iris classes you would need MulticlassClassificationEvaluator, or filter the data down to two classes first.
A Pipeline bundles feature engineering, training, and prediction into one reusable unit.
Logistic regression is the most basic classifier in MLlib; know its core parameters (maxIter, regParam).
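To see what maxIter and regParam actually control, here is a toy pure-Python gradient-descent logistic regression with an L2 penalty. This is only a conceptual sketch (MLlib uses a different optimizer, L-BFGS, and this one-feature dataset is invented), but it shows that maxIter caps the optimizer passes and regParam shrinks the weights toward zero:

```python
import math

def train_logreg(X, y, max_iter=10, reg_param=0.0, lr=0.5):
    """Toy gradient-descent logistic regression with an L2 penalty.

    max_iter caps the number of optimizer passes; reg_param scales the
    L2 penalty term added to the gradient, shrinking weights toward 0.
    """
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            # Predicted probability for this sample
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / len(X)
        # Gradient step; reg_param * w is the L2 regularization term
        w = [wj - lr * (gj + reg_param * wj) for wj, gj in zip(w, grad)]
    return w

# Invented linearly separable data: label 1 when the feature is positive
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w_plain = train_logreg(X, y, max_iter=50, reg_param=0.0)
w_reg = train_logreg(X, y, max_iter=50, reg_param=0.5)
# The regularized weight has the same sign but smaller magnitude
```

In MLlib the same knobs are set directly on the estimator, e.g. `LogisticRegression(maxIter=10, regParam=0.1)`.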

posted @ 2026-02-05 12:06  再报错就堵桥0