202109101338 - Spark Tuning

How do we tune a compute-intensive job such as a clustering algorithm?

# What is the role of each of the following?

MemoryStore: the in-memory block store inside each BlockManager; it holds cached RDD blocks in executor memory.

BlockManager: runs on the driver and on every executor; it reads and writes data blocks across memory, disk, and remote peers.

BlockManagerMaster: lives on the driver; it tracks which BlockManager holds which block and answers block-location queries.



The key to tuning parallelism with spark-submit:
number of concurrently running tasks = min(partitions, executors x executor-cores) (see the sketch below)

  • partitions: one partition corresponds to one task
  • executors: the number of executor processes launched
  • executor-cores: the number of CPU cores per executor
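A minimal sketch of this arithmetic (the path, app name, and flag values are illustrative, not taken from a real job):

# Sketch: how many tasks can run at once for a given submission.
from pyspark import SparkContext

sc = SparkContext(appName="parallelism-check")   # illustrative app name

rdd = sc.textFile("hdfs:///some/input")          # placeholder path
partitions = rdd.getNumPartitions()              # one task per partition

# Assumed to match the --num-executors / --executor-cores submit flags.
num_executors, executor_cores = 5, 2
slots = num_executors * executor_cores

print(min(partitions, slots))                    # tasks that run concurrently
sc.stop()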

A few relevant spark-submit parameters:

--driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in
                              YARN mode, or all available cores on the worker
                              in standalone mode)

YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in
                              cluster mode (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).

Tests

spark-submit --master yarn --deploy-mode client \
--conf spark.yarn.dist.archives=hdfs://zjltcluster/share/external_table/share/external_table/app_bonc_zj/hdfs/hivedb/udf_jars/mlpy_env.tar.gz#python36 \
--conf spark.pyspark.driver.python=./python36/mlpy_env/bin/python \
--conf spark.pyspark.python=./python36/mlpy_env/bin/python \
--driver-memory 4g \
--executor-memory 2g \
--driver-cores 2 \
--executor-cores 2 \
--queue boncnqueue \
--num-executors 5 \
testMeanShift.py

0. Original approach
rdd = sc.textFile('hdfs/path')

2 partitions in total; 2 executors in practice, 2 tasks, about 1.5 hours. Utilization of the YARN default resource queue stayed below 5%. Output: 2 files.
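Why only 2 partitions? sc.textFile creates roughly one partition per HDFS block of the input (subject to the minPartitions hint), so a small input yields few tasks. A hedged alternative to repartitioning afterwards is to request more splits at read time (the path is a placeholder, as above):

# Sketch: ask for more input splits up front; for splittable text input
# this typically yields at least the requested number of partitions.
rdd = sc.textFile("hdfs:///some/input", minPartitions=100)
print(rdd.getNumPartitions())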


1. Increase the partition count
rdd = rdd.repartition(100)

100 partitions in total; still 2 executors, task parallelism of 2, 100 tasks, about 1.2 hours. Utilization of the YARN default resource queue stayed below 5%. Output: 100 files.
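Note that repartition performs a full shuffle, and on save each partition becomes one part-file, which is why 100 partitions produced 100 output files. A small sketch (output path hypothetical) of merging partitions before writing:

# Sketch: coalesce reduces the partition count without a full shuffle,
# cutting the number of output part-files written on save.
rdd.coalesce(10).saveAsTextFile("hdfs:///some/output")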


2. Add new configuration
    --driver-memory 4g
    --executor-memory 2g
    --executor-cores 1
    --queue thequeue

2.1
    executor-cores 1
    num-executors  5
    queue          boncnqueue
spark-submit --master yarn --deploy-mode client \
--conf spark.yarn.dist.archives=hdfs://zjltcluster/share/external_table/share/external_table/app_bonc_zj/hdfs/hivedb/udf_jars/mlpy_env.tar.gz#python36 \
--conf spark.pyspark.driver.python=./python36/mlpy_env/bin/python \
--conf spark.pyspark.python=./python36/mlpy_env/bin/python \
--driver-memory 4g \
--executor-memory 5g \
--driver-cores 2 \
--executor-cores 10 \
--num-executors 3 \
testMeanShift2.py

100 partitions in total; 5 executors in practice, task parallelism of 5, 100 tasks, about 23 minutes. Utilization of the YARN default resource queue stayed below 5%. Output: 100 files.
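A rough sanity check on the speedup, assuming the figures reported above: with 5 task slots instead of 2, the 100 tasks finish in 20 waves instead of 50, which roughly matches the drop from ~1.2 h to ~23 min.

import math

partitions = 100
for slots in (2, 5):                       # executors x executor-cores
    waves = math.ceil(partitions / slots)  # sequential waves of tasks
    print(f"{slots} slots -> {waves} waves")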

