202109101338 - Spark Tuning
How do you tune a compute-intensive job such as a clustering algorithm?
# What do the following components do?
- MemoryStore
- BlockManager
- BlockManagerMaster
The key point when tuning parallelism with spark-submit:
Number of tasks running in parallel = min(partitions, executors × executor-cores)
- partitions: each partition corresponds to one task
- executors: number of executors
- executor-cores: cores per executor
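The rule above can be sketched as a one-liner (the function name is my own; the formula and the example numbers are the ones from the runs described below):

```python
def effective_parallelism(partitions: int, executors: int, executor_cores: int) -> int:
    """Upper bound on how many tasks can run at the same time:
    min(partitions, executors * executor-cores)."""
    return min(partitions, executors * executor_cores)

# Original run: 2 partitions, 2 executors, 1 core each -> only 2 tasks in parallel.
print(effective_parallelism(2, 2, 1))      # 2
# After repartition(100) with 5 executors x 1 core -> 5 tasks in parallel.
print(effective_parallelism(100, 5, 1))    # 5
```

Note that adding partitions alone does not help once partitions already exceed executors × executor-cores; the cluster-side resources become the binding term of the `min`.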
A few relevant spark-submit parameters:
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
Test run:
spark-submit --master yarn --deploy-mode client \
--conf spark.yarn.dist.archives=hdfs://zjltcluster/share/external_table/share/external_table/app_bonc_zj/hdfs/hivedb/udf_jars/mlpy_env.tar.gz#python36 \
--conf spark.pyspark.driver.python=./python36/mlpy_env/bin/python \
--conf spark.pyspark.python=./python36/mlpy_env/bin/python \
--driver-memory 4g \
--executor-memory 2g \
--driver-cores 2 \
--executor-cores 2 \
--queue boncnqueue \
--num-executors 5 \
testMeanShift.py
0. Original approach
rdd = sc.textFile('hdfs/path')
2 partitions in total, so 2 executors in use and 2 tasks; about 1.5 hours. The YARN queue `default` stays below 5% utilization. Output: 2 files.
1. Increase the number of partitions
rdd = rdd.repartition(100)  # repartition returns a new RDD; the result must be assigned
100 partitions in total, still 2 executors with a task parallelism of 2, 100 tasks; about 1.2 hours. The YARN queue `default` stays below 5% utilization. Output: 100 files.
2. Add new configuration
--driver-memory 4g
--executor-memory 2g
--executor-cores 1
--queue thequeue
2.1
executor-cores 1
num-executors 5
queue boncnqueue
spark-submit --master yarn --deploy-mode client \
--conf spark.yarn.dist.archives=hdfs://zjltcluster/share/external_table/share/external_table/app_bonc_zj/hdfs/hivedb/udf_jars/mlpy_env.tar.gz#python36 \
--conf spark.pyspark.driver.python=./python36/mlpy_env/bin/python \
--conf spark.pyspark.python=./python36/mlpy_env/bin/python \
--driver-memory 4g \
--executor-memory 5g \
--driver-cores 2 \
--executor-cores 10 \
--num-executors 3 \
testMeanShift2.py
100 partitions in total, 5 executors in practice with a task parallelism of 5, 100 tasks; about 23 min. The YARN queue `default` stays below 5% utilization. Output: 100 files.
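A rough way to see why the run time drops (my own framing, not from the note): the 100 tasks are scheduled in "waves" of size equal to the parallelism, so raising parallelism from 2 to 5 cuts the number of waves from 50 to 20.

```python
import math

def scheduling_waves(num_tasks: int, parallelism: int) -> int:
    """Number of rounds needed to run num_tasks when only
    `parallelism` tasks can execute at once."""
    return math.ceil(num_tasks / parallelism)

print(scheduling_waves(100, 2))  # 50 waves  (observed: ~1.2 h)
print(scheduling_waves(100, 5))  # 20 waves  (observed: ~23 min)
```

The observed speedup (roughly 3×) is larger than the 2.5× wave ratio alone would suggest, so other factors (e.g. per-task skew or startup overhead) likely contribute as well.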
