spark术语
---------------
1.RDD
弹性分布式数据集 , 轻量级数据集合。
内部含有5方面属性:
a.分区列表
b.计算函数
c.依赖列表
e.分区类(KV)
f.首选位置
创建RDD方式)
a.textFile
b.makeRDD/parallelize()
c.rdd变换
2.Stage
对RDD链条的划分,按照shuffle动作。
ShuffleMapStage
ResultStage
Stage和RDD关联,ResultStage和最后一个RDD关联。**************************
创建stage时按照shuffle依赖进行创建,shffleDep含有指向父RDD的
引用,因此之前的阶段就是ShuffleMapStage,而ShuffleMapStage所关联的
RDD就是本阶段的最后一个RDD。
总结:
每个Stage关联的RDD都是最后一个RDD.
3.依赖
Dependency,
子rdd的每个分区和父RDD的分区集合之间的对应关系。
创建RDD时创建的依赖。
NarrowDependency
OneToOne
Range
prune
ShuffleDependency
4.分区
RDD内分区列表,分区对应的数据的切片。
textFile(,n) ;
5.Task
任务,具体执行的单元。
每个分区对应一个任务。
任务内部含有对应分区和广播变量(串行的rdd和依赖)
ShuffleMapTask
ResultTask
6.
master的local模式
-----------------
local
local[4]
local[*]
local[4,5]
local-cluster[1,2,3] //1:N , 2:内核数,3:内存数
spark://...
Standalone提交job流程
---------------------
首选创建SparkContext,陆续在client创建三个调度框架(dag + task + backend),
启动task调度器,进而启动后台调度器,由后调度器创建AppClient对象,AppClient
启动后创建ClientEndpoint,该终端发送"RegisterApplication"消息给master,
master接受消息后完成应用的注册,回传App注册完成消息给ClientEndPoint,然后master
开始调度资源,向worker发送启动Driver和startExecutor消息,随后Worker上分别启动Driver
和执行器,driver也是在Executor进程中运行。
执行rdd时,由后台调度器发送消息给DriverEndpoint,driver终端再向executor发送LaunchTask的消息,
各worker节点上执行器接受命令,通过Executor启动任务。
CliendEndpoint DriverEndpoint
---------------------------------------------
SubmitDriverResponse StatusUpdate
KillDriverResponse ReviveOffers
RegisteredApplication KillTask
RegisterExecutor
StopDriver
StopExecutors
RemoveExecutor
RetrieveSparkAppConfig
spark job的部署模式
--------------------
[测试代码]
import java.lang.management.ManagementFactory
import java.net.InetAddress
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.tools.nsc.io.Socket
/**
* Created by Administrator on 2018/5/8.
*/
object WCAppScala {
def sendInfo(msg: String) = {
//获取ip
val ip = InetAddress.getLocalHost.getHostAddress
//得到pid
val rr = ManagementFactory.getRuntimeMXBean();
val pid = rr.getName().split("@")(0);//pid
//线程
val tname = Thread.currentThread().getName
//对象id
val oid = this.toString;
val sock = new java.net.Socket("s101", 8888)
val out = sock.getOutputStream
val m = ip + "\t:" + pid + "\t:" + tname + "\t:" + oid + "\t:" + msg + "\r\n"
out.write(m.getBytes)
out.flush()
out.close()
}
def main(args: Array[String]): Unit = {
//1.创建spark配置对象
val conf = new SparkConf()
conf.setAppName("wcApp")
conf.setMaster("spark://s101:7077")
sendInfo("before new sc! ") ;
//2.创建spark上下文件对象
val sc = new SparkContext(conf)
sendInfo("after new sc! ") ;
//3.加载文件
val rdd1 = sc.textFile("hdfs://mycluster/user/centos/1.txt" , 3)
sendInfo("load file! ");
//4.压扁
val rdd2 = rdd1.flatMap(line=>{
sendInfo("flatMap : " + line);
line.split(" ")
})
//5.标1成对
val rdd3 = rdd2.map(w => {
sendInfo("map : " + w)
(w, 1)})
//6.化简
val rdd4 = rdd3.reduceByKey((a,b)=>{
sendInfo("reduceByKey() : " + a + "/" + b)
a + b
})
//收集数据
val arr = rdd4.collect()
arr.foreach(println)
}
}
1.client
driver端运行在client主机上.默认模式。
spark-submit --class WCAppScala --maste spark://s101:7077 --deploy-mode client
2.cluster
driver运行在一台worker上。
上传jar包到hdfs。
hdfs dfs -put myspark.jar .
spark-submit --class WCAppScala --maste spark://s101:7077 --deploy-mode client
Spark资源管理
--------------------
1.Spark涉及的进程
Master //spark守护进程, daemon
Worker //spark守护进程, daemon
//core + memory,指worker可支配的资源。
Driver //
BackExecutor //
2.配置spark支配资源
[spark/conf/spark-env.sh]
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# YARN模式读取的选线
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
###################################################################
####################独立模式的守护进程配置#########################
###################################################################
# spark master绑定ip,0000
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
#master rpc端口,默认7077
# - SPARK_MASTER_PORT
#master webui端口 ,默认8080
#SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
#worker支配的内核数,默认所有可用内核。
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
#worker内存, 默认1g
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
#设置每个节点worker进程数,默认1
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
#
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
#分配给master、worker以及历史服务器本身的内存,默认1g.(最大对空间)
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
###################################################################
####################独立模式的守护进程配置#########################
###################################################################
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
3.job提交时,为job制定资源配置。
spark-submit
//设置driver内存数,默认1g
--driver-memory MEM
//每个执行器内存数,默认1g
--executor-memory MEM
//Only : standalone + cluster,Driver使用的内核总数。
--driver-cores NUM
//standalone | mesos , 指定job使用的内核总数
--total-executor-cores NUM
//standalone | yarn , job每个执行器内核数。
--executor-cores NUM
//YARN-only ,
--driver-cores NUM //驱动器内核总数
--num-executors NUM //启动的执行器个数
//内存比较统一
内核配置
集群模式 部署模式 参数
--------------------------------------------------
standalone | cluster | --driver-cores NUM
|--------------------------------------
| --total-executor-cores NUM //总执行器内核
| --executor-cores NUM //
---------------------------------------------------
yarn |--executor-cores NUM
|--driver-cores NUM
|--num-executors NUM