Spark, Spark SQL, and Hive

Installing and configuring Spark

1. Install Scala: sudo apt-get install scala    // run scala -version to check the installation
2. Install Spark
    tar -xzf spark-2.0.1-bin-hadoop2.7.tgz
    sudo mv spark-2.0.1-bin-hadoop2.7 spark210_h2.7    // rename the directory
3. Configure the environment
    cd spark210_h2.7/conf
    sudo gedit slaves    // add the worker nodes
    datanode1
    ….
    datanoden
    sudo gedit spark-env.sh    // add the entries below
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101
export SCALA_HOME=/usr/share/scala
export SPARK_HOME=/home/hadoop/spark210_h2.7
export SPARK_MASTER_IP=192.168.1.105   # address of the Spark master
# Memory made available to each worker. If this exceeds the worker host's physical
# memory, the worker will fail to start. If both the master and the worker hosts have
# plenty of memory, the two values below can be set to the same amount; when submitting,
# spark-submit's --executor-memory option can be set to roughly half of the host's
# physical memory.
export SPARK_WORKER_MEMORY=1024M
export SPARK_DRIVER_MEMORY=1024M
export HADOOP_HOME=/home/hadoop/hadoop
export SPARK_LIBRARY_PATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop

sudo gedit ~/.bashrc    // add the Spark and related paths

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101
#export HIVE_HOME=/home/
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HIVE_HOME=/home/hadoop/hive-2.1.0
export PIG_HOME=/home/hadoop/pig-0.16.0
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HIVE_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$PIG_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export SPARK_HOME=/home/hadoop/spark210_h2.7
export PATH=$PATH:$SPARK_HOME/bin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
#HADOOP VARIABLES END

source  ~/.bashrc

After the configuration is complete, start the Spark standalone cluster (sbin/start-all.sh under $SPARK_HOME) and open http://namenode:8080/ to check the master web UI.
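Besides the web UI, the effective memory settings can be checked from a spark-shell connected to the master. A minimal sketch, assuming the shell was started with spark-shell --master spark://namenode:7077 (spark.executor.memory and spark.driver.memory are standard Spark configuration keys):

    // Inside spark-shell: print the memory settings of the running context,
    // Some(value) if explicitly set, None if Spark is falling back to its defaults.
    println(sc.getConf.getOption("spark.executor.memory"))
    println(sc.getConf.getOption("spark.driver.memory"))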

A quick word-count test in spark-shell:

    // Read a file from HDFS, split it into words, count each word,
    // and write the result back to HDFS.
    val file = sc.textFile("hdfs://namenode:9000/tmp/READM.txt")
    val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    count.collect()
    count.saveAsTextFile("hdfs://namenode:9000/tmp/output")
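The same word count can also be packaged as a standalone application and submitted with spark-submit, like the SparkPi examples below. A minimal sketch (the object name WordCount and the jar it would be built into are illustrative, not something shipped with Spark):

    import org.apache.spark.{SparkConf, SparkContext}

    // Standalone word-count job: input path and output directory are passed
    // on the command line via spark-submit (the output directory must not exist yet).
    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)
        val counts = sc.textFile(args(0))
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile(args(1))
        sc.stop()
      }
    }

Once built into a jar, it would be submitted the same way as the example jars below, with --class WordCount and the input/output paths as trailing arguments.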

A quick smoke test with the bundled SparkPi example:
 run-example SparkPi 2>&1 | grep "Pi is roughly"
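SparkPi estimates π by Monte Carlo sampling: it draws random points in the unit square and counts the fraction that falls inside the quarter circle. A rough equivalent that can be pasted into spark-shell (the sample size n is arbitrary):

    // Monte Carlo estimate of pi run across the cluster.
    val n = 1000000
    val inside = sc.parallelize(1 to n).map { _ =>
      val x = math.random
      val y = math.random
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")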

-- Run a job on the cluster, submitted from the master (cluster deploy mode via the master's REST submission port 6066):
spark-submit --master spark://namenode:6066 --class org.apache.spark.examples.SparkPi --executor-memory 512M --deploy-mode cluster  /home/hadoop/spark210_h2.7/examples/jars/spark-examples_2.11-2.0.1.jar 100




// The following command produces a result and uses the workers; while the program runs, a CoarseGrainedExecutorBackend process appears on each worker host.
spark-submit --master spark://namenode:7077 --class org.apache.spark.examples.SparkPi --executor-memory 512M  /home/hadoop/spark210_h2.7/examples/jars/spark-examples_2.11-2.0.1.jar




Note: Spark on YARN supports two run modes, yarn-cluster and yarn-client. yarn-cluster is suited to production, while yarn-client is suited to interactive use and debugging, because the program output is visible in the client terminal. Running in client mode works much like the cluster-mode example above, so it is not repeated here.
// The following command runs the job on the cluster via YARN (yarn-cluster mode; since Spark 2.0 the preferred form is --master yarn --deploy-mode cluster, as the warning in the log below points out):
spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi --executor-memory 512M  /home/hadoop/spark210_h2.7/examples/jars/spark-examples_2.11-2.0.1.jar

If you run jps on a slave node, then in addition to the Hadoop, YARN and Spark processes you will also see:

CoarseGrainedExecutorBackend
ApplicationMaster


The result produced by this run can be found through the web UI, at http://namenode:8088/cluster/app/application_1477705130965_0003

Opening the logs there shows two links, stderr and stdout; stdout contains the computed result.

The submission and execution log looks like this:


Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
16/10/29 17:12:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/29 17:12:42 INFO client.RMProxy: Connecting to ResourceManager at namenode/192.168.1.105:8032
16/10/29 17:12:42 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/10/29 17:12:42 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/10/29 17:12:42 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/10/29 17:12:42 INFO yarn.Client: Setting up container launch context for our AM
16/10/29 17:12:42 INFO yarn.Client: Setting up the launch environment for our AM container
16/10/29 17:12:42 INFO yarn.Client: Preparing resources for our AM container
16/10/29 17:12:43 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/10/29 17:12:48 INFO yarn.Client: Uploading resource file:/tmp/spark-6dde4bfa-3d4a-440f-9a2c-ca2d83168b4e/__spark_libs__6290770329075115438.zip -> hdfs://namenode:9000/user/hadoop/.sparkStaging/application_1477705130965_0003/__spark_libs__6290770329075115438.zip
16/10/29 17:16:56 INFO yarn.Client: Uploading resource file:/home/hadoop/spark210_h2.7/examples/jars/spark-examples_2.11-2.0.1.jar -> hdfs://namenode:9000/user/hadoop/.sparkStaging/application_1477705130965_0003/spark-examples_2.11-2.0.1.jar
16/10/29 17:16:59 INFO yarn.Client: Uploading resource file:/tmp/spark-6dde4bfa-3d4a-440f-9a2c-ca2d83168b4e/__spark_conf__5804032335201433440.zip -> hdfs://namenode:9000/user/hadoop/.sparkStaging/application_1477705130965_0003/__spark_conf__.zip
16/10/29 17:16:59 INFO spark.SecurityManager: Changing view acls to: hadoop
16/10/29 17:16:59 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/10/29 17:16:59 INFO spark.SecurityManager: Changing view acls groups to:
16/10/29 17:16:59 INFO spark.SecurityManager: Changing modify acls groups to:
16/10/29 17:16:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
16/10/29 17:16:59 INFO yarn.Client: Submitting application application_1477705130965_0003 to ResourceManager
16/10/29 17:17:02 INFO impl.YarnClientImpl: Submitted application application_1477705130965_0003
16/10/29 17:17:03 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
16/10/29 17:17:03 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1477732622107
     final status: UNDEFINED
     tracking URL: http://namenode:8088/proxy/application_1477705130965_0003/
     user: hadoop
16/10/29 17:17:04 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
16/10/29 17:17:05 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
16/10/29 17:17:06 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
……
16/10/29 17:18:16 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
16/10/29 17:18:17 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
16/10/29 17:18:18 INFO yarn.Client: Application report for application_1477705130965_0003 (state: ACCEPTED)
16/10/29 17:18:19 INFO yarn.Client: Application report for application_1477705130965_0003 (state: RUNNING)
16/10/29 17:18:19 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.1.104
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1477732622107
     final status: UNDEFINED
     tracking URL: http://namenode:8088/proxy/application_1477705130965_0003/
     user: hadoop
16/10/29 17:18:20 INFO yarn.Client: Application report for application_1477705130965_0003 (state: RUNNING)
16/10/29 17:18:21 INFO yarn.Client: Application report for application_1477705130965_0003 (state: RUNNING)
16/10/29 17:18:22 INFO yarn.Client: Application report for application_1477705130965_0003 (state: RUNNING)
……
16/10/29 17:20:50 INFO yarn.Client: Application report for application_1477705130965_0003 (state: RUNNING)
16/10/29 17:20:51 INFO yarn.Client: Application report for application_1477705130965_0003 (state: RUNNING)
16/10/29 17:20:52 INFO yarn.Client: Application report for application_1477705130965_0003 (state: FINISHED)
16/10/29 17:20:52 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.1.104
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1477732622107
     final status: SUCCEEDED
     tracking URL: http://namenode:8088/proxy/application_1477705130965_0003/
     user: hadoop
16/10/29 17:20:52 INFO yarn.Client: Deleting staging directory hdfs://namenode:9000/user/hadoop/.sparkStaging/application_1477705130965_0003
16/10/29 17:20:52 INFO util.ShutdownHookManager: Shutdown hook called
16/10/29 17:20:52 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6dde4bfa-3d4a-440f-9a2c-ca2d83168b4e


Spark accessing Hive: integrating Spark SQL with Hive
1. Copy $HIVE_HOME/conf/hive-site.xml and hive-log4j.properties to $SPARK_HOME/conf/
2. In $SPARK_HOME/conf/, edit spark-env.sh and add:
export HIVE_HOME=/usr/local/apache-hive-0.13.1-bin
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.15-bin.jar:$SPARK_CLASSPATH
3. Optionally, adjust Spark's log4j configuration so that the console is not flooded with extra INFO messages:
log4j.rootCategory=WARN, console
That is all it takes to integrate Spark SQL with Hive. After the configuration, restart the Spark master and slaves.
Go to $SPARK_HOME/bin and run ./spark-sql --name "hadoop" --master spark://127.0.0.1:7077 to enter the spark-sql command line:
 
spark-sql> show databases;
OK
default
lxw1234
usergroup_mdmp
userservice_mdmp
ut
Time taken: 0.093 seconds, Fetched 5 row(s)
spark-sql> use lxw1234;
OK
Time taken: 0.074 seconds
spark-sql> select * from t_lxw1234;
2015-05-10 url1
2015-05-10 url2
2015-06-14 url1
2015-06-14 url2
2015-06-15 url1
2015-06-15 url2
Time taken: 0.33 seconds, Fetched 6 row(s)
spark-sql> desc t_lxw1234;
day string NULL
url string NULL
Time taken: 0.113 seconds, Fetched 2 row(s)
//ROW_NUMBER()
spark-sql> select url,day,row_number() over(partition by url order by day) as rn from t_lxw1234;
url1 2015-05-10 1
url1 2015-06-14 2
url1 2015-06-15 3
url2 2015-05-10 1
url2 2015-06-14 2
url2 2015-06-15 3
Time taken: 1.114 seconds, Fetched 6 row(s)
//COUNT()
spark-sql> select url,day,count(1) over(partition by url order by day) as rn from t_lxw1234;
url1 2015-05-10 1
url1 2015-06-14 2
url1 2015-06-15 3
url2 2015-05-10 1
url2 2015-06-14 2
url2 2015-06-15 3
Time taken: 0.934 seconds, Fetched 6 row(s)
//LAG()
spark-sql> select url,day,lag(day) over(partition by url order by day) as rn from t_lxw1234;
url1 2015-05-10 NULL
url1 2015-06-14 2015-05-10
url1 2015-06-15 2015-06-14
url2 2015-05-10 NULL
url2 2015-06-14 2015-05-10
url2 2015-06-15 2015-06-14
Time taken: 0.897 seconds, Fetched 6 row(s)
spark-sql>
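The same tables and window functions can also be used from spark-shell or compiled code through a SparkSession with Hive support. A minimal sketch against the Spark 2.x API (the database and table names reuse lxw1234/t_lxw1234 from the session above):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // A Hive-enabled session; hive-site.xml in $SPARK_HOME/conf tells it where the metastore is.
    val spark = SparkSession.builder()
      .appName("hive-window-example")
      .enableHiveSupport()
      .getOrCreate()

    // Plain SQL, exactly as typed in the spark-sql CLI.
    spark.sql("select url, day, row_number() over(partition by url order by day) as rn from lxw1234.t_lxw1234").show()

    // The equivalent DataFrame formulation using a window specification.
    val w = Window.partitionBy("url").orderBy("day")
    spark.table("lxw1234.t_lxw1234").withColumn("rn", row_number().over(w)).show()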


Spark accessing Hive
----------------------------------------------------------


Running Spark SQL on YARN

./spark-sql --master yarn --executor-memory 512M

spark-shell on YARN
If you already have a working Hadoop YARN environment, you only need to download the matching Spark build and unpack it; that machine then acts as a Spark client.
Point Spark at YARN's configuration directory, e.g. export HADOOP_CONF_DIR=/etc/hadoop/conf; this can be set in spark-env.sh.
Run:
cd $SPARK_HOME/bin
./spark-shell \
  --master yarn-client \
  --executor-memory 1G \
  --num-executors 10
Note that --master must use yarn-client mode here; specifying yarn-cluster fails with:
Error: Cluster deploy mode is not applicable to Spark shells.
This is because spark-shell is an interactive command line, so the driver has to run locally rather than on YARN.
The other parameters work the same as when submitting any Spark application to YARN.
Once started, the command line looks no different from standalone mode:
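A quick way to confirm that the shell is really backed by YARN executors is to run a small job from it; a minimal sketch (sc is the SparkContext that spark-shell creates):

    println(sc.master)                              // typically prints "yarn"
    println(sc.parallelize(1 to 1000000).count())   // a small job run on the YARN executors; prints 1000000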

On the ResourceManager's web page you can see the application (spark-shell runs on YARN as a long-lived application):

Clicking the ApplicationMaster link opens the Spark application's monitoring web UI:

 
spark-sql on YARN
Running the spark-sql command line on YARN works on the same principle as spark-shell on YARN; the extra step is adding the jars Hive needs to the Spark environment:
1. Copy hive-site.xml to $SPARK_HOME/conf
2. Add export HIVE_HOME=/usr/local/apache-hive-0.13.1-bin to spark-env.sh
3. Add the following jars to the Spark classpath:
datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar, mysql-connector-java-5.1.15-bin.jar
They can be appended to the SPARK_CLASSPATH variable in spark-env.sh.
 
Run:
cd $SPARK_HOME/bin
./spark-sql \
  --master yarn-client \
  --executor-memory 1G \
  --num-executors 10
and the spark-sql command line runs on YARN.

What appears on the ResourceManager, and the Spark web UI reached by clicking the ApplicationMaster link, are the same as for spark-shell.

 
In other words, if you already run Hadoop YARN, you can get the full power of Spark without setting up a standalone Spark cluster.
 
Test: count(*) over 10 million rows
spark-sql> select count(*) from  data_signal_all;
16/10/29 21:11:56 INFO execution.SparkSqlParser: Parsing command: select count(*) from  data_signal_all
16/10/29 21:11:56 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=data_signal_all
16/10/29 21:11:56 INFO HiveMetaStore.audit: ugi=hadoop    ip=unknown-ip-addr    cmd=get_table : db=default tbl=data_signal_all    
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=data_signal_all
16/10/29 21:11:57 INFO HiveMetaStore.audit: ugi=hadoop    ip=unknown-ip-addr    cmd=get_table : db=default tbl=data_signal_all    
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:57 INFO parser.CatalystSqlParser: Parsing command: string
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.1.105:34843 in memory (size: 4.3 KB, free: 366.3 MB)
16/10/29 21:11:59 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/10/29 21:11:59 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 283.8 KB, free 366.0 MB)
16/10/29 21:11:59 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 24.8 KB, free 366.0 MB)
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.1.105:34843 (size: 24.8 KB, free: 366.3 MB)
16/10/29 21:11:59 INFO spark.SparkContext: Created broadcast 3 from processCmd at CliDriver.java:376
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on datanode1:37160 in memory (size: 4.3 KB, free: 117.0 MB)
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.1.105:34843 in memory (size: 3.7 KB, free: 366.3 MB)
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on datanode1:35292 in memory (size: 3.7 KB, free: 117.0 MB)
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 0
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.105:34843 in memory (size: 2.4 KB, free: 366.3 MB)
16/10/29 21:11:59 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on datanode1:37160 in memory (size: 2.4 KB, free: 117.0 MB)
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 45
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 46
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 47
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 48
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 49
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 50
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 51
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 52
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 53
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 54
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 55
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 56
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned accumulator 57
16/10/29 21:11:59 INFO spark.ContextCleaner: Cleaned shuffle 0
16/10/29 21:12:00 INFO mapred.FileInputFormat: Total input paths to process : 1
16/10/29 21:12:00 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Registering RDD 15 (processCmd at CliDriver.java:376)
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Got job 2 (processCmd at CliDriver.java:376) with 1 output partitions
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (processCmd at CliDriver.java:376)
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 3)
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[15] at processCmd at CliDriver.java:376), which has no missing parents
16/10/29 21:12:00 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 11.6 KB, free 366.0 MB)
16/10/29 21:12:00 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.1 KB, free 366.0 MB)
16/10/29 21:12:00 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.1.105:34843 (size: 6.1 KB, free: 366.3 MB)
16/10/29 21:12:00 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1012
16/10/29 21:12:00 INFO scheduler.DAGScheduler: Submitting 7 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[15] at processCmd at CliDriver.java:376)
16/10/29 21:12:00 INFO cluster.YarnScheduler: Adding task set 3.0 with 7 tasks
16/10/29 21:12:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, datanode1, partition 0, NODE_LOCAL, 5528 bytes)
16/10/29 21:12:00 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 4, datanode1, partition 1, NODE_LOCAL, 5528 bytes)
16/10/29 21:12:00 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 3 on executor id: 2 hostname: datanode1.
16/10/29 21:12:00 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 4 on executor id: 1 hostname: datanode1.
16/10/29 21:12:01 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on datanode1:35292 (size: 6.1 KB, free: 117.0 MB)
16/10/29 21:12:03 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on datanode1:35292 (size: 24.8 KB, free: 116.9 MB)
16/10/29 21:12:18 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on datanode1:37160 (size: 6.1 KB, free: 117.0 MB)
16/10/29 21:12:24 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on datanode1:37160 (size: 24.8 KB, free: 116.9 MB)
16/10/29 21:12:54 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 3.0 (TID 5, datanode1, partition 2, NODE_LOCAL, 5528 bytes)
16/10/29 21:12:54 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 5 on executor id: 2 hostname: datanode1.
16/10/29 21:12:54 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 54405 ms on datanode1 (1/7)
16/10/29 21:13:00 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 3.0 (TID 6, datanode1, partition 3, NODE_LOCAL, 5528 bytes)
16/10/29 21:13:00 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 6 on executor id: 1 hostname: datanode1.
16/10/29 21:13:00 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 4) in 59875 ms on datanode1 (2/7)
16/10/29 21:13:11 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 3.0 (TID 7, datanode1, partition 4, NODE_LOCAL, 5528 bytes)
16/10/29 21:13:11 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 7 on executor id: 2 hostname: datanode1.
16/10/29 21:13:11 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 3.0 (TID 5) in 17338 ms on datanode1 (3/7)
16/10/29 21:13:19 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 3.0 (TID 8, datanode1, partition 5, NODE_LOCAL, 5528 bytes)
16/10/29 21:13:19 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 8 on executor id: 1 hostname: datanode1.
16/10/29 21:13:19 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 3.0 (TID 6) in 19574 ms on datanode1 (4/7)
16/10/29 21:13:26 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 3.0 (TID 9, datanode1, partition 6, NODE_LOCAL, 5528 bytes)
16/10/29 21:13:26 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 9 on executor id: 2 hostname: datanode1.
16/10/29 21:13:26 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 3.0 (TID 7) in 14207 ms on datanode1 (5/7)
16/10/29 21:13:32 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 3.0 (TID 8) in 13035 ms on datanode1 (6/7)
16/10/29 21:13:38 INFO scheduler.DAGScheduler: ShuffleMapStage 3 (processCmd at CliDriver.java:376) finished in 98.509 s
16/10/29 21:13:38 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/10/29 21:13:38 INFO scheduler.DAGScheduler: running: Set()
16/10/29 21:13:38 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 4)
16/10/29 21:13:38 INFO scheduler.DAGScheduler: failed: Set()
16/10/29 21:13:38 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 3.0 (TID 9) in 12615 ms on datanode1 (7/7)
16/10/29 21:13:38 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[18] at processCmd at CliDriver.java:376), which has no missing parents
16/10/29 21:13:38 INFO memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 7.0 KB, free 366.0 MB)
16/10/29 21:13:38 INFO cluster.YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/10/29 21:13:38 INFO memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 3.7 KB, free 366.0 MB)
16/10/29 21:13:38 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.1.105:34843 (size: 3.7 KB, free: 366.3 MB)
16/10/29 21:13:38 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1012
16/10/29 21:13:38 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[18] at processCmd at CliDriver.java:376)
16/10/29 21:13:38 INFO cluster.YarnScheduler: Adding task set 4.0 with 1 tasks
16/10/29 21:13:38 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 10, datanode1, partition 0, NODE_LOCAL, 5381 bytes)
16/10/29 21:13:38 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Launching task 10 on executor id: 1 hostname: datanode1.
16/10/29 21:13:39 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on datanode1:37160 (size: 3.7 KB, free: 116.9 MB)
16/10/29 21:13:40 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.1.104:38946
16/10/29 21:13:40 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 157 bytes
16/10/29 21:13:43 INFO scheduler.DAGScheduler: ResultStage 4 (processCmd at CliDriver.java:376) finished in 4.165 s
16/10/29 21:13:43 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at CliDriver.java:376, took 102.879223 s
16/10/29 21:13:43 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 10) in 4164 ms on datanode1 (1/1)
16/10/29 21:13:43 INFO cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
10000000
Time taken: 106.385 seconds, Fetched 1 row(s)
16/10/29 21:13:43 INFO CliDriver: Time taken: 106.385 seconds, Fetched 1 row(s)
