Installing Spark and Using It on a Cluster
Today is my first day learning Spark.
My environment consists of Ubuntu virtual machines: cloud01, cloud02, and cloud03, with cloud01 as the Master of an already-built Hadoop cluster. Before installing Spark, the JDK, Hadoop 2.x, and the Scala 2.10.4 package are already in place.
1. First, copy the downloaded scala-2.10.4.tgz file into the VM's home folder, then extract the archive:
tar -zxvf ~/scala-2.10.4.tgz
2. Configure the environment variables
sudo gedit /etc/profile
Add the following lines to the file:
export SCALA_HOME=/home/hduser/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
The value after the equals sign on the first line is the directory where you extracted Scala; the second line adds its bin directory to the search PATH.
Reload the file so the changes take effect:
source /etc/profile
3. Copy Scala to cloud02 and cloud03
scp -r scala-2.10.4 cloud02:~/
scp -r scala-2.10.4 cloud03:~/
After copying, repeat step 2 on both cloud02 and cloud03.
4. Verify on each of the three VMs that Scala was installed successfully, as shown below.
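A quick check (assuming the PATH change from step 2 is in effect on each node) is to print the Scala version, which should report 2.10.4:
scala -version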
Next, install Spark.
1. Obtain and extract the package
tar -zxvf spark-1.4.0-bin-hadoop2.4.tgz
2. Edit the configuration files
Change into the conf directory:
cd spark-1.4.0-bin-hadoop2.4/conf
(1) Configure spark-env.sh
cp spark-env.sh.template spark-env.sh
Open the spark-env.sh file:
sudo gedit spark-env.sh
Append the following lines to spark-env.sh:
export HADOOP_CONF_DIR=/home/hduser/hadoop-2.2.0
export JAVA_HOME=/usr/local/java/jdk1.7.0_51
export SCALA_HOME=/home/hduser/scala-2.10.4
export SPARK_MASTER_IP=192.168.91.128
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=2g
export SPARK_JAR=/home/hduser/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
(2) Configure spark-defaults.conf
cp spark-defaults.conf.template spark-defaults.conf
Append the following line to spark-defaults.conf:
spark.master=spark://192.168.91.128:7077
(3) Configure the slaves file
cp slaves.template slaves
Append the following lines to slaves; these are the IP addresses of cloud02 and cloud03:
192.168.149.133
192.168.149.134
3. Configure the environment variables
sudo gedit /etc/profile
Add the following lines:
export SPARK_HOME=/home/hduser/spark-1.4.0-bin-hadoop2.4/
export PATH=$PATH:$SPARK_HOME/bin
Reload the file so the changes take effect:
source /etc/profile
4. Copy Spark to cloud02 and cloud03
scp -r ~/spark-1.4.0-bin-hadoop2.4 cloud02:~/
scp -r ~/spark-1.4.0-bin-hadoop2.4 cloud03:~/
Repeat step 3 on each node.
5. Start Spark
Start Hadoop first:
start-all.sh
cd spark-1.4.0-bin-hadoop2.4
sbin/start-all.sh
bin/spark-shell
If the scala> prompt appears, Spark is installed successfully.
(To shut down Spark:
ctrl+c
sbin/stop-all.sh
cd
stop-all.sh)
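Optionally, before running jobs you can confirm the cluster is up (a quick check, assuming the IPs and ports configured above): running jps should list a Master process on cloud01 and a Worker process on cloud02 and cloud03, and the Master web UI should be reachable at http://192.168.91.128:8080.
jps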
//**** Demonstrate the default number of partitions
val rdd=sc.parallelize(1 to 5)
rdd.partitions.size
//**** Demonstrate specifying the number of partitions (here 3)
val rdd=sc.parallelize(1 to 5, 3)
rdd.partitions.size
//**** Demonstrate creating an RDD from a collection with makeRDD
val rdd=sc.makeRDD(1 to 10, 3)
rdd.collect
//**** Demonstrate creating an RDD from a collection with parallelize
val rdd=sc.parallelize(1 to 5)
rdd.collect
//**** Demonstrate creating an RDD from an external storage system (HDFS)
val rdd=sc.textFile("hdfs://cloud01:9000/data/file.txt")
rdd.collect
//**** Demonstrate map
val a = sc.parallelize(1 to 9, 3)
val b = a.map(x => x*2)
a.collect
b.collect
//**** Demonstrate mapValues
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"))
val b = a.map(x => (x.length, x))
b.collect
b.mapValues(x=>"hello" + x + "hello").collect
//**** Demonstrate flatMap
val a = sc.parallelize(1 to 4)
val b = a.flatMap(x => 1 to x)
b.collect
//**** Demonstrate flatMapValues
val a = sc.parallelize(List(("A",2),("B",4),("B",6))) // for the value 6 the range 6 to 5 is empty, so it produces no elements
val b = a.flatMapValues(x=>x to 5)
b.collect
//**** Demonstrate groupByKey and sortByKey
val s = sc.parallelize(List(("A",2),("B",4),("C",6),("A",3),("C",7)))
s.groupByKey().collect
s.sortByKey().collect
s.sortByKey(false).collect // sort in descending order
//**** Demonstrate reduceByKey
val a = sc.parallelize(List(("A",2),("B",4),("B",6),("B",7),("B",1)))
a.reduceByKey((x,y) => x + y).collect
//**** Demonstrate join
val l=sc.parallelize(Array(("A", 1), ("A", 2), ("B", 1), ("C", 1)), 1)
val r=sc.parallelize(Array(("A", 'x'), ("B", 'y'), ("B", 'z'), ("D", 'w')), 1)
val joinrdd1=l.join(r).collect
val joinrdd2=l.leftOuterJoin(r).collect
val joinrdd3=l.rightOuterJoin(r).collect
val joinrdd4=l.fullOuterJoin(r).collect
//**** Demonstrate filter
val rdd7=sc.makeRDD(1 to 10).filter(_%3==0)
rdd7.collect
//**** Demonstrate the timing difference after cache (a timing sketch follows below)
val rdd=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8")
rdd.cache
rdd.count
rdd.count
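To actually see the effect, here is a minimal timing sketch (assuming the same sogou.utf8 file on HDFS); the second count should be noticeably faster because the data is served from the cache instead of being re-read from HDFS:
// time a block of code and return (result, elapsed seconds)
def timed[T](body: => T): (T, Double) = {
  val start = System.nanoTime
  val result = body
  (result, (System.nanoTime - start) / 1e9)
}
val cached = sc.textFile("hdfs://cloud01:9000/data/sogou.utf8").cache
println("first count (materializes the cache): " + timed(cached.count))
println("second count (served from cache): " + timed(cached.count))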
//**** Demonstrate actions
val rdd10=sc.makeRDD(1 to 10, 1)
rdd10.first
rdd10.count
rdd10.collect
rdd10.take(3)
rdd10.top(3)
//******************************* Example 1: WordCount **********************************
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line =>line.split("[ ,?]+")).map(word =>(word,1)).reduceByKey((a,b) => a + b).collect
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line =>line.split(" ")).map(word =>(word,1)).reduceByKey((a,b) => a + b).map(x=>(x._2,x._1)).sortByKey(false).map(y=>(y._2,y._1)).collect
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).collect
//*** Save the results to HDFS
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).saveAsTextFile("hdfs://cloud01:9000/output01")
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line=>line.split(" ")).map(word =>(word,1)).reduceByKey((a,b) =>a+b).saveAsTextFile("hdfs://cloud01:9000/output02")
hdfs dfs -ls /input/output01201
hdfs dfs -text /input/output01201/part-00000
hdfs dfs -getmerge /output01201/part-00000 /output01201/part-00001 result
hdfs dfs -getmerge /output01201/part-* result
//*** Sort the results by value
wcrdd.map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect // first redefine wcrdd above without the trailing .collect, so it is still an RDD
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line =>line.split(" ")).map(word =>(word,1)).reduceByKey((a,b) =>a+b).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).saveAsTextFile("hdfs://cloud01:9000/output03")
//******************************* Example 2: Sogou query logs ***************************************
hdfs dfs -tail /input/sogou.utf8
val data=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8")
data.cache
data.count
data.map(_.split("\t")).map(x=>x(0)).filter(_>"20111230010101").count //data.map(_.split("\t")(0)).filter(_>"20111230010101").count
data.map(_.split("\t")).map(x=>x(0)).filter(x=>(x>"20111230010101")&&(x<"20111230010200")).count
data.map(_.split("\t")).map(x=>x(0)).filter(_>"20111230010101").filter(_<"20111230010200").count
//*** Count records where both the result rank and the click order are 1
data.map(_.split("\t")).filter(_(3).toInt==1).filter(_(4).toInt==1).count
data.map(_.split("\t")).filter(_(3)=="1").filter(_(4)=="1").count
data.map(_.split("\t")).map(x=>(x(3),x(4))).filter(x=>x._1=="1").filter(x=>x._2=="1").count
data.map(_.split("\t")).filter(x=>((x(3).toInt==1) && (x(4).toInt==1))).count
data.map(_.split("\t")).filter(x=>((x(3).toInt==1) || (x(4).toInt==1))).count
//**** Count records whose query contains the keyword "google"
data.map(_.split("\t")).filter(_.length==6).filter(_(2).contains("google")).count
//*** Ranking of sessions by number of queries
val data=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8").cache
data.map(x=>x.split("\t")).map(x=>(x(1),1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
//************* Ranking of clicked URLs *************
val data=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8").cache
data.map(x=>x.split("\t")).map(x=>(x(5),1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
//***************************** Submitting a Spark application in cluster mode **********************************
bin/spark-submit ~/simpleApp/simpleApp.jar
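The post does not show the contents of simpleApp.jar, so here is a minimal sketch of what such an application might look like and how it could be submitted. The SimpleApp object name, the input/output paths, and the jar path are assumptions for illustration only; note that spark-submit normally also needs --class and --master:
// SimpleApp.scala -- hypothetical word-count application packaged into simpleApp.jar
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
    // same word count as in Example 1, but as a standalone application
    val counts = sc.textFile("hdfs://cloud01:9000/data/file.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://cloud01:9000/output_simpleapp") // hypothetical output path
    sc.stop()
  }
}

bin/spark-submit --class SimpleApp --master spark://192.168.91.128:7077 ~/simpleApp/simpleApp.jar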
******************************************************************************************