Installing Spark and using it on a cluster

Today is my first day learning Spark.

My environment consists of three Ubuntu virtual machines, cloud01, cloud02 and cloud03, with cloud01 as the Master of a Hadoop cluster I have already set up. Before installing Spark, the JDK, Hadoop 2.x and Scala 2.10.4 need to be in place.

1. First, copy the downloaded scala-2.10.4.tgz into the VM's home folder, then extract the archive:

tar -zxvf ~/scala-2.10.4.tgz

2. Configure the environment variables

sudo gedit /etc/profile

Add the following lines to the file:

export SCALA_HOME=/home/hduser/scala-2.10.4

export PATH=$PATH:$SCALA_HOME/bin

The value after the equals sign on the first line is the directory where you extracted Scala; the second line adds its bin directory to the search path.

Reload the file so the changes take effect:

source /etc/profile

3. Copy Scala to cloud02 and cloud03

scp -r scala-2.10.4 cloud02:~/

scp -r scala-2.10.4 cloud03:~/

After copying, repeat step 2 on cloud02 and cloud03.

4. Verify the Scala installation on each of the three VMs, e.g. by running scala -version and checking that it reports 2.10.4.

 

Next, install Spark

1. Download and extract the package

tar -zxvf spark-1.4.0-bin-hadoop2.4.tgz

2. Edit the configuration files

Change into the conf directory:

cd spark-1.4.0-bin-hadoop2.4/conf

(1) Configure spark-env.sh

cp spark-env.sh.template spark-env.sh

Open the spark-env.sh file:

sudo gedit spark-env.sh

Append the following to spark-env.sh:

export HADOOP_CONF_DIR=/home/hduser/hadoop-2.2.0

export JAVA_HOME=/usr/local/java/jdk1.7.0_51

export SCALA_HOME=/home/hduser/scala-2.10.4

export SPARK_MASTER_IP=192.168.91.128

export SPARK_MASTER_PORT=7077

export SPARK_MASTER_WEBUI_PORT=8080

export SPARK_WORKER_PORT=7078

export SPARK_WORKER_WEBUI_PORT=8081

export SPARK_WORKER_CORES=2

export SPARK_WORKER_INSTANCES=1

export SPARK_WORKER_MEMORY=2g

export SPARK_JAR=/home/hduser/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar

 

(2) Configure spark-defaults.conf
cp spark-defaults.conf.template spark-defaults.conf
Append the following line to spark-defaults.conf:
spark.master=spark://192.168.91.128:7077

(3) Configure the slaves file
cp slaves.template slaves
Append the following lines to slaves (the IP addresses of cloud02 and cloud03):
192.168.149.133
192.168.149.134

3. Configure the environment variables
sudo gedit /etc/profile
Add the following lines:
export SPARK_HOME=/home/hduser/spark-1.4.0-bin-hadoop2.4/
export PATH=$PATH:$SPARK_HOME/bin
Reload the file so the changes take effect:
source /etc/profile

4. Copy Spark to cloud02 and cloud03
scp -r ~/spark-1.4.0-bin-hadoop2.4 cloud02:~/
scp -r ~/spark-1.4.0-bin-hadoop2.4 cloud03:~/
Repeat step 3 on each node.

5. Start Spark

Start Hadoop first:

start-all.sh

cd spark-1.4.0-bin-hadoop2.4

sbin/start-all.sh

bin/spark-shell

If the scala> prompt appears, the installation succeeded.
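
As a quick sanity check, you can run a small job in the spark-shell (the master URL below assumes the configuration above):

sc.master                         // should print spark://192.168.91.128:7077
sc.parallelize(1 to 100).sum()    // should return 5050.0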

(To shut down Spark:

Ctrl+C  (exit the spark-shell)

sbin/stop-all.sh  (stop the Spark cluster)

cd

stop-all.sh  (stop Hadoop))

 

//**** Default number of partitions
val rdd=sc.parallelize(1 to 5)  
rdd.partitions.size
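// without an explicit partition count, parallelize uses spark.default.parallelism, which in standalone mode usually defaults to the total number of worker cores, so the value here depends on your cluster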

 

//**** Specify the number of partitions (here 3)
val rdd=sc.parallelize(1 to 5, 3)
rdd.partitions.size


//**** Create an RDD from a collection with makeRDD
val rdd=sc.makeRDD(1 to 10, 3)
rdd.collect


//**** Create an RDD from a collection with parallelize
val rdd=sc.parallelize(1 to 5)
rdd.collect


//**** Create an RDD from an external storage system (HDFS)
val rdd=sc.textFile("hdfs://cloud01:9000/data/file.txt")
rdd.collect


//**** map
val a = sc.parallelize(1 to 9, 3)
val b = a.map(x => x*2)
a.collect
b.collect


//**** mapValues
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"))
val b = a.map(x => (x.length, x))
b.collect
b.mapValues(x=>"hello" + x + "hello").collect
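// mapValues applies the function only to the value of each (key, value) pair; the key (the word length here) is left unchanged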


//**** flatMap
val a = sc.parallelize(1 to 4)
val b = a.flatMap(x => 1 to x)
b.collect
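// b.collect returns Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4): each x is expanded into the sequence 1 to x and the results are flattened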


//**** flatMapValues
val a = sc.parallelize(List(("A",2),("B",4),("B",6))) // 6 to 5 is an empty range, so the ("B",6) pair produces no output
val b = a.flatMapValues(x=>x to 5)
b.collect
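// b.collect returns Array((A,2), (A,3), (A,4), (A,5), (B,4), (B,5))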


//**** groupByKey, sortByKey
val s = sc.parallelize(List(("A",2),("B",4),("C",6),("A",3),("C",7)))
s.groupByKey().collect
s.sortByKey().collect
s.sortByKey(false).collect // sort in descending order


//**** reduceByKey
val a = sc.parallelize(List(("A",2),("B",4),("B",6),("B",7),("B",1)))
a.reduceByKey((x,y) => x + y).collect
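// values with the same key are summed, giving (A,2) and (B,18); the order of the pairs in the output may vary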

 


//**** join
val l=sc.parallelize(Array(("A", 1), ("A", 2), ("B", 1), ("C", 1)), 1)
val r=sc.parallelize(Array(("A", 'x'), ("B", 'y'), ("B", 'z'), ("D", 'w')), 1)
val joinrdd1=l.join(r).collect
val joinrdd2=l.leftOuterJoin(r).collect
val joinrdd3=l.rightOuterJoin(r).collect
val joinrdd4=l.fullOuterJoin(r).collect
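// join keeps only keys present in both RDDs (A and B); leftOuterJoin also keeps C with None on the right, rightOuterJoin also keeps D with None on the left, and fullOuterJoin keeps all four keys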

//**** filter
val rdd7=sc.makeRDD(1 to 10).filter(_%3==0)
rdd7.collect
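// keeps only the multiples of 3: Array(3, 6, 9)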


//**** Timing comparison with cache
val rdd=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8")
rdd.cache
rdd.count
rdd.count
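// the first count reads the file from HDFS and fills the cache; the second count reads from memory and should be noticeably faster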


//**** Actions
val rdd10=sc.makeRDD(1 to 10, 1)
rdd10.first
rdd10.count
rdd10.collect
rdd10.take(3)
rdd10.top(3)
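// take(3) returns the first 3 elements in partition order; top(3) returns the 3 largest, here Array(10, 9, 8)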

 


//*******************************   Example 1: WordCount  **********************************

val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line =>line.split("[ ,?]+")).map(word =>(word,1)).reduceByKey((a,b) => a + b).collect

val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line =>line.split(" ")).map(word =>(word,1)).reduceByKey((a,b) => a + b).map(x=>(x._2,x._1)).sortByKey(false).map(y=>(y._2,y._1)).collect
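// the two map(x=>(x._2,x._1)) calls swap (word, count) to (count, word) and back, so that sortByKey(false) sorts the pairs by count in descending order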


val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).collect

//*** Save the result to HDFS
val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).saveAsTextFile("hdfs://cloud01:9000/output01")

val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line=>line.split(" ")).map(word =>(word,1)).reduceByKey((a,b) =>a+b).saveAsTextFile("hdfs://cloud01:9000/output02")

 

hdfs dfs -ls /input/output01201
hdfs dfs -text /input/output01201/part-00000
hdfs dfs -getmerge /output01201/part-00000 /output01201/part-00001 result
hdfs dfs -getmerge /output01201/part-* result


//*** Sort the result by value
wcrdd.map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect  // only works if wcrdd above was defined without the trailing .collect, i.e. wcrdd is still an RDD

val wcrdd=sc.textFile("hdfs://cloud01:9000/data/file.txt").flatMap(line =>line.split(" ")).map(word =>(word,1)).reduceByKey((a,b) =>a+b).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).saveAsTextFile("hdfs://cloud01:9000/output03")

 

//*******************************   Example 2: Sogou logs   ***************************************
hdfs dfs -tail /input/sogou.utf8

val data=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8")
data.cache
data.count

data.map(_.split("\t")).map(x=>x(0)).filter(_>"20111230010101").count  //data.map(_.split("\t")(0)).filter(_>"20111230010101").count
data.map(_.split("\t")).map(x=>x(0)).filter(x=>(x>"20111230010101")&&(x<"20111230010200")).count
data.map(_.split("\t")).map(x=>x(0)).filter(_>"20111230010101").filter(_<"20111230010200").count
//*** Count records where both the result rank and the click order are 1
data.map(_.split("\t")).filter(_(3).toInt==1).filter(_(4).toInt==1).count
data.map(_.split("\t")).filter(_(3)=="1").filter(_(4)=="1").count
data.map(_.split("\t")).map(x=>(x(3),x(4))).filter(x=>x._1=="1").filter(x=>x._2=="1").count


data.map(_.split("\t")).filter(x=>((x(3).toInt==1) && (x(4).toInt==1))).count
data.map(_.split("\t")).filter(x=>((x(3).toInt==1) || (x(4).toInt==1))).count

//**** Count records whose query contains the keyword "google"
data.map(_.split("\t")).filter(_.length==6).filter(_(2).contains("google")).count

//*** Ranking of sessions by number of queries
val data=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8").cache
data.map(x=>x.split("\t")).map(x=>(x(1),1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)


//************* Ranking of clicked URLs *************
val data=sc.textFile("hdfs://cloud01:9000/data/sogou.utf8").cache
data.map(x=>x.split("\t")).map(x=>(x(5),1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)

 

//*****************************   Submitting a Spark application in cluster mode   **********************************
bin/spark-submit ~/simpleApp/simpleApp.jar
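
If --class is omitted, spark-submit tries to read the main class from the jar's manifest; it is usually clearer to pass it explicitly. A more explicit invocation looks like this (SimpleApp is a hypothetical main class name, replace it with your application's class; the master URL matches spark-defaults.conf above):

bin/spark-submit --class SimpleApp --master spark://192.168.91.128:7077 ~/simpleApp/simpleApp.jar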

******************************************************************************************
