Spark Operators Summary
RDD Operator Classification
Value-type transformations
One-to-one mapping between input and output partitions
(1)map(func): produces a new RDD in which every element is obtained by applying the function func to the corresponding element of the parent RDD
val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
val rdd2: RDD[Int] = rdd1.map(x => x*2)
println(rdd2.collect().mkString(","))//2,4,6,8,10
(2)mapPartitions
//mapPartitions takes an iterator over each partition and operates on every element of that partition
//the result is equivalent to map, but it can do things map cannot, such as opening a database connection once per partition
val rdd3: RDD[String] = sc.parallelize(List("20180101", "20180102", "20180103", "20180104", "20180105", "20180106"),3)
val rdd4: RDD[String] = rdd3.mapPartitions(iter => {
//printed three times, once per partition
println("===================")
iter.map(date => "aa" + date)
})
//aa20180101,aa20180102,aa20180103,aa20180104,aa20180105,aa20180106
println(rdd4.collect().mkString(","))
(3)flatMap
//flatMap transforms every element of the RDD into new elements via func and flattens the result, merging all the resulting collections into one new collection
val rdd5 = sc.parallelize(List("I have a pen", "I have an apple", "I have a pen", "I have a pineapple"), 2)
val rdd6: RDD[Array[String]] = rdd5.map(x => x.split(" "))
//map======= [Ljava.lang.String;@72d0f2b4,[Ljava.lang.String;@6d2dc9d2,[Ljava.lang.String;@1da4b6b3,[Ljava.lang.String;@b2f4ece
println("map======= " + rdd6.collect().mkString(","))
val rdd7: RDD[String] = rdd5.flatMap(x => x.split(" "))
//flatMap==== I,have,a,pen,I,have,an,apple,I,have,a,pen,I,have,a,pineapple
println("flatMap==== " + rdd7.collect().mkString(","))
Many-to-one mapping between input and output partitions
(1)union
(2)cartesian
val rdd11 = sc.parallelize(Seq("Apple", "Banana", "Orange","Orange"))
val rdd22 = sc.parallelize(Seq("Banana", "Pineapple"))
val rdd33 = sc.parallelize(Seq("Durian"))
//union merges two RDDs; the element types must be the same, and no deduplication is performed
val rdd44: RDD[String] = rdd11.union(rdd22).union(rdd33)
//union========== Apple,Banana,Orange,Orange,Banana,Pineapple,Durian
println("union========== " + rdd44.collect().mkString(","))
//cartesian computes the Cartesian product of the two RDDs
val rdd55: RDD[(String, String)] = rdd11.cartesian(rdd22)
//cartesian======== (Apple,Banana),(Apple,Pineapple),(Banana,Banana),(Banana,Pineapple),(Orange,Banana),(Orange,Pineapple),(Orange,Banana),(Orange,Pineapple)
println("cartesian======== " + rdd55.collect().mkString(","))
Many-to-many mapping between input and output partitions
(1)groupBy: takes a function whose return value is used as the key, and then groups the elements by that key
val rdd1_1: RDD[Int] = sc.parallelize(1 to 9, 3)
val rdd2_2: RDD[(String, Iterable[Int])] = rdd1_1.groupBy(x => {if (x % 2 == 0) "even" else "odd"})
//(even,CompactBuffer(2, 4, 6, 8))
//(odd,CompactBuffer(1, 3, 5, 7, 9))
rdd2_2.collect().foreach(println(_))
Output partitions are a subset of the input partitions
(1)filter
(2)distinct
(3)subtract
(4)sample
(5)intersection
//filter filters the elements of the RDD, keeping an element when func returns true for it and discarding it otherwise
//filter=========== Banana
println("filter=========== " + rdd11.filter(_.contains("ana")).collect().mkString(","))
//distinct removes duplicate elements from the RDD
//distinct========== Orange,Apple,Banana
println("distinct========== " + rdd11.distinct().collect().mkString(","))
//intersection returns the intersection of the elements of two RDDs
val rdd66: RDD[String] = rdd11.intersection(rdd22)
//intersection========== Banana
println("intersection========== " + rdd66.collect().mkString(","))
//subtract is similar to intersection, but returns the elements that appear in this RDD and not in otherRDD, without deduplication
//subtract========== Apple,Orange,Orange
println("subtract========== " + rdd11.subtract(rdd22).collect().mkString(","))
//sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong)
//samples the RDD; when withReplacement is true the sampling is done with replacement (an element may be drawn more than once), false means without replacement
//fraction is the sampling fraction
//seed is the random seed, e.g. the current timestamp
val value: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
/**
 * The output differs on every run, just like a random function in Java:
 * 1. With replacement: a drawn element is put back before the next draw, so the same element may be drawn again.
 * 2. Without replacement: a drawn element is not put back, so it can never be drawn twice.
 */
println("sample======== " + value.sample(false,0.5,System.currentTimeMillis()).collect().mkString(","))
println("sample======== " + value.sample(true,0.5,System.currentTimeMillis()).collect().mkString(","))
Cache operators
(1)cache
(2)persist
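Neither operator is shown in code above; a minimal sketch, assuming an existing SparkContext sc (cache() is simply persist(StorageLevel.MEMORY_ONLY), and both are lazy, taking effect on the first action):
//cache keeps the RDD in memory once it has been computed; persist lets you choose a storage level
import org.apache.spark.storage.StorageLevel
val cached = sc.parallelize(1 to 100).cache()  //equivalent to persist(StorageLevel.MEMORY_ONLY)
val persisted = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)  //spill to disk when memory is insufficient
println(cached.count())  //the first action materializes and caches the RDD
println(persisted.count())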
Key-value transformations
One-to-one mapping between input and output partitions
(1)mapValues is the same as the basic map transformation, except that mapValues applies the function only to the V values of [K,V] pairs
val rdd1: RDD[(Int, String)] = sc.parallelize(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
val rdd2: RDD[(Int, String)] = rdd1.mapValues(x => x + "_")
//mapValues========= (1,A_),(2,B_),(3,C_),(4,D_)
println("mapValues========= " + rdd2.collect().mkString(","))
Aggregation over a single RDD
(1)groupByKey
(2)reduceByKey
//groupByKey and reduceByKey group the elements of an RDD[key,value] by key (reduceByKey additionally merges the values with the given function)
val scoreDetail = sc.parallelize(List(("xiaoming","A"), ("xiaodong","B"), ("peter","B"), ("liuhua","C"), ("xiaofeng","A")), 3)
val scoreDetail2 = sc.parallelize(List("A", "B", "B", "D", "B", "D", "E", "A", "E"), 3)
val scoreGroup: Array[(String, Iterable[String])] = scoreDetail.map(x => (x._2,x._1)).groupByKey().collect()
val scoreGroup2: Array[(String, Iterable[Int])] = scoreDetail2.map(x => (x,1)).groupByKey().collect()
val scoreReduce: Array[(String, String)] = scoreDetail.map(x => (x._2,x._1)).reduceByKey(_ + _).collect()
val scoreReduce2: Array[(String, Int)] = scoreDetail2.map(x => (x,1)).reduceByKey(_ + _).collect()
println("=============groupByKey=====")
//(B,CompactBuffer(xiaodong, peter)),(C,CompactBuffer(liuhua)),(A,CompactBuffer(xiaoming, xiaofeng))
println(scoreGroup.mkString(","))
println("=============groupByKey=====")
//(B,CompactBuffer(1, 1, 1)),(E,CompactBuffer(1, 1)),(A,CompactBuffer(1, 1)),(D,CompactBuffer(1, 1))
println(scoreGroup2.mkString(","))
println("=============reduceByKey=====")
//(B,xiaodongpeter),(C,liuhua),(A,xiaomingxiaofeng)
println(scoreReduce.mkString(","))
println("=============reduceByKey=====")
//(B,3),(E,2),(A,2),(D,2)
println(scoreReduce2.mkString(","))
(3)combineByKey
/**
 * combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
 * combineByKey is the most commonly used key-based aggregation function; most other key-based aggregation functions are implemented on top of it.
 * combineByKey() traverses every element of a partition, so each element's key has either never been seen before or is the same as the key of an element seen earlier.
 * createCombiner: for a key encountered for the first time in a partition, creates the initial value of that key's accumulator.
 * Note that this happens the first time a key appears in each partition, not the first time the key appears in the whole RDD.
 * mergeValue: if the key has already been seen while processing the current partition, merges that key's current accumulator value with the new value.
 * mergeCombiners: because each partition is processed independently, the same key may end up with several accumulators. If two or more partitions
 * hold an accumulator for the same key, mergeCombiners merges those results.
 */
val initialScores = Array(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0), ("Wilma", 93.0), ("Wilma", 95.0), ("Wilma", 98.0))
val r1 = sc.parallelize(initialScores)
val r2: RDD[(String, (Int, Double))] = r1.combineByKey(
score => (1, score),  //createCombiner: start a (count, sum) accumulator for the first score of a key in a partition
(acc: (Int, Double), score) => (acc._1 + 1, acc._2 + score),  //mergeValue: fold another score of the same key into the accumulator
(a: (Int, Double), b: (Int, Double)) => (a._1 + b._1, a._2 + b._2)  //mergeCombiners: merge accumulators of the same key from different partitions
)
println("=============combineByKey================== ")
//(Wilma,95.33333333333333),(Fred,91.33333333333333)
println(r2.map(k => (k._1,k._2._2/k._2._1)).collect().mkString(","))
(4)partitionBy
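partitionBy is not shown in code above; a minimal sketch using Spark's built-in HashPartitioner:
//partitionBy redistributes a key-value RDD according to the given Partitioner
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")), 4)
val repartitioned = pairs.partitionBy(new HashPartitioner(2))  //keys with the same hash land in the same partition
println(repartitioned.getNumPartitions)  //2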
Aggregation over two RDDs
(1)cogroup
//cogroup: for the key-value elements of two RDDs, the elements sharing the same key within each RDD are gathered into their own collection.
// Unlike groupByKey, it combines elements with the same key across multiple RDDs
val data1 = sc.parallelize(List((1,"Hadoop"),(2,"Spark")))
val data2 = sc.parallelize(List((1,"Java"),(1,"Java"),(2,"Scala"),(3,"Python")))
val data3 = sc.parallelize(List((1,"Hbase"),(2,"Hive"),(3,"Mongodb")))
//note that the element type of the result is: (Int, (Iterable[String], Iterable[String], Iterable[String]))
val data4: RDD[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = data1.cogroup(data2,data3)
//(1,(CompactBuffer(Hadoop),CompactBuffer(Java, Java),CompactBuffer(Hbase))),(3,(CompactBuffer(),CompactBuffer(Python),CompactBuffer(Mongodb))),(2,(CompactBuffer(Spark),CompactBuffer(Scala),CompactBuffer(Hive)))
println(data4.collect().mkString(","))
Joins
(1)join
//join joins two RDDs on their keys
val data6 = sc.parallelize(Array(("A", 1),("b", 2),("c", 3)))
val data7 = sc.parallelize(Array(("A", 4),("A", 6),("b", 7),("c", 3),("c", 8)))
//note that the element type of the result is: (String, (Int, Int))
val data8: RDD[(String, (Int, Int))] = data6.join(data7)
//(b,(2,7)),(A,(1,4)),(A,(1,6)),(c,(3,3)),(c,(3,8))
println(data8.collect().mkString(","))
(2)leftOuterJoin
(3)rightOuterJoin
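Neither join variant is shown in code above; a minimal sketch reusing data6 and data7 from the join example:
//leftOuterJoin keeps every key of the left RDD; values missing on the right side become None
val leftJoined: RDD[(String, (Int, Option[Int]))] = data6.leftOuterJoin(data7)
println(leftJoined.collect().mkString(","))
//rightOuterJoin keeps every key of the right RDD; values missing on the left side become None
val rightJoined: RDD[(String, (Option[Int], Int))] = data6.rightOuterJoin(data7)
println(rightJoined.collect().mkString(","))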
Action operators
No output
(1)foreach
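foreach is not shown in code above; a minimal sketch:
//foreach applies a function to every element on the executors and returns nothing to the driver
sc.parallelize(List(1, 2, 3, 4, 5)).foreach(x => println(x))  //printed on the executors, not necessarily visible on the driver console in cluster mode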
Operating on HDFS
(1)saveAsTextFile
(2)saveAsObjectFile
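Neither operator is shown in code above; a minimal sketch (the output paths below are placeholders and must not already exist):
//saveAsTextFile writes each element as a line of text; saveAsObjectFile writes serialized objects as a SequenceFile
sc.parallelize(List(1, 2, 3, 4, 5)).saveAsTextFile("hdfs://namenode:8020/tmp/text_output")  //hypothetical path
sc.parallelize(List(1, 2, 3, 4, 5)).saveAsObjectFile("hdfs://namenode:8020/tmp/object_output")  //hypothetical path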
Counting operators
(1)count
(2)countByKey
(3)countByValue
val data1 = sc.parallelize(Array(("A", 1),("b", 2),("c", 3),("A", 4),("A", 6),("b", 7),("c", 3),("c", 8)))
//count returns the number of elements in the RDD
println(data1.count()) //8
//countByKey returns how many times each key occurs in an RDD[K,V]; the return type here is collection.Map[String, Long]
//b -> 2,A -> 3,c -> 3
println(data1.countByKey().mkString(","))
//countByValue counts how many times each value (i.e. each whole element) occurs in the RDD; the return type here is collection.Map[(String, Int), Long]
//(c,3) -> 2,(b,2) -> 1,(b,7) -> 1,(c,8) -> 1,(A,6) -> 1,(A,1) -> 1,(A,4) -> 1
println(data1.countByValue().mkString(","))
Collection operators
(1)collect
(2)take
(3)takeOrdered
(4)top
val data1 = sc.parallelize(Array(("A", 1),("b", 2),("c", 3),("A", 4),("A", 6),("b", 7),("c", 3),("c", 8)))
//take returns the elements with indices 0 to num - 1 from the RDD, without sorting
//(A,1)
println(data1.take(1).mkString(","))
//takeOrdered returns num elements from the RDD in ascending order (the default ordering)
//(A,1),(A,4),(A,6)
println(data1.takeOrdered(3).mkString(","))
//top is similar to takeOrdered, but sorts in descending order
//(c,8),(c,3),(c,3)
println(data1.top(3).mkString(","))
Aggregation operators
(1)reduce
(2)fold
val data2 = sc.parallelize(List(1,2,3,2,3,4,5,4,3,6,8,76,8),3)
//reduce aggregates the elements of the RDD with the given function
//note: reduce is an action, whereas reduceByKey is a transformation
val d = data2.reduce(_ + _)
val f = data2.filter(_ > 4).reduce(_ + _)
println(d)
println(f)
//fold is similar to reduce, but aggregates the RDD starting from the given zero value
val e = data2.fold(0)(_ + _)
val h = data2.filter(_ > 4).fold(0)(_ + _)
println(e)
println(h)
(3)aggregate
/**
 * aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
 * The zero value and the first element of the first partition are passed to the seq function,
 * then its result and the second element are passed to the seq function again, and so on until the last element.
 * Every other partition is processed the same way; finally the zero value (note: the zero value 3 also takes part in the combine step)
 * and the results of all partitions are run through the combine function to produce the final result.
 * Note: aggregate takes curried parameters, aggregate(3)(seq1,combine1), not aggregate(3,seq1,combine1).
 */
val data3: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6),3)
def seq1(a: Int,b: Int) = {
println("seq: " + a + "\t" + b)
math.min(a,b)
}
def combine1(a: Int,b: Int) = {
//output while combining:
//combine: 3 1
//combine: 4 3
//combine: 7 3
println("combine: " + a + "\t" + b)
a + b
}
println("aggregate=========== " + data3.aggregate(3)(seq1,combine1)) //10
(4)aggregateByKey: note that aggregateByKey is actually a transformation (it returns an RDD) rather than an action, just like combineByKey; under the hood it is implemented with combineByKey
//compute the per-key average
val data4 = sc.parallelize(Seq(("A",110),("A",130),("A",120), ("B",200),("B",206),("B",206), ("C",150),("C",160),("C",170)))
val data5: RDD[(String, (Int, Int))] = data4.aggregateByKey((0,0))(
(acc, v) => (acc._1 + v, acc._2 + 1),  //seqOp: accumulate (sum, count) within a partition
(a, b) => (a._1 + b._1, a._2 + b._2))  //combOp: merge (sum, count) pairs across partitions
val data6: RDD[(String, Int)] = data5.mapValues(x => x._1 / x._2)  //average = sum / count (integer division)
println(data6.collect().mkString(","))