January 24
Partitioning operators: change the number of partitions of an RDD.
1-repartition (redistribute): can make the partition count larger or smaller; a transformation operator.
Example:
scala> val a=Array(1,2,34,4,5,6,7)
a: Array[Int] = Array(1, 2, 34, 4, 5, 6, 7)
scala> sc.makeRDD(a,3)
res90: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:27
scala> res90.repartition(2)
res91: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[68] at repartition at <console>:29
scala> res91.partitions.size
res92: Int = 2
repartition always repartitions the data: it triggers a shuffle, and the records are scrambled and redistributed across the new partitions.
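The example above only shrank the RDD; repartition works in both directions. A minimal sketch (reusing the spark-shell's sc from the transcript above) that grows the partition count instead:

val rdd = sc.makeRDD(Array(1, 2, 3, 4, 5, 6, 7), 3)
val grown = rdd.repartition(5)    // always shuffles, even when growing
println(grown.partitions.size)    // expected: 5
println(grown.toDebugString)      // the lineage shows the shuffle introduced by repartition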
2-coalesce: by default can only make the partition count smaller, not larger; a transformation operator.
Example:
scala> val a=Array(1,2,3,4,45,5,6,67,7,0)
a: Array[Int] = Array(1, 2, 3, 4, 45, 5, 6, 67, 7, 0)
scala> sc.makeRDD(a,3)
res98: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:27
scala> res98.coalesce(4)
res99: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[72] at coalesce at <console>:29
scala> res99.partitions.size
res100: Int = 3
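The request for 4 partitions was silently capped at 3 because coalesce takes a shuffle flag that defaults to false, and without a shuffle existing partitions can only be merged, never split. A minimal sketch of the shuffle = true variant (same sc assumption):

val rdd2 = sc.makeRDD(Array(1, 2, 3, 4, 45, 5, 6, 67, 7, 0), 3)
println(rdd2.coalesce(4).partitions.size)                   // expected: 3 (default: shuffle = false)
println(rdd2.coalesce(4, shuffle = true).partitions.size)   // expected: 4
// repartition(n) is in fact implemented as coalesce(n, shuffle = true).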
Aggregation functions:
1-aggregate(): an action operator; it takes an initial value.
aggregate() has an initial value: aggregate(initial value)(local aggregation function, global aggregation function).
The data is split across partitions; the elements inside each partition are summed first, then the per-partition results are summed together:
scala> val a=Array(1,2,3,4,5,6,7)
a: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7)
scala> sc.makeRDD(a,3)
res105: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[77] at makeRDD at <console>:27
scala> res105.aggregate(0)((a,b)=>a+b,_+_)
res108: Int = 28
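A detail the 0 initial value hides: aggregate folds the initial value into every partition's local result and once more into the global combine. A minimal sketch with a non-zero initial value (same sc assumption; the expected total assumes the 3 partitions used above):

val nums = sc.makeRDD(Array(1, 2, 3, 4, 5, 6, 7), 3)
// seqOp sums inside each partition starting from 10;
// combOp sums the partition results, again starting from 10.
val total = nums.aggregate(10)(_ + _, _ + _)
println(total)    // expected: 28 + 3*10 + 10 = 68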
2-aggregateByKey(): a transformation operator; it takes an initial value.
scala> val data=Array(("hello",1),("hello",2),("world",1))
data: Array[(String, Int)] = Array((hello,1), (hello,2), (world,1))
scala> sc.makeRDD(data,3)
res119: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[81] at makeRDD at <console>:27
scala> res119.aggregateByKey(0)(_+_,_+_)
res120: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[82] at aggregateByKey at <console>:29
scala> res120.collect
res121: Array[(String, Int)] = Array((world,1), (hello,3))
Operator notes:
scala> res13.aggregateByKey(10000)(_+_,_+_)
In the first _+_ (the local aggregation), the first _ starts as the initial value and the second _ is each element's value; the initial value is not added again during the global aggregation.
In the second _+_ (the global aggregation), the first _ is one partition's aggregated result and the second _ is the next partition's aggregated result.
In (0)(_+_, _+_): underscore 1 starts as the initial value 0, underscore 2 is an element's value; underscores 3 and 4 are the per-partition results being combined.
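To see those rules in a run: with the three-element dataset above spread over 3 partitions, each ("hello", v) pair sits in its own partition, so the initial value 10000 enters the local aggregation once per key per partition, but is never added in the global combine. A minimal sketch (same sc assumption):

val pairs = sc.makeRDD(Array(("hello", 1), ("hello", 2), ("world", 1)), 3)
// seqOp: per key, per partition, starting from the initial value 10000.
// combOp: sums the partition results only; the initial value is not reused.
val agg = pairs.aggregateByKey(10000)(_ + _, _ + _).collect
agg.foreach(println)    // expected: (hello,20003), (world,10001)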