January 24

Partition operators: operators that change the number of partitions in an RDD:

1-repartition (redistribute): can either increase or decrease the partition count; transformation operator

Example:
scala> val a=Array(1,2,34,4,5,6,7)
a: Array[Int] = Array(1, 2, 34, 4, 5, 6, 7)

scala> sc.makeRDD(a,3)
res90: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:27

scala> res90.repartition(2)
res91: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[68] at repartition at <console>:29

scala> res91.partitions.size
res92: Int = 2

Using the repartition operator always repartitions via a shuffle: the data is scattered and redistributed across the new partitions.
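To see the redistribution directly, glom() collects each partition into an array. A minimal spark-shell sketch, assuming the same sc and data as in the transcript above (the val names are just for illustration):

// glom() turns each partition into an Array, making element placement visible
val rdd = sc.makeRDD(Array(1,2,34,4,5,6,7), 3)
rdd.glom().collect().foreach(p => println(p.mkString(",")))  // contents of the 3 original partitions
val rep = rdd.repartition(2)                                 // the shuffle happens here
rep.glom().collect().foreach(p => println(p.mkString(",")))  // same elements, now spread over 2 partitions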

2-coalesce: can only decrease the partition count, not increase it (by default); transformation operator
Example:
scala> val a=Array(1,2,3,4,45,5,6,67,7,0)
a: Array[Int] = Array(1, 2, 3, 4, 45, 5, 6, 67, 7, 0)

scala> sc.makeRDD(a,3)
res98: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:27

scala> res98.coalesce(4)
res99: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[72] at coalesce at <console>:29

scala> res99.partitions.size
res100: Int = 3
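Note that coalesce(4) above left the count at 3: coalesce defaults to shuffle = false, so it can only narrow partitions, and asking for more than exist is a no-op. Passing shuffle = true lifts that restriction. A small sketch under that assumption (val names illustrative):

val rdd = sc.makeRDD(Array(1,2,3,4,45,5,6,67,7,0), 3)
println(rdd.coalesce(2).partitions.size)                  // 2: shrinking works, no shuffle needed
println(rdd.coalesce(4).partitions.size)                  // 3: growing silently does nothing
println(rdd.coalesce(4, shuffle = true).partitions.size)  // 4: with a shuffle, growth is allowed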

Aggregate functions:

1-aggregate(): action operator; takes an initial (zero) value.

aggregate(zeroValue)(seqOp, combOp) takes the zero value plus two functions: seqOp for the local (per-partition) aggregation and combOp for the global aggregation.

The data is split across partitions; seqOp first aggregates the data inside each partition (starting from the zero value), and combOp then merges the per-partition results:


scala> val a=Array(1,2,3,4,5,6,7)
a: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7)

scala> sc.makeRDD(a,3)
res105: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[77] at makeRDD at <console>:27

scala> res105.aggregate(0)((a,b)=>a+b,_+_)
res108: Int = 28
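A non-zero initial value makes the mechanics visible: the zero value is folded in once per partition by seqOp, and once more when the driver merges the partition results with combOp. A small sketch using the same 3-partition data:

val rdd = sc.makeRDD(Array(1,2,3,4,5,6,7), 3)
println(rdd.aggregate(0)(_ + _, _ + _))   // 28: plain sum
// 28 + 3*10 (zero value once per partition) + 10 (once more in the driver merge) = 68
println(rdd.aggregate(10)(_ + _, _ + _))  // 68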

2-aggregateByKey(): transformation operator; takes an initial (zero) value.

scala> val data=Array(("hello",1),("hello",2),("world",1))
data: Array[(String, Int)] = Array((hello,1), (hello,2), (world,1))

scala> sc.makeRDD(data,3)
res119: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[81] at makeRDD at <console>:27

scala> res119.aggregateByKey(0)(_+_,_+_)
res120: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[82] at aggregateByKey at <console>:29

scala> res120.collect
res121: Array[(String, Int)] = Array((world,1), (hello,3))

Operator notes:
scala> res13.aggregateByKey(10000)(_+_,_+_)
In aggregateByKey(zeroValue)(seqOp, combOp), e.g. (10000)(_+_,_+_):
- seqOp (the first _+_) runs inside each partition: for each key, the first _ starts from the zero value (10000) and the second _ is each element's value in that partition.
- combOp (the second _+_) is the global aggregation: the first _ is one partition's per-key result and the second _ is the next partition's result. The zero value is NOT added again at this stage.
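A sketch to make the per-partition zero value concrete, assuming the same data as above (3 records over 3 partitions, so each record sits alone in its partition):

val data = sc.makeRDD(Array(("hello",1),("hello",2),("world",1)), 3)
// zero value 10000 is added once per key per partition that contains the key,
// but never again during the global (combOp) merge:
data.aggregateByKey(10000)(_+_, _+_).collect()
// hello: (10000+1) + (10000+2) = 20003   world: 10000+1 = 10001
// => Array((hello,20003), (world,10001)) (order may vary)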