January 21

10-reduceByKey operator:

** An aggregation operator: it merges the values of identical keys.

** An operator that can repartition: the second argument specifies the number of partitions of the result.

Example 1:
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  val conf = new SparkConf().setMaster("local[*]").setAppName("demo")
  val sc = new SparkContext(conf)
  val data = Array(("hello", 1), ("hello", 1), ("world", 2))
  val rdd1: RDD[(String, Int)] = sc.makeRDD(data, 3)
  // aggregate values of identical keys with _ + _ and repartition the result to 2 partitions
  val rdd2 = rdd1.reduceByKey(_ + _, 2)
  println(rdd2.partitions.size)   // 2
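
As a quick check, a minimal sketch reusing rdd2 from Example 1: collecting shows that values of identical keys have been summed.

  rdd2.collect().foreach(println)   // prints (hello,2) and (world,2); order is not guaranteed
  sc.stop()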

11-sortBy operator: (a => a). For sorting data in an RDD there is no sortWith or sorted.

   ** This operation sorts data. Before sorting, the data can be processed through a function f, and the result of f determines the sort order; ascending by default. The new RDD produced by the sort has the same number of partitions as the original RDD.

scala> val a=Array(4,3,2,5,67,87,9,6,54,33,67)
a: Array[Int] = Array(4, 3, 2, 5, 67, 87, 9, 6, 54, 33, 67)

scala> sc.makeRDD(a,3)
res38: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at makeRDD at <console>:27

scala> res38.sortBy(a=>a)
res39: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at sortBy at <console>:29

scala> res39.collect
res40: Array[Int] = Array(2, 3, 4, 5, 6, 9, 33, 54, 67, 67, 87)
// Ascending output by default; for descending output, pass ascending = false as sortBy's second argument (see the sketch below).
// You can also sort by hashCode value, e.g. sortBy(a => a.hashCode).
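
A minimal sketch of the two variations mentioned in the comments above, reusing res38 built from `a` in the transcript:

res38.sortBy(x => x, false).collect()     // descending: Array(87, 67, 67, 54, 33, 9, 6, 5, 4, 3, 2)
res38.sortBy(x => x.hashCode).collect()   // sort by hashCode; for Int, hashCode equals the value itself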

 

 

12-sortByKey operator: sorts by key:

** This operation sorts the data of a (key, value) RDD by its keys:

scala> val a=Array(("zhangsan",2000),("lisi",3000),("wangwu",4000))
a: Array[(String, Int)] = Array((zhangsan,2000), (lisi,3000), (wangwu,4000))

scala> sc.makeRDD(a,3)
res45: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[27] at makeRDD at <console>:27

scala> res45.sortByKey()
res46: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[31] at sortByKey at <console>:29

scala> res46.collect
res47: Array[(String, Int)] = Array((lisi,3000), (wangwu,4000), (zhangsan,2000))
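
sortByKey also accepts an ascending flag (true by default). A short sketch of descending order, reusing res45 from the transcript above:

res45.sortByKey(false).collect()   // Array((zhangsan,2000), (wangwu,4000), (lisi,3000))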


13-union (set union)

** This operation merges the data of two RDDs of the same element type:

** The partitions of the two RDDs are combined as well: the result keeps the partitions of both inputs (see the sketch after the example):

scala> val data1=Array(1,2,3)
data1: Array[Int] = Array(1, 2, 3)

scala> val data2=Array(4,5,6)
data2: Array[Int] = Array(4, 5, 6)

scala> sc.makeRDD(data1,3)
res54: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at makeRDD at <console>:27

scala> sc.makeRDD(data2,3)
res55: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at makeRDD at <console>:27

scala> res54.union(res55)
res56: org.apache.spark.rdd.RDD[Int] = UnionRDD[34] at union at <console>:33

scala> res56.collect
res57: Array[Int] = Array(1, 2, 3, 4, 5, 6)
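
A minimal sketch reusing res54, res55 and res56 from the transcript (each input was created with 3 partitions), showing that union simply concatenates the partitions without a shuffle:

println(res54.partitions.size)   // 3
println(res55.partitions.size)   // 3
println(res56.partitions.size)   // 6 = 3 + 3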


14-distinct operator: deduplication (similar to toSet on a Scala collection)

** distinct can take a partition count as an argument (it can repartition the result); see the sketch at the end of this section.

scala> val data1=Array(1,2,3,4,5,6,7,8,8,7,6,5,5,4,3,2,1)
data1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 8, 7, 6, 5, 5, 4, 3, 2, 1)

scala> sc.makeRDD(data1,3)
res59: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at makeRDD at <console>:27

scala> res59.distinct      // tab completion lists the two overloads
def distinct(): org.apache.spark.rdd.RDD[Int]
def distinct(numPartitions: Int)(implicit ord: Ordering[Int]): org.apache.spark.rdd.RDD[Int]

scala> res59.distinct()  // the argument is optional; if given, it specifies a new number of partitions
res60: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[39] at distinct at <console>:29

scala> res60.collect()
res61: Array[Int] = Array(6, 3, 4, 1, 7, 8, 5, 2)
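
A short sketch of the overload that takes a partition count, reusing res59 from above (created with 3 partitions); the val name `deduped` is just for illustration:

val deduped = res59.distinct(2)    // deduplicate and repartition the result to 2 partitions
println(deduped.partitions.size)   // 2
deduped.collect()                  // Array(1, 2, ..., 8) with duplicates removed, order not guaranteed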



posted @ 2022-01-21 23:21  不咬牙