Creating RDDs
Several ways to create an RDD
1. parallelize: a partition count can be specified
scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> rdd1.collect
res14: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd1.getNumPartitions
res15: Int = 2

scala> val rdd1 = sc.parallelize(1 to 10, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24

scala> rdd1.collect
res16: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd1.getNumPartitions
res17: Int = 4
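The same call works outside the shell in a standalone application; a minimal sketch, assuming a local test run (the app name and master URL are placeholder values):

import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeDemo {
  def main(args: Array[String]): Unit = {
    // "ParallelizeDemo" and local[*] are placeholders for a local test run
    val conf = new SparkConf().setAppName("ParallelizeDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // With no second argument, parallelize falls back to the default parallelism
    val rdd = sc.parallelize(1 to 10, 4)
    println(rdd.getNumPartitions) // prints 4

    sc.stop()
  }
}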
2. range: the interval is half-open (start inclusive, end exclusive), and the step defaults to 1
scala> val rdd1 = sc.range(1,11)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[17] at range at <console>:24

scala> rdd1.collect
res20: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd1 = sc.range(1,11,2)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[1] at range at <console>:24

scala> rdd1.collect
res0: Array[Long] = Array(1, 3, 5, 7, 9)
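range also accepts a partition count as an optional fourth argument; a short sketch showing the half-open interval and the full call:

// sc.range(1, 11) covers 1 through 10: the end bound 11 is excluded
val r1 = sc.range(1, 11)

// start = 1, end = 11 (excluded), step = 2, spread over 3 partitions
val r2 = sc.range(1, 11, 2, 3)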
3. makeRDD: with an explicit partition count it is identical to parallelize (it simply delegates to it). It also has an overload that takes per-element location preferences; per the official Scaladoc, that variant creates a new partition for each collection item, which helps with locality-aware tuning later on (a sketch follows the transcript below)
scala> val lst = List(1,3,4,5,6,7,9)
lst: List[Int] = List(1, 3, 4, 5, 6, 7, 9)

scala> val rdd1 = sc.parallelize(lst)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26

scala> rdd1.collect
res3: Array[Int] = Array(1, 3, 4, 5, 6, 7, 9)

scala> rdd1.getNumPartitions
res4: Int = 2

scala> val rdd1 = sc.makeRDD(lst)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:26

scala> rdd1.collect
res5: Array[Int] = Array(1, 3, 4, 5, 6, 7, 9)

scala> rdd1.getNumPartitions
res6: Int = 2
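The locality-aware overload mentioned above takes (value, preferred hosts) pairs; a sketch with hypothetical hostnames:

// Each (value, hosts) pair becomes its own partition, preferring the listed
// nodes; "host1" and "host2" are placeholder hostnames
val located = sc.makeRDD(Seq(
  (1, Seq("host1")),
  (2, Seq("host2"))
))
located.getNumPartitions // 2: one partition per pair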
4. Loading data from the local file system
scala> val rdd1 = sc.textFile("file:///data/hello.txt") rdd1: org.apache.spark.rdd.RDD[String] = hdfs:///data/hello.txt MapPartitionsRDD[9] at textFile at <console>:24 scala> rdd1.collect res9: Array[String] = Array(hello spark, this is a local file, hello zhangcong)
5. Loading data from a distributed file system, HDFS in this example
scala> val rdd1 = sc.textFile("hdfs:///data/hello.txt") rdd1: org.apache.spark.rdd.RDD[String] = hdfs:///data/hello.txt MapPartitionsRDD[9] at textFile at <console>:24 scala> rdd1.collect res9: Array[String] = Array(hello spark, this is a hdfs file, hello zhangcong)
6. Creating an RDD from an existing RDD: this is essentially transforming one RDD into another; see the section on RDD transformations for details
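For illustration, a minimal sketch of deriving a new RDD from an existing one through a transformation:

val rdd1 = sc.parallelize(1 to 10)

// map is a transformation: it returns a new RDD and leaves rdd1 unchanged
val rdd2 = rdd1.map(_ * 2)

// Nothing runs until an action such as collect is invoked
rdd2.collect // Array(2, 4, 6, ..., 20)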