Converting Between DStream, RDD, and DataFrame; Why Spark Is Faster than MapReduce

Converting Between DStream, RDD, and DataFrame

DStream → RDD → DataFrame

package com.shujia.stream

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Duration, Durations, StreamingContext}

object Demo4DStoRDD {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[2]")
      .appName("stream")
      .config("spark.sql.shuffle.partitions", 1)
      .getOrCreate()

    //import implicit conversions -- needed for RDD → DataFrame (toDF)
    import spark.implicits._

    val sc: SparkContext = spark.sparkContext

    val ssc = new StreamingContext(sc, Durations.seconds(5))

    val linesDS: ReceiverInputDStream[String] = ssc.socketTextStream("master", 8888)

    /**
      * Under the hood, a DStream is a sequence of RDDs that is recomputed batch
      * after batch, so each batch can be handled as a plain RDD.
      *
      * foreachRDD behaves like a loop that fires once per batch interval (every
      * 5 seconds here); the RDD holds the data received in the current batch.
      */

    linesDS.foreachRDD((rdd: RDD[String]) => {

      /*val kvRDD: RDD[(String, Int)] = rdd.flatMap(_.split(",")).map((_, 1))
      val countRDD: RDD[(String, Int)] = kvRDD.reduceByKey(_ + _)
      countRDD.foreach(println)*/

      /**
        * An RDD can be converted to a DataFrame with toDF
        */

      val linesDF: DataFrame = rdd.toDF("line")

      //register the DataFrame as a temporary view so it can be queried with SQL
      linesDF.createOrReplaceTempView("lines")

      val countDF: DataFrame = spark.sql(
        """
          |select word,count(1) as c from (
          |select explode(split(line,',')) as word from lines
          |) as a group by word
          |
        """.stripMargin)

      countDF.show()

    })

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()

  }
}
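
To test, run nc -lk 8888 on the master host and type comma-separated words; every 5 seconds the current batch is counted and shown. The reverse direction, DataFrame → RDD, completes the picture: a DataFrame exposes its underlying RDD[Row]. A minimal sketch, assuming it is appended inside the foreachRDD above (countDF comes from the SQL query):

      import org.apache.spark.sql.Row

      //DataFrame → RDD: .rdd yields an RDD[Row]; fields are read back by name
      val rowRDD: RDD[Row] = countDF.rdd
      val pairRDD: RDD[(String, Long)] = rowRDD.map(row => {
        val word: String = row.getAs[String]("word")
        val c: Long = row.getAs[Long]("c") //count(1) returns a bigint, i.e. Long
        (word, c)
      })
      pairRDD.foreach(println)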

RDD → DStream

package com.shujia.stream

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object Demo5RDDtoDS {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[2]")
      .appName("stream")
      .config("spark.sql.shuffle.partitions", 1)
      .getOrCreate()

    import spark.implicits._

    val sc: SparkContext = spark.sparkContext

    val ssc = new StreamingContext(sc, Durations.seconds(5))

    val linesDS: ReceiverInputDStream[String] = ssc.socketTextStream("master", 8888)

    /**
      * transform: receives each batch's RDD, lets you process it with the RDD
      * API, and wraps the RDD you return back into a new DStream.
      *
      * Once the DStream has been turned into an RDD there are no stateful
      * operators, so counts cannot be accumulated across batches -- each batch
      * is counted independently (see the stateful sketch after this example).
      */

    val tfDS: DStream[(String, Int)] = linesDS.transform((rdd: RDD[String]) => {

      //classic word count on this batch's RDD
      val countRDD: RDD[(String, Int)] = rdd
        .flatMap(_.split(","))
        .map((_, 1))
        .reduceByKey(_ + _)

      //return the RDD; transform wraps it into a new DStream
      countRDD
    })

    tfDS.print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()

  }
}
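
As noted in the comment above, transform works batch by batch and cannot keep a running total. A global count across batches needs a stateful DStream operator such as updateStateByKey, which requires a checkpoint directory. A minimal sketch that could replace the transform call above (the checkpoint path is a placeholder):

    //updateStateByKey keeps per-key state across batches; checkpointing is required
    ssc.checkpoint("checkpoint") //placeholder path

    val stateDS: DStream[(String, Int)] = linesDS
      .flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey((batchValues: Seq[Int], state: Option[Int]) => {
        //add this batch's counts for the key to the accumulated total
        Some(batchValues.sum + state.getOrElse(0))
      })

    stateDS.print()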

Why Spark Is Faster than MapReduce

1. When the same RDD is used by multiple computations, it can be cached so its lineage is computed only once (see the sketch after this list).

2. Spark uses coarse-grained resource scheduling: executors are requested once at application startup and reused by every task. MapReduce uses fine-grained scheduling: containers are requested and released per task, which adds startup overhead.

3. Spark builds a DAG (directed acyclic graph) of the whole computation.

The intermediate results between two shuffles do not have to be written to disk, whereas chained MapReduce jobs must persist every intermediate result to HDFS.

The trade-off: Spark is less stable than MapReduce because it relies much more heavily on memory.
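
To illustrate point 1: without cache, every action replays the RDD's full lineage from the source; after cache, the second action reads from memory. A minimal self-contained sketch (the input path is a placeholder):

package com.shujia.stream

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "cacheDemo")

    val words: RDD[String] = sc
      .textFile("data/words.txt") //placeholder input path
      .flatMap(_.split(","))

    words.cache() //keep the partitions in memory after the first action

    println(words.count())            //computes the lineage and fills the cache
    println(words.distinct().count()) //served from the cache, no file re-read

    sc.stop()
  }
}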
