Big Data Spark Real-Time Processing -- Structured Streaming (Part 2)

Principles of Event-Time-Based Window Aggregation

  • A 10-minute window that slides every 5 minutes
  • Computation starts at 12:00; earlier windows are not shown
  • Result updates happen at 12:05, 12:10, 12:15, and 12:20
  • Windows (closed on the left, open on the right): 12:00--12:10, 12:05--12:15, 12:10--12:20, 12:15--12:25
  • Update 1, at 12:05: shows the running counts for 12:00--12:10
  • Update 2, at 12:10: shows the running counts for 12:00--12:10 and 12:05--12:15
  • Update 3, at 12:15: shows the running counts for 12:00--12:10, 12:05--12:15, and 12:10--12:20
  • Update 4, at 12:20: shows the running counts for 12:00--12:10, 12:05--12:15, 12:10--12:20, and 12:15--12:25
  • Summary: a single record is assigned to multiple overlapping windows (see the sketch below)
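  • A minimal illustrative sketch (not from the course code) of the assignment rule: with window length W and slide S, an event at time t falls into every aligned window [start, start + W) that covers t
// Illustrative sketch only: which sliding windows contain a given event time.
// Times are minutes relative to 12:00; window and slide lengths are in minutes.
def windowsFor(t: Int, window: Int = 10, slide: Int = 5): Seq[(Int, Int)] = {
  val lastStart = t - (t % slide)                      // latest aligned window start <= t
  (lastStart - window + slide to lastStart by slide)   // every start whose window still covers t
    .map(start => (start, start + window))
}

windowsFor(7)  // event at 12:07 -> Vector((0,10), (5,15)), i.e. [12:00,12:10) and [12:05,12:15)
windowsFor(2)  // event at 12:02 -> Vector((-5,5), (0,10)), i.e. [11:55,12:05) and [12:00,12:10)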

 

Implementing Event-Time-Based Window Aggregation

  • Start HDFS, YARN, ZooKeeper, the multi-broker Kafka service, and the Spark master; open port 9999
  • Run the following code in IDEA
  • C:\Users\jieqiong\IdeaProjects\spark-log4j\log-sss\src\main\scala\com\imooc\spark\sss\SourceApp.scala
package com.imooc.spark.sss

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.functions.window

object SourceApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    // call method 4: eventTimeWindow
    eventTimeWindow(spark)
  }

  // method 4: event-time window aggregation over a socket source
  def eventTimeWindow(spark:SparkSession): Unit = {
    import spark.implicits._
    spark.readStream.format("socket")
      .option("host","spark000")
      .option("port",9999)
      .load.as[String]
      .map(x => {
        val splits = x.split(",")
        (splits(0),splits(1))
      }).toDF("ts", "word")
      .groupBy(
        window($"ts", "10 minutes", "5 minutes"),
        $"word"
      ).count()
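      // sorting a streaming aggregation like this is only supported in Complete output mode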
      .sort("window")
      .writeStream
      .format("console")
      .option("truncate","false")
      .outputMode(OutputMode.Complete())
      .start()
      .awaitTermination()
  }
}
  • On port 9999, run [hadoop@spark000 ~]$ nc -lk 9999 and type in the test data below
2021-10-01 12:02:00,cat
2021-10-01 12:02:00,dog
2021-10-01 12:03:00,dog
2021-10-01 12:03:00,dog
2021-10-01 12:07:00,owl
2021-10-01 12:07:00,cat
2021-10-01 12:11:00,dog
2021-10-01 12:13:00,owl
  • Results (printed to the console; the expected final counts are summarized after this list)
  • Time 1: 12:00 -- accumulated counts for window [11:55--12:05)
  • Time 2: 12:05 -- accumulated counts for window [12:00--12:10)
  • Time 3: 12:10 -- accumulated counts for window [12:05--12:15)
  • Time 4: 12:15 -- accumulated counts for window [12:10--12:20)
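  • Expected final counts in Complete mode, assuming all eight sample lines above have been entered:
  • [11:55--12:05): cat=1, dog=3
  • [12:00--12:10): cat=2, dog=3, owl=1
  • [12:05--12:15): cat=1, dog=1, owl=2
  • [12:10--12:20): dog=1, owl=1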

 

Handling Late Data and Watermarking

  • Data can arrive out of order and late
  • Using the 9999 input above, replace the record 2021-10-01 12:11:00,dog with 2021-10-01 12:04:00,dog, but assume it is received at 2021-10-01 12:11:00.
  • The application should then use the event time 2021-10-01 12:04:00, not the arrival time.
  • Therefore the counts in the 12:00--12:10 window must be updated.
  • Principle: Structured Streaming keeps the intermediate state of partial aggregates for an extended period, so that late data can correctly update the aggregates of old windows.
  • A watermark is a lateness threshold
  • The engine keeps updating a window that ends at time T until: max event time seen by the engine - late threshold > T. After that, the window's state can be dropped (a code sketch follows).
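  • A minimal sketch (not part of the original course code; assumes the same imports as SourceApp.scala) of how a watermark could be added to the eventTimeWindow query. The ts string is cast to a timestamp because withWatermark requires a timestamp column, and Update mode is used so the watermark can actually drop old window state:
  def eventTimeWindowWithWatermark(spark: SparkSession): Unit = {
    import spark.implicits._
    spark.readStream.format("socket")
      .option("host", "spark000")
      .option("port", 9999)
      .load.as[String]
      .map(x => {
        val splits = x.split(",")
        (splits(0), splits(1))
      }).toDF("ts", "word")
      // withWatermark needs a timestamp column, so cast the string first
      .withColumn("ts", $"ts".cast("timestamp"))
      // keep a window's state only until max(event time seen) - 10 minutes passes its end
      .withWatermark("ts", "10 minutes")
      .groupBy(window($"ts", "10 minutes", "5 minutes"), $"word")
      .count()
      .writeStream
      .format("console")
      .option("truncate", "false")
      // Complete mode ignores the watermark; Update lets old state be cleaned up
      .outputMode(OutputMode.Update())
      .start()
      .awaitTermination()
  }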

 

File Sink

  • Start all services and open port 9999
  • C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\SinkApp.scala
  • Results are written to: C:\Users\jieqiong\IdeaProjects\log-time\out\part-00000-e01a8049-231e-4533-86c2-37beb7ac9901-c000.json (a quick batch read-back is sketched after the code)
package com.imooc.spark.sss

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SinkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    fileSink(spark)
  }

  def fileSink(spark:SparkSession): Unit = {
    import spark.implicits._
    spark.readStream
      .format("socket")
      .option("host","spark000")
      .option("port",9999)
      .load().as[String]
      .flatMap(_.split(","))
      .map(x => (x,"pk"))
      .toDF("word","new_word")
      .writeStream
      .format("json")
      .option("path","out")
      .option("checkpointLocation","chk")
      .start()
      .awaitTermination()
  }
}
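  • To check the output, the JSON directory can be read back with a plain batch query (a quick sketch; run it from the same working directory so the relative path "out" resolves):
spark.read.json("out").show(false)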

 

Kafka Sink

  • Shut down the producer from the previous section and start a consumer
  • The consumer will receive the data here
[hadoop@spark000 bin]$ pwd
/home/hadoop/app/kafka_2.12-2.5.0/bin
[hadoop@spark000 bin]$ ./kafka-console-producer.sh --broker-list spark000:9092 --topic ssskafkatopic
>^C[hadoop@spark000 bin]$ 
[hadoop@spark000 bin]$ 
[hadoop@spark000 bin]$ ./kafka-console-consumer.sh --bootstrap-server spark000:9092 --topic ssskafkatopic
  • Open port 9999
  • Enter test data on port 9999
  • IDEA code
  • C:\Users\jieqiong\IdeaProjects\spark-log4j\log-sss\src\main\scala\com\imooc\spark\sss\SinkApp.scala
  • Result: data typed on port 9999 is received by the Kafka consumer.
package com.imooc.spark.sss

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SinkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    kafkaSink(spark)
  }

  def kafkaSink(spark:SparkSession): Unit = {
    import spark.implicits._
    spark.readStream.format("socket")
      .option("host","spark000")
      .option("port",9999)
      .load().as[String]
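      // the socket source yields a single string column named "value",
      // which is exactly the payload column the Kafka sink writes as the message value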
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark000:9092")
      .option("topic", "ssskafkatopic")
      .option("checkpointLocation","kafka-chk")
      .start()
      .awaitTermination()
  }
}

 

Foreach Sink to MySQL

  • Start all services (HDFS, YARN, ZooKeeper, multi-broker Kafka, Spark master) and open port 9999
  • Enter data on port 9999
  • Create the table in MySQL
  • Word count: write the computed results into the database (a verification step is noted after the code)
[hadoop@spark000 ~]$ mysql -uroot -proot
mysql> show databases;
mysql> use jieqiong;
Database changed
mysql> create table t_wc(
    -> word varchar(20) not null,
    -> cnt int not null,
    -> primary key (word)
    -> );
Query OK, 0 rows affected (0.01 sec)

mysql> select * from t_wc;
Empty set (0.01 sec)
  • Add the MySQL JDBC driver dependency
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.47</version>
</dependency>
  • IDEA
  • C:\Users\jieqiong\IdeaProjects\spark-log4j\log-sss\src\main\scala\com\imooc\spark\sss\SinkApp.scala
package com.imooc.spark.sss

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SinkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    mysqlSink(spark)
  }

  def mysqlSink(spark:SparkSession): Unit = {
    import spark.implicits._
    spark.readStream
      .format("socket")
      .option("host","spark000")
      .option("port",9999)
      .load().as[String]
      .flatMap(_.split(","))
      .groupBy("value")
      .count()
      .repartition(2)
      .writeStream
      .outputMode(OutputMode.Update())
      .foreach(new ForeachWriter[Row] {
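        // ForeachWriter lifecycle: open() runs once per partition per epoch,
        // process() once per row, close() when that partition's epoch finishes
        // (repartition(2) above means two writer instances per trigger)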
        var connection:Connection = _
        var pstmt:PreparedStatement = _
        var batchCount = 0

        override def process(value: Row): Unit = {
          println("处理数据...")

          val word = value.getString(0)
          val cnt = value.getLong(1).toInt

          println(s"word:$word, cnt:$cnt...")

          pstmt.setString(1, word)
          pstmt.setInt(2, cnt)
          pstmt.setString(3, word)
          pstmt.setInt(4, cnt)
          pstmt.addBatch()

          batchCount += 1
          if(batchCount >= 10) {
            pstmt.executeBatch()
            batchCount = 0
          }

        }

        override def close(errorOrNull: Throwable): Unit = {
          println("关闭...")
          pstmt.executeBatch()
          batchCount = 0
          connection.close()
        }

        override def open(partitionId: Long, epochId: Long): Boolean = {
          println(s"打开connection: $partitionId, $epochId")
          Class.forName("com.mysql.jdbc.Driver")
          connection = DriverManager.getConnection("jdbc:mysql://spark000:3306/jieqiong","root","root")

          val sql =
            """
              |insert into t_wc(word,cnt)
              |values(?,?)
              |on duplicate key update word=?,cnt=?;
              |
              """.stripMargin

          pstmt = connection.prepareStatement(sql)

          connection!=null && !connection.isClosed && pstmt != null
        }
      })

      .start()
      .awaitTermination()
  }
}
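  • After typing a few comma-separated words on port 9999, rerun select * from t_wc; in MySQL -- the word counts should appear and keep updating as new data arrives.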

 

Fault-Tolerance Semantics

  • Structured Streaming is designed to provide end-to-end exactly-once semantics under failures, relying on replayable sources (such as Kafka), checkpointing with write-ahead logs (the checkpointLocation option used in the sinks above), and idempotent sinks.

 

posted @ 2022-03-22 17:04 酱汁怪兽