Big Data Spark Real-Time Processing -- Structured Streaming (Part 2)
How Event-Time-Based Window Aggregation Works
- A 10-minute window that slides every 5 minutes
- Computation starts at 12:00; earlier windows are not shown
- Update times: 12:05, 12:10, 12:15, 12:20
- Windows (left-closed, right-open intervals): 12:00--12:10, 12:05--12:15, 12:10--12:20, 12:15--12:25
- Update 1 at 12:05: shows the accumulated counts for 12:00--12:10
- Update 2 at 12:10: shows the accumulated counts for 12:00--12:10 and 12:05--12:15
- Update 3 at 12:15: shows the accumulated counts for 12:00--12:10, 12:05--12:15, and 12:10--12:20
- Update 4 at 12:20: shows the accumulated counts for 12:00--12:10, 12:05--12:15, 12:10--12:20, and 12:15--12:25
- Summary: a single record is assigned to multiple windows (see the sketch after this list)
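- The sketch below is not part of the course code; it is plain Scala (no Spark needed) that illustrates the assignment rule: with a 10-minute window sliding every 5 minutes, each record falls into window-length / slide = 2 windows.

object WindowAssignment {

  // Given an event time in minutes since midnight, return every (start, end) sliding window
  // that contains it. Window length and slide interval are in minutes.
  def windowsFor(eventMinute: Int, length: Int = 10, slide: Int = 5): Seq[(Int, Int)] = {
    val lastStart = eventMinute - (eventMinute % slide)   // latest window start that still covers the event
    (lastStart to (eventMinute - length + 1) by -slide)   // walk back one slide at a time
      .map(start => (start, start + length))
      .reverse
  }

  def main(args: Array[String]): Unit = {
    // 12:07 (minute 727 of the day) falls into [12:00, 12:10) and [12:05, 12:15)
    windowsFor(12 * 60 + 7).foreach { case (s, e) =>
      println(f"${s / 60}%02d:${s % 60}%02d -- ${e / 60}%02d:${e % 60}%02d")
    }
  }
}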
Implementing Event-Time-Based Window Aggregation
- Start HDFS, YARN, ZooKeeper, the multi-broker Kafka service, and the Spark master; open port 9999
- Run the code in IDEA
- C:\Users\jieqiong\IdeaProjects\spark-log4j\log-sss\src\main\scala\com\imooc\spark\sss\SourceApp.scala
package com.imooc.spark.sss

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.functions.window

object SourceApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    // call method 4 -- eventTimeWindow
    eventTimeWindow(spark)
  }

  // method 4 -- event-time-based window aggregation
  def eventTimeWindow(spark: SparkSession): Unit = {
    import spark.implicits._

    spark.readStream.format("socket")
      .option("host", "spark000")
      .option("port", 9999)
      .load.as[String]
      .map(x => {
        val splits = x.split(",")
        (splits(0), splits(1))
      }).toDF("ts", "word")
      .groupBy(
        window($"ts", "10 minutes", "5 minutes"),
        $"word"
      ).count()
      .sort("window")
      .writeStream
      .format("console")
      .option("truncate", "false")
      .outputMode(OutputMode.Complete())
      .start()
      .awaitTermination()
  }
}
- On port 9999, run [hadoop@spark000 ~]$ nc -lk 9999 and type the test data
2021-10-01 12:02:00,cat
2021-10-01 12:02:00,dog
2021-10-01 12:03:00,dog
2021-10-01 12:03:00,dog
2021-10-01 12:07:00,owl
2021-10-01 12:07:00,cat
2021-10-01 12:11:00,dog
2021-10-01 12:13:00,owl
- Result (printed to the console)
- Time 1, 12:00: accumulated values for [11:55, 12:05)
- Time 2, 12:05: accumulated values for [12:00, 12:10)
- Time 3, 12:10: accumulated values for [12:05, 12:15)
- Time 4, 12:15: accumulated values for [12:10, 12:20)
Handling Late Data and Watermarks
- Out-of-order and late-arriving data
- Using the port-9999 inputs above, replace the 2021-10-01 12:11:00,dog record with 2021-10-01 12:04:00,dog, which is still received at 2021-10-01 12:11:00.
- The application should use the event time 2021-10-01 12:04:00, not the arrival time.
- Therefore the counts in the 12:00--12:10 window must be updated.
- Principle: Structured Streaming can keep the intermediate state of partial aggregates for a long period of time, so that late data can correctly update the aggregates of old windows.
- A watermark is the lateness threshold (its declaration is sketched below)
- State for a window ending at time T is kept, and late data may still update it, until: max event time seen by the engine - late threshold > T
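- A minimal sketch of declaring the threshold with withWatermark, reusing the socket source and ts/word schema from SourceApp above; the 10-minute threshold and the Update output mode are illustrative choices, not from the course code (imports are the same as in SourceApp.scala):

// Sketch: the eventTimeWindow pipeline with a watermark on the event-time column.
// Records that arrive more than 10 minutes behind the max event time seen so far
// no longer update their windows, so old window state can be dropped.
def eventTimeWindowWithWatermark(spark: SparkSession): Unit = {
  import spark.implicits._
  import org.apache.spark.sql.functions.window

  spark.readStream.format("socket")
    .option("host", "spark000")
    .option("port", 9999)
    .load.as[String]
    .map(x => {
      val splits = x.split(",")
      (splits(0), splits(1))
    }).toDF("ts", "word")
    .withColumn("ts", $"ts".cast("timestamp"))   // withWatermark requires a timestamp column
    .withWatermark("ts", "10 minutes")           // lateness threshold
    .groupBy(window($"ts", "10 minutes", "5 minutes"), $"word")
    .count()
    .writeStream
    .format("console")
    .option("truncate", "false")
    .outputMode(OutputMode.Update())             // state cleanup applies in Update/Append mode, not Complete
    .start()
    .awaitTermination()
}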
File Sink
- Start all services and open port 9999
- C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\SinkApp.scala
- The output is written to: C:\Users\jieqiong\IdeaProjects\log-time\out\part-00000-e01a8049-231e-4533-86c2-37beb7ac9901-c000.json
package com.imooc.spark.sss

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SinkApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    fileSink(spark)
  }

  def fileSink(spark: SparkSession): Unit = {
    import spark.implicits._

    spark.readStream
      .format("socket")
      .option("host", "spark000")
      .option("port", 9999)
      .load().as[String]
      .flatMap(_.split(","))
      .map(x => (x, "pk"))
      .toDF("word", "new_word")
      .writeStream
      .format("json")
      .option("path", "out")
      .option("checkpointLocation", "chk")
      .start()
      .awaitTermination()
  }
}
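- As a quick check (not part of the course code), the JSON files under the out directory can be read back with a plain batch query, using the same SparkSession; the word/new_word columns match the fileSink code above:

// Sketch: read the file-sink output back as a batch DataFrame for inspection.
def inspectFileSinkOutput(spark: SparkSession): Unit = {
  val result = spark.read.json("out")   // the directory written by fileSink
  result.printSchema()                  // expect: word, new_word
  result.show(false)
}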
Kafka Sink
- Shut down the producer from the previous section and start a consumer
- Data is received here
[hadoop@spark000 bin]$ pwd
/home/hadoop/app/kafka_2.12-2.5.0/bin
[hadoop@spark000 bin]$ ./kafka-console-producer.sh --broker-list spark000:9092 --topic ssskafkatopic
>^C
[hadoop@spark000 bin]$ ./kafka-console-consumer.sh --bootstrap-server spark000:9092 --topic ssskafkatopic
- Open port 9999
- Type test data on port 9999
- Code in IDEA
- C:\Users\jieqiong\IdeaProjects\spark-log4j\log-sss\src\main\scala\com\imooc\spark\sss\SinkApp.scala
- Result: data typed on port 9999 is received by the consumer.
package com.imooc.spark.sss

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SinkApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    kafkaSink(spark)
  }

  def kafkaSink(spark: SparkSession): Unit = {
    import spark.implicits._

    spark.readStream.format("socket")
      .option("host", "spark000")
      .option("port", 9999)
      .load().as[String]
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark000:9092")
      .option("topic", "ssskafkatopic")
      .option("checkpointLocation", "kafka-chk")
      .start()
      .awaitTermination()
  }
}
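- The Kafka sink writes whatever is in the string or binary value column; the socket source already produces a column named value, which is why .as[String] is enough here. As an optional check (not from the course code), the topic can also be read back from Spark with the Kafka source instead of the console consumer:

// Sketch: read ssskafkatopic back as a stream and print it to the console.
def readBackFromKafka(spark: SparkSession): Unit = {
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "spark000:9092")
    .option("subscribe", "ssskafkatopic")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")   // Kafka key/value come back as binary
    .writeStream
    .format("console")
    .option("truncate", "false")
    .start()
    .awaitTermination()
}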
Foreach Sink to MySQL
- Start all services (HDFS, YARN, ZooKeeper, multi-broker Kafka, Spark master) and open port 9999
- Type data on port 9999
- Create the table in MySQL
- Word count; the results are written into the database
[hadoop@spark000 ~]$ mysql -uroot -proot
mysql> show databases;
mysql> use jieqiong;
Database changed
mysql> create table t_wc(
    -> word varchar(20) not null,
    -> cnt int not null,
    -> primary key (word)
    -> );
Query OK, 0 rows affected (0.01 sec)
mysql> select * from t_wc;
Empty set (0.01 sec)
- Add the dependency
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
- IDEA
- C:\Users\jieqiong\IdeaProjects\spark-log4j\log-sss\src\main\scala\com\imooc\spark\sss\SinkApp.scala
package com.imooc.spark.sss

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SinkApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName(this.getClass.getName).getOrCreate()

    mysqlSink(spark)
  }

  def mysqlSink(spark: SparkSession): Unit = {
    import spark.implicits._

    spark.readStream
      .format("socket")
      .option("host", "spark000")
      .option("port", 9999)
      .load().as[String]
      .flatMap(_.split(","))
      .groupBy("value")
      .count()
      .repartition(2)
      .writeStream
      .outputMode(OutputMode.Update())
      .foreach(new ForeachWriter[Row] {

        var connection: Connection = _
        var pstmt: PreparedStatement = _
        var batchCount = 0

        override def process(value: Row): Unit = {
          println("processing data...")
          val word = value.getString(0)
          val cnt = value.getLong(1).toInt
          println(s"word:$word, cnt:$cnt...")

          // bind the upsert parameters: insert (word, cnt), update on duplicate key
          pstmt.setString(1, word)
          pstmt.setInt(2, cnt)
          pstmt.setString(3, word)
          pstmt.setInt(4, cnt)
          pstmt.addBatch()
          batchCount += 1

          // flush every 10 records
          if (batchCount >= 10) {
            pstmt.executeBatch()
            batchCount = 0
          }
        }

        override def close(errorOrNull: Throwable): Unit = {
          println("closing...")
          pstmt.executeBatch()   // flush any remaining records
          batchCount = 0
          connection.close()
        }

        override def open(partitionId: Long, epochId: Long): Boolean = {
          println(s"opening connection: $partitionId, $epochId")
          Class.forName("com.mysql.jdbc.Driver")
          connection = DriverManager.getConnection("jdbc:mysql://spark000:3306/jieqiong", "root", "root")
          val sql =
            """
              |insert into t_wc(word,cnt)
              |values(?,?)
              |on duplicate key update word=?,cnt=?;
              |""".stripMargin
          pstmt = connection.prepareStatement(sql)
          connection != null && !connection.isClosed && pstmt != null
        }
      })
      .start()
      .awaitTermination()
  }
}
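- As an aside, not from the course code: since Spark 2.4 the same upsert can also be written with foreachBatch, which hands every micro-batch to ordinary batch code. The JDBC URL, credentials, and t_wc table are the same assumptions as in mysqlSink above, and the explicitly typed function value avoids the foreachBatch overload ambiguity under Scala 2.12.

// Sketch: word-count upsert via foreachBatch instead of a ForeachWriter.
def mysqlSinkForeachBatch(spark: SparkSession): Unit = {
  import spark.implicits._

  val wordCounts = spark.readStream
    .format("socket")
    .option("host", "spark000")
    .option("port", 9999)
    .load().as[String]
    .flatMap(_.split(","))
    .groupBy("value")
    .count()

  val upsertBatch: (org.apache.spark.sql.DataFrame, Long) => Unit = (batch, batchId) => {
    val connection = DriverManager.getConnection("jdbc:mysql://spark000:3306/jieqiong", "root", "root")
    val pstmt = connection.prepareStatement(
      "insert into t_wc(word,cnt) values(?,?) on duplicate key update cnt=?")
    try {
      // the aggregated counts are small, so collecting them to the driver is acceptable here
      batch.collect().foreach { row =>
        pstmt.setString(1, row.getString(0))
        pstmt.setInt(2, row.getLong(1).toInt)
        pstmt.setInt(3, row.getLong(1).toInt)
        pstmt.addBatch()
      }
      pstmt.executeBatch()
    } finally {
      pstmt.close()
      connection.close()
    }
  }

  wordCounts.writeStream
    .outputMode(OutputMode.Update())
    .foreachBatch(upsertBatch)
    .start()
    .awaitTermination()
}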
Fault-Tolerance Semantics
- Structured Streaming can guarantee end-to-end exactly-once semantics under any failure condition, provided the sources are replayable and the sinks are idempotent (see the sketch below).
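- A minimal sketch of the three ingredients, reusing the ssskafkatopic topic and spark000 broker from the sections above (the output and checkpoint directories are illustrative names): the Kafka source can replay data from recorded offsets, the checkpoint directory tracks progress, and the file sink writes idempotently, so restarting the query resumes from the checkpoint without duplicating output.

// Sketch: replayable source + checkpointed progress + idempotent sink.
def exactlyOnceSketch(spark: SparkSession): Unit = {
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "spark000:9092")
    .option("subscribe", "ssskafkatopic")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("json")
    .option("path", "exactly-once-out")                // illustrative output directory
    .option("checkpointLocation", "exactly-once-chk")  // illustrative checkpoint directory
    .start()
    .awaitTermination()
}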
