Spark Structured Streaming 2

❤Limitations of DStream API

Batch Time Constraint

The batch interval is an application-level setting.

No support for event time

Event time is often more important than processing time.

Weak support for Dataset/Dataframe

No custom triggers

For example, session handling: when a session spans a long period of time, window-based processing cannot handle it either.

No update semantics

A new event may need to update state that was already processed earlier; DStream has no semantics for this.

Structured Streaming is a complete rethinking of stream processing in Spark. It replaces the earlier fast batch processing model with a true stream processing abstraction.

❤Structured streaming

Batch Time

After successfully running the example, one question immediately comes to mind: how frequently is the socket data processed? Put another way, what is the batch time and where have we specified it in the code?

In Structured Streaming, there is no batch time. Instead, we use the trigger abstraction to indicate the frequency of processing. Triggers can be specified in a variety of ways. One way is to use processing time, which is similar to the batch time of the earlier API.

By default, the trigger is ProcessingTime(0), where 0 means as soon as possible: whenever data arrives from the source, Spark tries to process it. This is very similar to the per-message semantics of other streaming systems like Storm.
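A minimal sketch of setting a processing time trigger explicitly (it reuses countDs from the query example below; Trigger.ProcessingTime assumes Spark 2.2 or later):

import org.apache.spark.sql.streaming.Trigger

// Run a micro-batch every 5 seconds instead of as soon as data arrives
val query = countDs.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()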

Structured Streaming defines sources and sinks using the data source API. This unification of the API makes it easy to move from the batch world to the streaming world.
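A minimal sketch of that unification, assuming a hypothetical directory of JSON events at /tmp/events with a known schema: only read versus readStream changes between the batch and streaming versions.

import org.apache.spark.sql.types._

// Hypothetical schema, for illustration only
val eventSchema = new StructType()
  .add("id", StringType)
  .add("value", DoubleType)

// Batch read of the directory
val batchDf = sparkSession.read.schema(eventSchema).json("/tmp/events")

// Streaming read of the same directory with the same API shape
val streamDf = sparkSession.readStream.schema(eventSchema).json("/tmp/events")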

Run using Query

Once we have the logic implemented, the next step is to connect to a sink and create a query. We will be using the console sink as in the last post.

import org.apache.spark.sql.streaming.OutputMode

val query = countDs
  .writeStream
  .format("console")
  .outputMode(OutputMode.Complete())

query.start().awaitTermination()

Output Mode

In the above code, we have used the complete output mode. In the last post, we used append mode. What do these signify?
In Structured Streaming, the output of stream processing is a dataframe or table. The output mode of the query signifies how this infinite output table is written to the sink, in our example the console.
There are three output modes:

  • Append - In this mode, only the records that arrived in the last trigger (batch) will be written to the sink. This is supported for simple transformations like select and filter. As these transformations don't change rows calculated for earlier batches, appending the new rows works fine.

  • Complete - In this mode, the complete result table will be written to the sink every time. It is typically used with aggregation queries, where the result keeps changing as new data arrives.

  • Update - In this mode, only the records that changed since the last trigger will be written to the sink. We will talk about this mode in later posts.

Depending on the query we use, we need to select the appropriate output mode. Choosing the wrong one results in a runtime exception like the one below.

org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;

State Management

DStream is stateless by default, while Structured Streaming is stateful by default.

Once you run the program, you can observe that whenever we enter new lines, it updates the global word count. So every time Spark processes the data, it gives the complete word count from the beginning of the program. This indicates Spark is keeping track of the state for us. So it's a stateful word count.

In Structured Streaming, all aggregations are stateful by default. All the complexity involved in keeping state across the stream and across failures is hidden from the user. The user just writes simple dataframe-based code and Spark figures out the intricacies of state management.

This is different from the earlier DStream API, where everything was stateless by default and it was the user's responsibility to handle state. Handling state manually was tedious and became one of the pain points of that API. So in Structured Streaming, Spark has made sure that most of the common work is done at the framework level itself, which makes writing stateful stream processing much simpler.
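A minimal sketch of the stateful word count described above, assuming a socket source on localhost:50050; countDs matches the name used in the query example earlier:

import sparkSession.implicits._

val lines = sparkSession.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 50050)
  .load()

// groupBy/count keeps a running, framework-managed count across triggers
val countDs = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()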

❤Stream Enrichment

With a unified dataset abstraction across batch and stream, we can seamlessly join stream data with batch data. This makes stream enrichment much simpler compared to other stream processing systems.

In the real world, stream data often contains only minimal data capturing the events happening in real time. For example, whenever a sale happens on an e-commerce website, the event contains the customer's id rather than the complete customer information. This is done to reduce the amount of data generated and transmitted per transaction on a high-traffic site.

Many stream processing operations need more data than is available in the stream itself. We often want to add data from static stores like files or databases to the stream data to make better decisions. In our example, if we have customer data in a static file, we want to look up the information for a given id in the stream to understand the customer better. (A similar scenario arises in IoT.)

This step of adding additional information to the stream data is known as stream enrichment. It is often one of the most important steps of many stream processing operations.

Unified Dataset Abstraction

In data enrichment, we often combine stream data with static data. So having both worlds, static and stream, speak the same abstraction makes life much easier for the developer. In the case of Spark, both the batch API and the Structured Streaming API share the common dataset abstraction. Since both share the same abstraction, we can easily join datasets across the boundary of batch and stream. This is one of the unique features of Spark streaming compared to other streaming systems.

Example

1. Reading Static Customer Data

case class Customer(customerId: String, customerName: String)

import sparkSession.implicits._

val customerDs = sparkSession.read
  .format("csv")
  .option("header", true)
  .load("src/main/resources/customers.csv")
  .as[Customer]

In the above code, we read customer data from a CSV file. We use the read method, which indicates that we are using the batch API. We convert the data to a custom type named Customer using a case class.
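For reference, a hypothetical customers.csv matching the schema above (the values are made up):

customerId,customerName
c001,John
c002,Priya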

2. Reading Sales Stream Data

val socketStreamDf = sparkSession.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 50050)
      .load()

In the above code, we use readStream to read the sales data from the socket.

3. Parsing Sales Data

case class Sales(
  transactionId: String,
  customerId:    String,
  itemId:        String,
  amountPaid:    Double)

val dataDf = socketStreamDf.as[String].flatMap(value => value.split(" "))
val salesDs = dataDf
  .as[String]
  .map(value => {
    val values = value.split(",")
    Sales(values(0), values(1), values(2), values(3).toDouble)
  })

The data from the socket is in string format. We need to convert it to a user-defined format before we can use it for data enrichment. So in the above code, we parse the text data as comma-separated values. Then, using the map method on the stream, we create the sales dataset.
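For reference, a hypothetical line typed into the socket console in the format this parser expects, i.e. space-separated records with comma-separated fields transactionId,customerId,itemId,amountPaid:

t1,c001,item9,250.0 t2,c002,item3,99.5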

4. Stream Enrichment using Joins

Now we have both the sales and customer data in the desired format, so we can use dataset joins to enrich the sales stream with customer information. In our example, that means adding the customer name to the sales stream.

val joinedDs = salesDs
      .join(customerDs, "customerId")

In the above code, we use the join API on datasets to achieve the enrichment. Here we can see how seamless it is to join stream data with batch data.

Question: what about join performance? If the static data is huge and the stream data is small, the cost of the join could be too high.
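One possible mitigation, sketched below under the assumption that the customer dataset fits in executor memory, is an explicit broadcast hint so that the stream side is not shuffled; when the static side is genuinely huge, a keyed lookup store or pre-partitioned data may be a better fit.

import org.apache.spark.sql.functions.broadcast

// Assumption: customerDs is small enough to broadcast to every executor
val joinedDs = salesDs.join(broadcast(customerDs), "customerId")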

❤Concept of Time in Streaming Application

Structured Streaming has rich time abstractions which make modeling different stream processing applications much easier than with the earlier API.

A streaming application is an always-running application. So in order to understand the behavior of the application over time, we need to take snapshots of the stream at various points. Normally these points are defined using a time component.

Time in a streaming application is a way to correlate different events in the stream to extract meaningful insights. For example, when we ask for the count of words in the last 10 seconds, we normally mean collecting all the records that arrived in that period of time and running a word count on them.

In the DStream API, Spark supported only one concept of time. The Structured Streaming API supports multiple different ones.

Time in Structured Streaming

When we say "last 10 seconds", what does it mean in Structured Streaming? It depends. It can be one of the following three:

  • Processing Time -- DStream

This concept of time is very familiar to most users. Here, time is tracked using a clock run by the processing engine. So "last 10 seconds" means the records that arrived for processing in the last 10 seconds. We only use the semantics of when the records arrived for processing. DStream supported this abstraction of time in its API.

Though processing time is a good time measure to have, it's not always enough. For example, if we want to calculate the state of sensors at a given point of time, we want to collect the events that happened in that time range. But if events arrive late at the processing system for various reasons, we may miss some of them, as the processing clock does not care about the actual time of the events. To address this, Structured Streaming supports another kind of time called event time.

  • Event Time

Event time is the time embedded in the data that is coming into the system. So here, 10 seconds means all the records generated in those 10 seconds at the source. These may arrive out of order for processing. This time is independent of the clock kept by the processing engine. Event time is extremely useful for handling late-arriving events.

  • Ingestion Time

Ingestion time is the time when events are ingested into the system. It sits between event time and processing time. Normally with processing time, each machine in the cluster assigns timestamps to track events, which can result in a slightly inconsistent view of the data because clocks may drift across the cluster. With ingestion time, the timestamp is assigned at ingestion, so all the machines in the cluster have the exact same view. It is useful for calculating results on data that arrives in order at the point of ingestion.

WaterMarks

Since Structured Streaming supports multiple concepts of time, how does it keep track of time? For processing time, a system clock can be used, but you cannot use the system clock for event time or ingestion time. So there has to be a generic mechanism to handle this.

Watermarks are the mechanism used by Structured Streaming to signify the passing of time in the stream. Watermarks are implemented using partial aggregations and the update output mode. They allow us both to track time and to handle late events. We will discuss more about watermarks in upcoming posts.

❤Processing Time Window

1. Reading Data From Socket

The code below reads the data from the socket.

val socketStreamDf = sparkSession.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 50050)
  .load()

2. Adding the Time Column

To define a window in Structured Streaming, we need a column of type Timestamp in the dataframe. As we are working with processing time, we use the current_timestamp() function of Spark SQL to add the processing time to our data.

import org.apache.spark.sql.functions.current_timestamp

val currentTimeDf = socketStreamDf.withColumn("processingTime", current_timestamp())

In the above code, we add a column named processingTime which captures the current time of the processing engine.

3. Extracting Words

The code below extracts the words from the socket stream and pairs each word with its timestamp.

import java.sql.Timestamp
import sparkSession.implicits._

val socketDs = currentTimeDf.as[(String, Timestamp)]
val wordsDs = socketDs
  .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
  .toDF("word", "processingTime")

4. Defining Window

Once we have the words, we define a tumbling window which aggregates data over 15 second intervals.

import org.apache.spark.sql.functions.window

val windowedCount = wordsDs
  .groupBy(window($"processingTime", "15 seconds"))
  .count()
  .orderBy("window")

In the above code, we define the window as part of groupBy, grouping the records that arrived within each 15 second interval.

The window API takes the parameters below:

  • Time Column - The name of the time column. In our example it's processingTime.
  • Window Time - The length of the window. In our example it's 15 seconds.
  • Slide Time - An optional parameter to specify the sliding interval. As we are implementing a tumbling window, we have skipped it (a sliding variant is sketched below).

Once we do groupBy, we do the count and sort by window to observe the results.
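As a sketch of the optional slide parameter (the 5 second slide here is illustrative, not part of the original example), the same count can be computed over a 15 second window that advances every 5 seconds:

// Sliding window: 15 second length, 5 second slide
val slidingCount = wordsDs
  .groupBy(window($"processingTime", "15 seconds", "5 seconds"))
  .count()
  .orderBy("window")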

Query

Once we have defined the window, we can set up the execution using a query.

val query = windowedCount.writeStream
                          .format("console") 
                          .option("truncate","false") 
                          .outputMode(OutputMode.Complete())
query.start().awaitTermination()

You can access the complete code on GitHub.
Output
When we run the example, we will observe the results below:

+---------------------------------------------+-----+
|window                                       |count|
+---------------------------------------------+-----+
|[2017-09-01 12:44:00.0,2017-09-01 12:44:15.0]|2    |
|[2017-09-01 12:44:15.0,2017-09-01 12:44:30.0]|8    |
+---------------------------------------------+-----+

As you can observe from the above output, we have a count for each fifteen-second interval. This confirms that our window function is working.

❤Ingestion Time

Reading Data From Socket with Ingestion Time

The code below reads the data from the socket. The includeTimestamp option asks the source to attach an ingestion timestamp to each record.

val socketStreamDf = sparkSession.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 50050)
  .option("includeTimestamp", true)
  .load()
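With includeTimestamp set to true, the socket source attaches a timestamp column recording when each line was received. A minimal sketch of windowing on that ingestion timestamp, mirroring the processing time example (the column name timestamp comes from the socket source; everything else is assumed):

import java.sql.Timestamp
import org.apache.spark.sql.functions.window
import sparkSession.implicits._

val socketDs = socketStreamDf.as[(String, Timestamp)] // columns: value, timestamp
val wordsDs = socketDs
  .flatMap(record => record._1.split(" ").map(word => (word, record._2)))
  .toDF("word", "ingestTime")

val windowedCount = wordsDs
  .groupBy(window($"ingestTime", "15 seconds"))
  .count()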

❤Event Time

1. Reading Data From Socket

The code below reads the data from the socket.

import sparkSession.implicits._

val socketStreamDs = sparkSession.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 50050)
  .load()
  .as[String]

2. Extracting Time from Stream

import java.sql.Timestamp

case class Stock(time: Timestamp, symbol: String, value: Double)

val stockDs = socketStreamDs.map(value => {
  val columns = value.split(",")
  Stock(new Timestamp(columns(0).toLong), columns(1), columns(2).toDouble)
})

In the above code, we declare a model which tracks the stock price at a given point of time. The first field is the timestamp of the stock reading, the second is the symbol, and the third is the value of the stock at that point in time. Stock price analysis normally depends on event time rather than processing time, as we want to correlate changes in stock prices with when they happened in the market rather than when they arrived at our processing engine.

Once we define the model, we convert the string network stream into the model we want to use. The time in the model signifies when the stock reading was taken.

3. Defining Window on Event Time

import org.apache.spark.sql.functions.window

val windowedCount = stockDs
  .groupBy(window($"time", "10 seconds"))
  .sum("value")

The above code defines a window which aggregates the stock value over each 10 second interval.

Passing of time

Whenever we say we want to calculate the max of a stock over the last 10 seconds, how does Spark know that all the records for those 10 seconds have arrived? In other words, how does Spark know about the passage of time at the source? We cannot use system clocks, because there will be delays between the two systems.
As we discussed in the previous post, watermarks are the solution to this problem. A watermark signifies the passage of time at the source, which helps Spark understand the flow of time.
By default, Spark uses the window time column to track the passing of time, with an allowance of infinite delay. In this model all windows are remembered as state, so that even if an event is delayed for a long time, Spark will still calculate the right value. But this creates an issue: as time goes on, the number of windows increases, and they use more and more resources. We can limit the state, i.e. the number of windows to be remembered, using a custom watermark. We will discuss this in the next post.

Running the Example

Enter the records below in the socket console. These are records for AAPL with timestamps.

First Event

The first record is for time Wed, 27 Apr 2016 11:34:22 GMT.
1461756862000,"aapl",500.0

Spark outputs the results below, which indicate the start of a window.

-------------------------------------------
Batch: 0
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:20.0,2016-04-27 17:04:30.0]|500.0     |
+---------------------------------------------+----------+

Event after 5 seconds

Now we send the next record, which is 5 seconds later. This signifies to Spark that 5 seconds have passed at the source, so Spark will update the same window. This event is for time Wed, 27 Apr 2016 11:34:27 GMT.
1461756867001,"aapl",600.0

The output of Spark will be as below. You can observe from the output that Spark is updating the same window.

-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:20.0,2016-04-27 17:04:30.0]|1100.0    |
+---------------------------------------------+----------+

Event after 11 seconds

Now we send another event, 6 seconds after the previous one. Spark now understands that 11 seconds have passed. This event is for Wed, 27 Apr 2016 11:34:32 GMT.
1461756872000,"aapl",400.0

Now Spark completes the first window and adds the above event to the next window.

-------------------------------------------
Batch: 2
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:20.0,2016-04-27 17:04:30.0]|1100.0    |
|[2016-04-27 17:04:30.0,2016-04-27 17:04:40.0]|400.0     |
+---------------------------------------------+----------+

Late Event (backfill)

Let's say we get an event which was delayed. The event is for Wed, 27 Apr 2016 11:34:27 GMT, which is 5 seconds before the last event.
1461756867001,"aapl",200.0

If you observe the Spark result now, you can see that the event has been added to the right window.

-------------------------------------------
Batch: 3
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:20.0,2016-04-27 17:04:30.0]|1300.0    |
|[2016-04-27 17:04:30.0,2016-04-27 17:04:40.0]|400.0     |
+---------------------------------------------+----------+

❤Watermark (in the example above all state is remembered; adding a watermark on top of that reduces the overhead of keeping historical state)

Unbounded State for Late Events

In the last example we discussed the event time abstraction.

By default, Spark remembers all the windows forever and waits for late events forever.

This may be fine for small data volumes, but as volume increases, keeping all that state around becomes problematic. As time goes on, the number of windows increases and resource usage shoots upward. So we need a mechanism which allows us to keep the state bounded. Watermarks are one such mechanism.

Watermarks

A watermark is a threshold which defines how long we wait for late events. By combining watermarks with automatic source time tracking (event time), Spark can automatically drop old state and keep it bounded.

When you enable watermarks, for a specific window ending at time T, Spark will maintain state and allow late data to update the state until

max event time seen by the engine - late threshold > T

In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped.
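To make the rule concrete with the numbers from the running example below: with a 500 millisecond threshold, once the engine has seen the event timestamped 11:34:32, the watermark stands at 11:34:31.5. The window covering 11:34:20 to 11:34:30 ends before that, so its state can be dropped, and the late event for 11:34:27 that arrives afterwards is discarded, exactly as the Late Event step below shows.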

Analysing Stock Data

In this example, we analyse the stock data using the event time abstraction as in the last post. As each stock event comes with a timestamp, we can use that time to define aggregations. We will use the socket stream as our source.

1. Specifying the Watermark

The code below defines a watermark on the stream.

val windowedCount = stockDs
  .withWatermark("time", "500 milliseconds")
  .groupBy(window($"time", "10 seconds"))
  .sum("value")

In the above example, whenever we create a window on event time, we can specify the watermark with withWatermark. Here we specify a watermark of 500 milliseconds, so Spark will wait that long for late events. Make sure to use the update output mode; complete mode does not honor watermarks.
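A minimal sketch of the corresponding query with the update output mode (the console sink and truncate option mirror the earlier examples):

import org.apache.spark.sql.streaming.OutputMode

val query = windowedCount.writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode(OutputMode.Update())
  .start()

query.awaitTermination()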

2. Running the Example

Enter the records below in the socket console. These are records for AAPL with timestamps. As we are using the update output mode, the result will only show changed windows.

  • First Event

The first record is for time Wed, 27 Apr 2016 11:34:22 GMT.
1461756862000,"aapl",500.0

Spark outputs the results below, which indicate the start of a window.

-------------------------------------------Batch: 0-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:20.0,2016-04-27 17:04:30.0]|500.0     |
+---------------------------------------------+----------+
  • Event after 5 seconds
    Now we send the next record, which is 5 seconds later. This signifies to Spark that 5 seconds have passed at the source, so Spark will update the same window. This event is for time Wed, 27 Apr 2016 11:34:27 GMT
    1461756867001,"aapl",600.0

The output of Spark will be as below. You can observe from the output that Spark is updating the same window.

-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:20.0,2016-04-27 17:04:30.0]|1100.0    |
+---------------------------------------------+----------+
  • Event after 11 seconds
    Now we send another event, 6 seconds after the previous one. Spark now understands that 11 seconds have passed. This event is for Wed, 27 Apr 2016 11:34:32 GMT
    1461756872000,"aapl",400.0

Now Spark completes the first window and adds the above event to the next window. Because we are in update mode, only the changed window appears in the output.

-------------------------------------------
Batch: 2
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
|[2016-04-27 17:04:30.0,2016-04-27 17:04:40.0]|400.0     |
+---------------------------------------------+----------+
  • Late Event
    Let's say we get an event which was delayed. The event is for Wed, 27 Apr 2016 11:34:27 GMT, which is 5 seconds before the last event.
    1461756867001,"aapl",200.0

If you observe the Spark result now, there are no updated windows. This signifies that the late event was dropped.

-------------------------------------------
Batch: 3
-------------------------------------------
+---------------------------------------------+----------+
|window                                       |sum(value)|
+---------------------------------------------+----------+
+---------------------------------------------+----------+