Notes on a few Spark Streaming operators
1. mapWithState
val spark = SparkSession.builder()
.master("local[*]")
.appName("MapWithState")
.getOrCreate()
val ssc: StreamingContext = new StreamingContext(spark.sparkContext, Seconds(3))
ssc.checkpoint("/data/spark/logs")
val dStream = ssc.socketTextStream("linux01", 9999).flatMap(_.toLowerCase.split(",")).map((_, 1))
//If no initial values are needed, drop initialRDD and the .initialState(...) call after StateSpec.function
//Only keys updated in this batch are emitted; unchanged keys are not shown again until their next update
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
dStream.mapWithState(StateSpec.function(mapFunction).initialState(initialRDD))
.transform(rdd => rdd.sortByKey()).print()
// If a timeout is set and a key receives no update within that period, the key's state expires; the next update for that key starts counting from scratch
// dStream.mapWithState(StateSpec.function(func).initialState(initialRDD).timeout(Seconds(30))).print()
ssc.start()
ssc.awaitTermination()
/**
* word: String, one: Option[Int], state: State[Int]
* The function takes three parameters:
* 1st parameter: word: String -- the key
* 2nd parameter: one: Option[Int] -- the value in this batch
* 3rd parameter: state: State[Int] -- the state (the history, i.e. the previous result)
*/
private val mapFunction = (word: String, one: Option[Int], state: State[Int]) => {
val count = one.getOrElse(0) + state.getOption().getOrElse(0)
state.update(count)
(word, count)
}
/**
* When the timeout method is used on the dStream, the custom function must check isTimingOut(), otherwise an exception is thrown
*/
// private val func = (word: String, option: Option[Int], state: State[Int]) => {
//   if (state.isTimingOut()) {
//     println(word + " is timing out.")
//     (word, state.getOption().getOrElse(0)) // emit the last known count so both branches return the same (String, Int) type
//   } else {
//     val sum = option.getOrElse(0) + state.getOption().getOrElse(0)
//     state.update(sum)
//     (word, sum)
//   }
// }
2. updateStateByKey
Problems with updateStateByKey:
Keys whose values did not change are still emitted.
Every key shows up in every batch, so when saving updates to redis, values that did not change get written again as well.
There are seven overloads:
1> Pass only the update function
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S]
): DStream[(K, S)] = ssc.withScope {
updateStateByKey(updateFunc, defaultPartitioner())
}
updateStateByKey() takes one parameter, updateFunc, and returns a DStream[(K, S)]; updateFunc itself takes two parameters, Seq[V] and Option[S], and returns an Option[S].
Seq[V] holds all the values that arrived for a key in this batch (V is the value type), and Option[S] is the intermediate state kept via the checkpoint (S is the type of the value in the (key, value) state).
dStream.updateStateByKey((newValue: Seq[Int], state: Option[Int]) => {
val value = newValue.sum + state.getOrElse(0)
Some(value)
})
2> Pass the update function and the number of partitions
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param numPartitions Number of partitions of each RDD in the new DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
numPartitions: Int
): DStream[(K, S)] = ssc.withScope {
updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
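A minimal call sketch for this overload, reusing the running-sum function and assuming the same (word, 1) dStream as above; the partition count 4 is arbitrary:
// Same running sum as before, but asking for 4 partitions for the state RDDs
dStream.updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)),
  4)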
3> Pass the update function and a custom partitioner
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner
): DStream[(K, S)] = ssc.withScope {
val cleanedUpdateF = sparkContext.clean(updateFunc)
val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
}
updateStateByKey(newUpdateFunc, partitioner, true)
}
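A similar sketch with an explicit partitioner; HashPartitioner and the count 4 are just assumptions here, any Partitioner would do:
import org.apache.spark.HashPartitioner
dStream.updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)),
  new HashPartitioner(4))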
4> Pass the complete state-update function
The overloads above take a partial update function that only handles a single key; internally they are wrapped into a complete state-update function anyway.
Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]: the input is an iterator whose elements carry, per key, (1) the key, (2) the values received for that key in this batch, and (3) the current state; the output is the key mapped to its new value.
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
partitioner: Partitioner,
rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
val cleanedFunc = ssc.sc.clean(updateFunc)
val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
cleanedFunc(it)
}
new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
}
Example:
def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount = newValues.sum + runningCount.getOrElse(0)
Some(newCount)
}
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
The input looks like ((key, List(v, v, v)), (key2, List(v2, v2, v2)), ...). t._2 is the list of values for the key in this batch, and t._3 is the intermediate state kept via the checkpoint.
Iterating returns the accumulated sum, and the map puts the key back, yielding the final (K, V). The result appears to be the same as with the first overload; see the call sketch below.
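Wiring newUpdateFunc from the snippet above into this overload could look roughly like this; the HashPartitioner and the trailing rememberPartitioner flag are assumptions:
import org.apache.spark.HashPartitioner
// true = remember the partitioner in the generated state RDDs
dStream.updateStateByKey(newUpdateFunc, new HashPartitioner(4), true)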
5> Add an initial state
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner,
initialRDD: RDD[(K, S)]
): DStream[(K, S)] = ssc.withScope {
val cleanedUpdateF = sparkContext.clean(updateFunc)
val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
}
updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
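A sketch of seeding the state with this overload; the initial counts and the partitioner are made up for illustration:
import org.apache.spark.HashPartitioner
// Seed counts are purely illustrative
val initialCounts = ssc.sparkContext.parallelize(List(("hello", 100), ("world", 50)))
dStream.updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)),
  new HashPartitioner(4),
  initialCounts)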
6> Whether to remember the current partitioner, plus an initial state
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
partitioner: Partitioner,
rememberPartitioner: Boolean,
initialRDD: RDD[(K, S)]): DStream[(K, S)] = ssc.withScope {
val cleanedFunc = ssc.sc.clean(updateFunc)
val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
cleanedFunc(it)
}
new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, Some(initialRDD))
}
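The complete update function from overload 4 can likewise be combined with an initial state; the names and values below are placeholders:
import org.apache.spark.HashPartitioner
val seed = ssc.sparkContext.parallelize(List(("hello", 100)))
// true = remember the partitioner; seed supplies the initial state per key
dStream.updateStateByKey(newUpdateFunc, new HashPartitioner(4), true, seed)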
7> If the update function returns None, the key-value pair is dropped; this overload also passes the batch Time into the update function
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](updateFunc: (Time, K, Seq[V], Option[S]) => Option[S],
partitioner: Partitioner,
rememberPartitioner: Boolean,
initialRDD: Option[RDD[(K, S)]] = None): DStream[(K, S)] = ssc.withScope {
val cleanedFunc = ssc.sc.clean(updateFunc)
val newUpdateFunc = (time: Time, iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => cleanedFunc(time, t._1, t._2, t._3).map(s => (t._1, s)))
}
new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, initialRDD)
}
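A hedged sketch of this most general overload, which also hands the batch Time to the update function; the debug print is purely illustrative:
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Time
dStream.updateStateByKey(
  (time: Time, word: String, newValues: Seq[Int], state: Option[Int]) => {
    // time is the end timestamp of the current batch; printing it is just for illustration
    println(s"batch $time -> $word")
    Some(newValues.sum + state.getOrElse(0))
  },
  new HashPartitioner(4),
  true) // rememberPartitioner; initialRDD keeps its default of None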
A simple example:
val wordCountDS = wordDS.updateStateByKey((values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum //number of times this word appeared in this batch
val count = state.getOrElse(0) //previous result, i.e. the intermediate state
Some(currentCount + count)
})
// val wordCountDS = wordDS.updateStateByKey(func)
wordCountDS.print()
/**
* The anonymous function can also be pulled out on its own
* The first parameter is the list of values received for the key being updated
* The second parameter is the intermediate state
*/
// private val func = (value: Seq[Int], state: Option[Int]) => {
// val sum = value.sum + state.getOrElse(0)
// Some(sum)
// }
3. Window functions
//Only the window duration is specified
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)
//Window duration plus slide duration
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
new WindowedDStream(this, windowDuration, slideDuration)
}
//Reduce function, window duration, slide duration
def reduceByWindow(
reduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
): DStream[T] = ssc.withScope {
this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}
//Same as above, but incremental (with an inverse reduce function), which is more efficient
def reduceByWindow(
reduceFunc: (T, T) => T,
invReduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
): DStream[T] = ssc.withScope {
this.map((1, _))
.reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
.map(_._2)
}
//Counts the elements in each window; internally it calls the efficient incremental reduceByWindow
def countByWindow(
windowDuration: Duration,
slideDuration: Duration): DStream[Long] = ssc.withScope {
this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
//Similar, but counts occurrences per distinct value; internally calls reduceByKeyAndWindow
def countByValueAndWindow(
windowDuration: Duration,
slideDuration: Duration,
numPartitions: Int = ssc.sc.defaultParallelism)
(implicit ord: Ordering[T] = null)
: DStream[(T, Long)] = ssc.withScope {
this.map((_, 1L)).reduceByKeyAndWindow(
(x: Long, y: Long) => x + y,
(x: Long, y: Long) => x - y,
windowDuration,
slideDuration,
numPartitions,
(x: (T, Long)) => x._2 != 0L
)
}
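A usage sketch for the three functions above, assuming the socket stream of lines used earlier; both durations must be multiples of the batch interval, and the incremental forms need checkpointing to be enabled:
val lines = ssc.socketTextStream("linux01", 9999)
// Longest line seen over the last 6 seconds, recomputed every 3 seconds
lines.reduceByWindow((a: String, b: String) => if (a.length >= b.length) a else b, Seconds(6), Seconds(3)).print()
// How many lines fell into the window; built on the incremental reduceByWindow, so ssc.checkpoint(...) must be set
lines.countByWindow(Seconds(6), Seconds(3)).print()
// How many times each distinct line appeared in the window
lines.countByValueAndWindow(Seconds(6), Seconds(3)).print()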
//Group by key over a window
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])] = ssc.withScope {
groupByKeyAndWindow(windowDuration, self.slideDuration, defaultPartitioner())
}
//Group by key with explicit window and slide durations
def groupByKeyAndWindow(windowDuration: Duration, slideDuration: Duration)
: DStream[(K, Iterable[V])] = ssc.withScope {
groupByKeyAndWindow(windowDuration, slideDuration, defaultPartitioner())
}
//Same as above, plus a partition count (default hash partitioning)
def groupByKeyAndWindow(
windowDuration: Duration,
slideDuration: Duration,
numPartitions: Int
): DStream[(K, Iterable[V])] = ssc.withScope {
groupByKeyAndWindow(windowDuration, slideDuration, defaultPartitioner(numPartitions))
}
//Same as above, but with a custom partitioner
def groupByKeyAndWindow(
windowDuration: Duration,
slideDuration: Duration,
partitioner: Partitioner
): DStream[(K, Iterable[V])] = ssc.withScope {
val createCombiner = (v: Iterable[V]) => new ArrayBuffer[V] ++= v
val mergeValue = (buf: ArrayBuffer[V], v: Iterable[V]) => buf ++= v
val mergeCombiner = (buf1: ArrayBuffer[V], buf2: ArrayBuffer[V]) => buf1 ++= buf2
self.groupByKey(partitioner)
.window(windowDuration, slideDuration)
.combineByKey[ArrayBuffer[V]](createCombiner, mergeValue, mergeCombiner, partitioner)
.asInstanceOf[DStream[(K, Iterable[V])]]
}
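A sketch of grouping within a window, assuming a (word, 1) pair stream built the same way as in the earlier examples:
val pairs = ssc.socketTextStream("linux01", 9999).flatMap(_.split(",")).map((_, 1))
pairs.groupByKeyAndWindow(Seconds(6), Seconds(3))
  .map { case (word, ones) => (word, ones.size) } // e.g. collapse the grouped 1s back into a count
  .print()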
//Reduce function and window duration
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}
//Reduce function, window duration, and slide duration
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}
//Same as above, plus a partition count
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration,
numPartitions: Int
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
defaultPartitioner(numPartitions))
}
//Same as above, with a custom partitioner
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration,
partitioner: Partitioner
): DStream[(K, V)] = ssc.withScope {
self.reduceByKey(reduceFunc, partitioner)
.window(windowDuration, slideDuration)
.reduceByKey(reduceFunc, partitioner)
}
//Incremental computation. E.g. if the first window covers batches 1, 2, 3 and the next covers 2, 3, 4, the new result is (1 + 2 + 3) - 1 + 4: subtract what slid out, add what slid in, keep the overlap. This is more efficient.
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
invReduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration = self.slideDuration,
numPartitions: Int = ssc.sc.defaultParallelism,
filterFunc: ((K, V)) => Boolean = null
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(
reduceFunc, invReduceFunc, windowDuration,
slideDuration, defaultPartitioner(numPartitions), filterFunc
)
}
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
invReduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration,
partitioner: Partitioner,
filterFunc: ((K, V)) => Boolean
): DStream[(K, V)] = ssc.withScope {
val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
new ReducedWindowedDStream[K, V](
self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
windowDuration, slideDuration, partitioner
)
}
A simple example:
/**
* reduceByKeyAndWindow here takes three parameters: the function to apply, the window duration, and the slide duration
* The plain window function takes only the last two parameters, or just the window duration
* See each overload's parameter list for the details
*/
val value = dStream.flatMap(_.split(",")).map((_, 1))
// value.window(Seconds(6), Seconds(4)).print()
value.reduceByKeyAndWindow((value1: Int, value2: Int) => value1 + value2, Seconds(6), Seconds(4)).print()
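For completeness, a sketch of the incremental variant explained above; it reuses value from the snippet just shown and needs checkpointing enabled (the filter argument is left at its default):
// Subtract what slid out of the window and add what slid in, instead of recomputing the whole window
// Requires checkpointing, e.g. ssc.checkpoint("/data/spark/logs")
value.reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(6), Seconds(4)).print()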
The above are my learning notes. I have forgotten whose article I originally learned this from; once again, my respects to those who share their knowledge so generously.