Notes on a few Spark Streaming operators
1. mapWithState
val spark = SparkSession.builder()
.master("local[*]")
.appName("MapWithState")
.getOrCreate()
val ssc: StreamingContext = new StreamingContext(spark.sparkContext, Seconds(3))
ssc.checkpoint("/data/spark/logs")
val dStream = ssc.socketTextStream("linux01", 9999).flatMap(_.toLowerCase.split(",")).map((_, 1))
//If no initial values are needed, drop initialRDD and the .initialState(...) call after StateSpec.function
//Only keys updated in this batch are emitted; unchanged keys are not shown again until their next update
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
dStream.mapWithState(StateSpec.function(mapFunction).initialState(initialRDD))
.transform(rdd => rdd.sortByKey()).print()
// If a timeout is set and a key receives no update within that period, the key's state expires; the next update for that key starts counting from scratch
// dStream.mapWithState(StateSpec.function(func).initialState(initialRDD).timeout(Seconds(30))).print()
ssc.start()
ssc.awaitTermination()
/**
* word: String, one: Option[Int], state: State[Int]
* The function takes three parameters:
* 1st parameter: word: String -- the key
* 2nd parameter: one: Option[Int] -- the value in this batch
* 3rd parameter: state: State[Int] -- the state (the history, i.e. the previous result)
*/
private val mapFunction = (word: String, one: Option[Int], state: State[Int]) => {
val count = one.getOrElse(0) + state.getOption().getOrElse(0)
state.update(count)
(word, count)
}
/**
* When the timeout method is used on the dStream, the custom function must check isTimingOut(), otherwise an exception is thrown
*/
// private val func = (word: String, option: Option[Int], state: State[Int]) => {
//   if (state.isTimingOut()) {
//     println(word + " is timing out.")
//     (word, state.getOption().getOrElse(0)) // emit the last known count so both branches return the same (String, Int) type
//   } else {
//     val sum = option.getOrElse(0) + state.getOption().getOrElse(0)
//     state.update(sum)
//     (word, sum)
//   }
// }
2. updateStateByKey
Problems with updateStateByKey:
Keys whose values did not change are still emitted.
Every key shows up in every batch, so when saving updates to redis, values that did not change get written again as well.
There are seven overloads:
1> Pass only the update function
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S]
): DStream[(K, S)] = ssc.withScope {
updateStateByKey(updateFunc, defaultPartitioner())
}
updateStateByKey() takes one parameter, updateFunc, and returns a DStream[(K, S)]; updateFunc itself takes two parameters, Seq[V] and Option[S], and returns an Option[S].
Seq[V] holds all the values that arrived for a key in this batch (V is the value type), and Option[S] is the intermediate state kept via the checkpoint (S is the type of the value in the (key, value) state).
dStream.updateStateByKey((newValue: Seq[Int], state: Option[Int]) => {
val value = newValue.sum + state.getOrElse(0)
Some(value)
})
2> Pass the update function and the number of partitions
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param numPartitions Number of partitions of each RDD in the new DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
numPartitions: Int
): DStream[(K, S)] = ssc.withScope {
updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
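A minimal call sketch for this overload, reusing the running-sum function and assuming the same (word, 1) dStream as above; the partition count 4 is arbitrary:
// Same running sum as before, but asking for 4 partitions for the state RDDs
dStream.updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)),
  4)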
3> Pass the update function and a custom partitioner
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner
): DStream[(K, S)] = ssc.withScope {
val cleanedUpdateF = sparkContext.clean(updateFunc)
val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
}
updateStateByKey(newUpdateFunc, partitioner, true)
}
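A similar sketch with an explicit partitioner; HashPartitioner and the count 4 are just assumptions here, any Partitioner would do:
import org.apache.spark.HashPartitioner
dStream.updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)),
  new HashPartitioner(4))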
4> Pass the complete state-update function
The overloads above take a partial update function that only handles a single key; internally they are wrapped into a complete state-update function anyway.
Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]: the input is an iterator whose elements carry, per key, (1) the key, (2) the values received for that key in this batch, and (3) the current state; the output is the key mapped to its new value.
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
partitioner: Partitioner,
rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
val cleanedFunc = ssc.sc.clean(updateFunc)
val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
cleanedFunc(it)
}
new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
}
Example:
def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount = newValues.sum + runningCount.getOrElse(0)
Some(newCount)
}
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
The input looks like ((key, List(v, v, v)), (key2, List(v2, v2, v2)), ...). t._2 is the list of values for the key in this batch, and t._3 is the intermediate state kept via the checkpoint.
Iterating returns the accumulated sum, and the map puts the key back, yielding the final (K, V). The result appears to be the same as with the first overload; see the call sketch below.
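Wiring newUpdateFunc from the snippet above into this overload could look roughly like this; the HashPartitioner and the trailing rememberPartitioner flag are assumptions:
import org.apache.spark.HashPartitioner
// true = remember the partitioner in the generated state RDDs
dStream.updateStateByKey(newUpdateFunc, new HashPartitioner(4), true)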
5> Add an initial state
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner,
initialRDD: RDD[(K, S)]
): DStream[(K, S)] = ssc.withScope {
val cleanedUpdateF = sparkContext.clean(updateFunc)
val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
}
updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
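A sketch of seeding the state with this overload; the initial counts and the partitioner are made up for illustration:
import org.apache.spark.HashPartitioner
// Seed counts are purely illustrative
val initialCounts = ssc.sparkContext.parallelize(List(("hello", 100), ("world", 50)))
dStream.updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)),
  new HashPartitioner(4),
  initialCounts)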
6> Whether to remember the current partitioner, plus an initial state
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
partitioner: Partitioner,
rememberPartitioner: Boolean,
initialRDD: RDD[(K, S)]): DStream[(K, S)] = ssc.withScope {
val cleanedFunc = ssc.sc.clean(updateFunc)
val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
cleanedFunc(it)
}
new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, Some(initialRDD))
}
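The complete update function from overload 4 can likewise be combined with an initial state; the names and values below are placeholders:
import org.apache.spark.HashPartitioner
val seed = ssc.sparkContext.parallelize(List(("hello", 100)))
// true = remember the partitioner; seed supplies the initial state per key
dStream.updateStateByKey(newUpdateFunc, new HashPartitioner(4), true, seed)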
7> If the update function returns None, the key-value pair is dropped; this overload also passes the batch Time into the update function
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](updateFunc: (Time, K, Seq[V], Option[S]) => Option[S],
partitioner: Partitioner,
rememberPartitioner: Boolean,
initialRDD: Option[RDD[(K, S)]] = None): DStream[(K, S)] = ssc.withScope {
val cleanedFunc = ssc.sc.clean(updateFunc)
val newUpdateFunc = (time: Time, iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => cleanedFunc(time, t._1, t._2, t._3).map(s => (t._1, s)))
}
new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, initialRDD)
}
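A hedged sketch of this most general overload, which also hands the batch Time to the update function; the debug print is purely illustrative:
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Time
dStream.updateStateByKey(
  (time: Time, word: String, newValues: Seq[Int], state: Option[Int]) => {
    // time is the end timestamp of the current batch; printing it is just for illustration
    println(s"batch $time -> $word")
    Some(newValues.sum + state.getOrElse(0))
  },
  new HashPartitioner(4),
  true) // rememberPartitioner; initialRDD keeps its default of None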
A simple example:
val wordCountDS = wordDS.updateStateByKey((values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum //number of times this word appeared in this batch
val count = state.getOrElse(0) //previous result, i.e. the intermediate state
Some(currentCount + count)
})
// val wordCountDS = wordDS.updateStateByKey(func)
wordCountDS.print()
/**
* The anonymous function can also be pulled out on its own
* The first parameter is the list of values received for the key being updated
* The second parameter is the intermediate state
*/
// private val func = (value: Seq[Int], state: Option[Int]) => {
// val sum = value.sum + state.getOrElse(0)
// Some(sum)
// }
3. Window functions
//Only the window duration is specified
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)
//Window duration plus slide duration
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
new WindowedDStream(this, windowDuration, slideDuration)
}
//Reduce function, window duration, slide duration
def reduceByWindow(
reduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
): DStream[T] = ssc.withScope {
this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}
//Same as above, but incremental (with an inverse reduce function), which is more efficient
def reduceByWindow(
reduceFunc: (T, T) => T,
invReduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
): DStream[T] = ssc.withScope {
this.map((1, _))
.reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
.map(_._2)
}
//Counts the elements in each window; internally it calls the efficient incremental reduceByWindow
def countByWindow(
windowDuration: Duration,
slideDuration: Duration): DStream[Long] = ssc.withScope {
this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
//Similar, but counts occurrences per distinct value; internally calls reduceByKeyAndWindow
def countByValueAndWindow(
windowDuration: Duration,
slideDuration: Duration,
numPartitions: Int = ssc.sc.defaultParallelism)
(implicit ord: Ordering[T] = null)
: DStream[(T, Long)] = ssc.withScope {
this.map((_, 1L)).reduceByKeyAndWindow(
(x: Long, y: Long) => x + y,
(x: Long, y: Long) => x - y,
windowDuration,
slideDuration,
numPartitions,
(x: (T, Long)) => x._2 != 0L
)
}
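A usage sketch for the three functions above, assuming the socket stream of lines used earlier; both durations must be multiples of the batch interval, and the incremental forms need checkpointing to be enabled:
val lines = ssc.socketTextStream("linux01", 9999)
// Longest line seen over the last 6 seconds, recomputed every 3 seconds
lines.reduceByWindow((a: String, b: String) => if (a.length >= b.length) a else b, Seconds(6), Seconds(3)).print()
// How many lines fell into the window; built on the incremental reduceByWindow, so ssc.checkpoint(...) must be set
lines.countByWindow(Seconds(6), Seconds(3)).print()
// How many times each distinct line appeared in the window
lines.countByValueAndWindow(Seconds(6), Seconds(3)).print()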
//Group by key over a window
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])] = ssc.withScope {
groupByKeyAndWindow(windowDuration, self.slideDuration, defaultPartitioner())
}
//Group by key with explicit window and slide durations
def groupByKeyAndWindow(windowDuration: Duration, slideDuration: Duration)
: DStream[(K, Iterable[V])] = ssc.withScope {
groupByKeyAndWindow(windowDuration, slideDuration, defaultPartitioner())
}
//Same as above, plus a partition count (default hash partitioning)
def groupByKeyAndWindow(
windowDuration: Duration,
slideDuration: Duration,
numPartitions: Int
): DStream[(K, Iterable[V])] = ssc.withScope {
groupByKeyAndWindow(windowDuration, slideDuration, defaultPartitioner(numPartitions))
}
//Same as above, but with a custom partitioner
def groupByKeyAndWindow(
windowDuration: Duration,
slideDuration: Duration,
partitioner: Partitioner
): DStream[(K, Iterable[V])] = ssc.withScope {
val createCombiner = (v: Iterable[V]) => new ArrayBuffer[V] ++= v
val mergeValue = (buf: ArrayBuffer[V], v: Iterable[V]) => buf ++= v
val mergeCombiner = (buf1: ArrayBuffer[V], buf2: ArrayBuffer[V]) => buf1 ++= buf2
self.groupByKey(partitioner)
.window(windowDuration, slideDuration)
.combineByKey[ArrayBuffer[V]](createCombiner, mergeValue, mergeCombiner, partitioner)
.asInstanceOf[DStream[(K, Iterable[V])]]
}
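A sketch of grouping within a window, assuming a (word, 1) pair stream built the same way as in the earlier examples:
val pairs = ssc.socketTextStream("linux01", 9999).flatMap(_.split(",")).map((_, 1))
pairs.groupByKeyAndWindow(Seconds(6), Seconds(3))
  .map { case (word, ones) => (word, ones.size) } // e.g. collapse the grouped 1s back into a count
  .print()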
//Reduce function and window duration
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}
//Reduce function, window duration, and slide duration
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}
//Same as above, plus a partition count
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration,
numPartitions: Int
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
defaultPartitioner(numPartitions))
}
//Same as above, with a custom partitioner
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration,
partitioner: Partitioner
): DStream[(K, V)] = ssc.withScope {
self.reduceByKey(reduceFunc, partitioner)
.window(windowDuration, slideDuration)
.reduceByKey(reduceFunc, partitioner)
}
//Incremental computation. E.g. if the first window covers batches 1, 2, 3 and the next covers 2, 3, 4, the new result is (1 + 2 + 3) - 1 + 4: subtract what slid out, add what slid in, keep the overlap. This is more efficient.
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
invReduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration = self.slideDuration,
numPartitions: Int = ssc.sc.defaultParallelism,
filterFunc: ((K, V)) => Boolean = null
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(
reduceFunc, invReduceFunc, windowDuration,
slideDuration, defaultPartitioner(numPartitions), filterFunc
)
}
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
invReduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration,
partitioner: Partitioner,
filterFunc: ((K, V)) => Boolean
): DStream[(K, V)] = ssc.withScope {
val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
new ReducedWindowedDStream[K, V](
self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
windowDuration, slideDuration, partitioner
)
}
A simple example:
/**
* reduceByKeyAndWindow here takes three parameters: the function to apply, the window duration, and the slide duration
* The plain window function takes only the last two parameters, or just the window duration
* See each overload's parameter list for the details
*/
val value = dStream.flatMap(_.split(",")).map((_, 1))
// value.window(Seconds(6), Seconds(4)).print()
value.reduceByKeyAndWindow((value1: Int, value2: Int) => value1 + value2, Seconds(6), Seconds(4)).print()
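For completeness, a sketch of the incremental variant explained above; it reuses value from the snippet just shown and needs checkpointing enabled (the filter argument is left at its default):
// Subtract what slid out of the window and add what slid in, instead of recomputing the whole window
// Requires checkpointing, e.g. ssc.checkpoint("/data/spark/logs")
value.reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(6), Seconds(4)).print()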
The above are my learning notes. I have forgotten whose article I originally learned this from; once again, my respects to those who share their knowledge so generously.