Spark Streaming: Ways to Create a DStream
1. Creating a DStream from an RDD Queue
For testing, a DStream can be created with ssc.queueStream(queueOfRDDs): every RDD pushed into the queue is treated as one batch of data in the DStream.
Implementation
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
  // StreamingContext takes two arguments: the SparkConf and the micro-batch interval, here 3 seconds
  val ssc = new StreamingContext(sparkConf, Seconds(3))
  // Declare the queue that will feed the stream
  val rddQueue = new mutable.Queue[RDD[Int]]()
  // oneAtATime = false: consume all RDDs queued during a batch interval; the default (true) dequeues one RDD per interval
  val inputStream: InputDStream[Int] = ssc.queueStream(rddQueue, oneAtATime = false)
  val mapStream: DStream[(Int, Int)] = inputStream.map((_, 1))
  val reduceStream: DStream[(Int, Int)] = mapStream.reduceByKey(_ + _)
  reduceStream.print()
  // Start the collector
  ssc.start()
  for (i <- 1 to 5) {
    // Push an RDD into the queue
    rddQueue += ssc.sparkContext.makeRDD(seq = 1 to 5, numSlices = 10)
    Thread.sleep(2000)
  }
  // Block until the streaming context is stopped
  ssc.awaitTermination()
}
Sample output
-------------------------------------------
Time: 1650099129000 ms
-------------------------------------------
(4,2)
(1,2)
(5,2)
(2,2)
(3,2)
-------------------------------------------
Time: 1650099132000 ms
-------------------------------------------
(4,1)
(1,1)
(5,1)
(2,1)
(3,1)
-------------------------------------------
Time: 1650099135000 ms
-------------------------------------------
(4,2)
(1,2)
(5,2)
(2,2)
(3,2)
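The counts alternate between 2 and 1 because an RDD is pushed every 2 seconds while the batch interval is 3 seconds: some batches pick up two queued RDDs (each of the values 1 to 5 counted twice), others only one.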
2. Custom Data Source
To collect data from a custom source, extend Receiver and implement its onStart and onStop methods.
Implementation
import java.io.{BufferedReader, InputStreamReader}
import java.net.{ConnectException, Socket}
import java.nio.charset.StandardCharsets
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
  // StreamingContext takes two arguments: the SparkConf and the micro-batch interval, here 3 seconds
  val ssc = new StreamingContext(sparkConf, Seconds(3))
  val line: ReceiverInputDStream[String] = ssc.receiverStream(new MyReceiver("hadoop103", 9999))
  line
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()
  // Start the receiver
  ssc.start()
  // Block until the streaming context is stopped
  ssc.awaitTermination()
}
/**
 * Custom data receiver:
 * extend Receiver, fixing the element type (String here) and a storage level,
 * and take the connection parameters through the constructor.
 */
private class MyReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  private var socket: Socket = _

  override def onStart(): Unit = {
    // Read on a daemon thread so onStart returns immediately
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        receive()
      }
    }.start()
  }

  def receive(): Unit = {
    try {
      // Open the socket and read the port line by line
      socket = new Socket(host, port)
      val bf: BufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      // Note: in Scala an assignment evaluates to Unit, so the Java idiom
      // `while ((line = bf.readLine()) != null)` would loop forever; read before the test instead
      var line: String = bf.readLine()
      while (line != null) {
        // store() is provided by Receiver and hands the record to Spark for buffering
        store(line)
        line = bf.readLine()
      }
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
    }
  }

  override def onStop(): Unit = {
    synchronized {
      if (socket != null) {
        socket.close()
        socket = null
      }
    }
  }
}
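To test the receiver, start a plain TCP server on the source host before submitting the job; netcat works well (the listing assumes hadoop103):

nc -lk 9999

Each line typed into the netcat session is handed to store() and shows up in the word count printed on the next 3-second batch.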
3. Kafka Data Source
ReceiverAPI: a dedicated Executor receives the data and forwards it to the other Executors for computation. The problem is that the receiving Executor and the computing Executors run at different speeds; in particular, when the receiver outpaces the computation, the computing nodes can run out of memory. DirectAPI: the computing Executors consume the Kafka data themselves, so each controls its own rate.
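The listing below additionally needs the spark-streaming-kafka-0-10 connector on the classpath. A minimal sbt dependency, assuming Spark 3.x built against Scala 2.12 (match the version to your cluster):

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0"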
Implementation
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
  // StreamingContext takes two arguments: the SparkConf and the micro-batch interval, here 3 seconds
  val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
  val kafkaPara: Map[String, Object] = Map[String, Object](
    // Brokers of the Kafka cluster
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop103:9092,hadoop104:9092,hadoop105:9092",
    ConsumerConfig.GROUP_ID_CONFIG -> "hui",
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
  )
  // Read from Kafka
  val kfkDataDS: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
    ssc, // streaming context
    LocationStrategies.PreferConsistent, // location strategy: spread partitions evenly across available executors
    ConsumerStrategies.Subscribe[String, String](
      Set("tbg"), // Kafka topic(s) to subscribe to
      kafkaPara   // Kafka consumer configuration
    )
  )
  kfkDataDS
    .flatMap(_.value().split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()
  /**
   * Handy shell commands for testing:
   * bin/kafka-topics.sh --bootstrap-server hadoop103:9092 --list
   * bin/kafka-console-producer.sh --bootstrap-server hadoop103:9092 --topic tbg
   */
  // Start the streaming job
  ssc.start()
  // Block until the streaming context is stopped
  ssc.awaitTermination()
}
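With createDirectStream and Kafka's default enable.auto.commit (true), offsets can be committed before a batch has actually been processed. The kafka010 connector also supports committing offsets manually after each batch; a minimal sketch of that pattern under the same setup, replacing the print() above:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kfkDataDS.foreachRDD { rdd =>
  // Capture the offset ranges before transformations lose the Kafka-specific RDD type
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Per-batch work: the same word count as the DStream version
  rdd.flatMap(_.value().split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
    .collect()
    .foreach(println)
  // Commit the offsets back to Kafka only after the batch's work succeeded
  kfkDataDS.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}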
