Spark Streaming
----------------
Continuous, uninterrupted stream computation.
The Spark Streaming module implements it as micro-batch computation: the stream is
sliced into static datasets according to a batch interval (time slice).
//The batch interval is specified when the streaming context is created.
val ssc = new StreamingContext(conf, Seconds(1))
//The batch interval is already fixed on the context, so every stream created from it inherits it.
val lines = ssc.socketTextStream("s101", 8888)
...
The socket text stream receiver runs on the executor side, not on the driver.
socketTextStream execution process
-------------------------------
The StreamingContext object is created on the driver. When the context is started it creates
a JobScheduler and a ReceiverTracker in turn and calls their start methods.
In its start method the ReceiverTracker sends a start-receiver message, carrying the receiver's
information (such as the ServerSocket address), to its ReceiverTrackerEndpoint. The endpoint
extracts the message content, uses the sparkContext together with that content to build a
receiver RDD object, and finally submits that RDD as a job to the Spark cluster, so that the
receiver ends up running inside a task on a remote executor.
Windowed processing in stream computation
------------------------
Windowing extends the application on top of the batches.
Both the window length and the slide interval (the computation frequency) must be integer multiples of the batch interval.
reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5), Seconds(3))
dstream.window(Seconds(5), Seconds(3)).reduceByKey(_ + _)   //equivalent: window first, then reduceByKey
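A minimal windowed word-count sketch putting these together (assuming a 1-second batch interval and the s101:8888 socket source used elsewhere in these notes):
//batch interval 1s; window length 5s and slide interval 3s are both multiples of it
val ssc = new StreamingContext(conf, Seconds(1))
val pairs = ssc.socketTextStream("s101", 8888).flatMap(_.split(" ")).map((_, 1))
//every 3 seconds, recompute the word counts over the last 5 seconds of data
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5), Seconds(3))
windowedCounts.print()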
Partitions of a DStream
------------------------
Repartitioning a DStream repartitions each RDD inside it.
//DStream.repartition is implemented by applying transform to every RDD of the stream:
def repartition(numPartitions: Int): DStream[T] = {
    this.transform(_.repartition(numPartitions))
}
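Usage is the same as on a plain RDD, for example:
//give every batch RDD of ds1 eight partitions (ds1 is the socket text stream used in the examples below)
val ds1r = ds1.repartition(8)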
updateStateByKey()
-----------------------
Counts the number of occurrences of each word since the streaming application started.
The state update can also be combined with a window computation, as in the example below.
//a: new counts for the key in the current batch; state: the (timestamp, count) entries kept so far
def update(a: Seq[Int], state: Option[ArrayBuffer[(Long, Int)]]): Option[ArrayBuffer[(Long, Int)]] = {
    val count = a.sum
    val time = System.currentTimeMillis()
    if (state.isEmpty) {
        //first time this key is seen: start a new buffer
        val buf = ArrayBuffer[(Long, Int)]()
        buf.append((time, count))
        Some(buf)
    }
    else {
        //keep only the entries from the last 4 seconds, then append the current batch
        val buf2 = ArrayBuffer[(Long, Int)]()
        val buf = state.get
        for (t <- buf) {
            if ((time - t._1) <= 4000) {
                buf2 += t
            }
        }
        buf2.append((time, count))
        Some(buf2)
    }
}
val ds3 = ds2.window(Seconds(5),Seconds(3))
val ds4 = ds3.updateStateByKey(update _)
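Note that stateful transformations such as updateStateByKey require checkpointing to be enabled before the stream is started; a minimal addition to the snippet above (checkpoint path taken from the full examples below):
//a checkpoint directory is required for stateful transformations
ssc.checkpoint("file:///d:/java/chk")
ds4.print()
ssc.start()
ssc.awaitTermination()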
How partitions are determined in a Spark Streaming computation
-----------------------------------
A DStream's partitions are the partitions of its underlying RDDs. For receiver-based streams the
partitioning is controlled by conf.set("spark.streaming.blockInterval", "200ms"): the received data
is cut into blocks at this interval, and each block becomes one partition.
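So the number of partitions per batch RDD is roughly the batch interval divided by the block interval; a sketch with the values used in these notes:
//2s batch interval / 200ms block interval => about 10 blocks, i.e. about 10 partitions per receiver per batch
conf.set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(2))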
DStream.foreachRDD
----------------------
Operates on each RDD in the stream.
import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingForeachRDDScala {

    def createNewConnection() = {
        Class.forName("com.mysql.jdbc.Driver")
        val conn = DriverManager.getConnection("jdbc:mysql://192.168.231.1:3306/big9", "root", "root")
        conn
    }

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("wordCount")
        conf.setMaster("local[4]")

        //batch interval of 2 seconds
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("file:///d:/java/chk")

        //create the socket text stream
        val ds1 = ssc.socketTextStream("s101", 8888)
        val ds2 = ds1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

        ds2.foreachRDD(rdd => {
            rdd.foreachPartition(it => {
                //executed on the executor, once per partition
                val conn = createNewConnection()
                val ppst = conn.prepareStatement("insert into wc(word,cnt) values(?,?)")
                conn.setAutoCommit(false)
                for (e <- it) {
                    ppst.setString(1, e._1)
                    ppst.setInt(2, e._2)
                    ppst.executeUpdate()
                }
                conn.commit()
                //close the statement before the connection
                ppst.close()
                conn.close()
            })
        })

        //start the stream
        ssc.start()
        ssc.awaitTermination()
    }
}
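A further refinement suggested by the Spark Streaming programming guide is to reuse connections across partitions and batches via a lazily initialized static pool instead of opening a new connection each time; a sketch, where ConnectionPool is a hypothetical helper:
ds2.foreachRDD(rdd => {
    rdd.foreachPartition(it => {
        //ConnectionPool: hypothetical, lazily initialized static pool of JDBC connections
        val conn = ConnectionPool.getConnection()
        it.foreach(e => {
            //... write e to the database as in the example above ...
        })
        //return the connection to the pool for reuse instead of closing it
        ConnectionPool.returnConnection(conn)
    })
})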
Combining Spark Streaming with Spark SQL
--------------------------------
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingWordCountSparkSQLScala {

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("wordCount")
        conf.setMaster("local[2]")

        //batch interval of 2 seconds
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("file:///d:/java/chk")

        //create the socket text stream
        val lines = ssc.socketTextStream("s101", 8888)

        //flatten the lines into a stream of words
        val words = lines.flatMap(_.split(" "))

        words.foreachRDD(rdd => {
            //get (or lazily create) the singleton SparkSession
            val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
            import spark.implicits._
            val df1 = rdd.toDF("word")
            df1.createOrReplaceTempView("_temp")
            spark.sql("select word, count(*) from _temp group by word").show()
        })

        //start the stream
        ssc.start()
        ssc.awaitTermination()
    }
}
Kafka
--------------------
A distributed message system, written in Scala.
With a replication factor of n per partition, it can tolerate up to n - 1 broker failures.
Spark Streaming integration with Kafka
-------------------------
1. Note
    spark-streaming-kafka-0-10_2.11 is not compatible with earlier broker versions;
    spark-streaming-kafka-0-8_2.11 is compatible with brokers 0.9 and 0.10.
2. Start the Kafka cluster and create the topic.
    xkafka.sh start
3. Verify that Kafka works
    3.1) Start a console consumer
        kafka-console-consumer.sh --zookeeper s102:2181 --topic t1
    3.2) Start a console producer
        kafka-console-producer.sh --broker-list s102:9092 --topic t1
    3.3) Send messages
        ...
4. Add the Maven dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
5. Write the program
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingKafkaScala {

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("kafka")
        conf.setMaster("local[*]")

        val ssc = new StreamingContext(conf, Seconds(2))

        //Kafka consumer parameters
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "s102:9092,s103:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "g1",
            "auto.offset.reset" -> "latest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val topics = Array("topic1")
        val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            PreferConsistent,                              //location strategy
            Subscribe[String, String](topics, kafkaParams) //consumer strategy
        )

        val ds2 = stream.map(record => (record.key, record.value))
        ds2.print()

        ssc.start()
        ssc.awaitTermination()
    }
}
6. Send messages from the console producer.
The Spark-Kafka direct stream (createDirectStream) and Kafka partitions
--------------------------------
Each Kafka topic partition corresponds to one RDD partition.
Spark can limit the number of messages received per second for each partition
via the spark.streaming.kafka.maxRatePerPartition setting.
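For example (the limit value here is arbitrary):
//accept at most 1000 records per second from each Kafka partition
conf.set("spark.streaming.kafka.maxRatePerPartition", "1000")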
LocationStrategies
----------------
Location strategies
control on which executors the consumers for particular topic partitions are scheduled,
i.e. how consumer work for the topic partitions is placed across executors.
The placement is only a preference; there are three strategies (a short sketch of each follows the list):
1. PreferBrokers
    Prefer the Kafka brokers. Usable only when the Kafka brokers and the executors run on the same hosts.
2. PreferConsistent
    Prefer a consistent distribution.
    The usual choice: all partitions of the subscribed topics are spread evenly across the available executors,
    making full use of the cluster's compute resources.
3. PreferFixed
    Prefer a fixed placement.
    When the load is unbalanced, this strategy pins specified topic partitions to specified hosts; a manual-control option.
    Partitions that are not explicitly mapped fall back to strategy (2).
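A compact sketch of constructing each of the three strategies (hosts s102/s103 and topic t1 are the ones used elsewhere in these notes):
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

//1. only when executors are co-located with the Kafka brokers
val brokers = LocationStrategies.PreferBrokers

//2. the usual choice: distribute partitions evenly over executors
val consistent = LocationStrategies.PreferConsistent

//3. pin specific partitions to specific hosts; unmapped partitions fall back to PreferConsistent
val hostMap = Map(new TopicPartition("t1", 0) -> "s102",
                  new TopicPartition("t1", 1) -> "s103")
val fixed = LocationStrategies.PreferFixed(hostMap)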
ConsumerStrategies
--------------------
Consumer strategies control how the consumer objects are created and configured,
i.e. they define exactly what is consumed from Kafka: for example partitions 0 and 1 of topic t1,
or a specific offset range within specific partitions.
The class is extensible, so custom strategies can be implemented. A short sketch of the built-in ones follows the list.
1. ConsumerStrategies.Assign
    Specifies a fixed collection of partitions, i.e. a very precisely delimited scope.
    def Assign[K, V](
        topicPartitions: Iterable[TopicPartition],
        kafkaParams: collection.Map[String, Object],
        offsets: collection.Map[TopicPartition, Long])
2. ConsumerStrategies.Subscribe
    Subscribes to a fixed collection of topics.
3. ConsumerStrategies.SubscribePattern
    Uses a regular expression to select the topics of interest.
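A sketch of the three built-in strategies, reusing the kafkaParams map and topic t1 from the surrounding examples:
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

//subscribe to a fixed list of topics
val sub = ConsumerStrategies.Subscribe[String, String](Array("t1"), kafkaParams)

//subscribe to every topic whose name matches a regular expression
val subPat = ConsumerStrategies.SubscribePattern[String, String](java.util.regex.Pattern.compile("t.*"), kafkaParams)

//consume exactly the given partitions, starting from the given offsets
val tps = List(new TopicPartition("t1", 0))
val offsets = Map(new TopicPartition("t1", 0) -> 3L)
val assign = ConsumerStrategies.Assign[String, String](tps, kafkaParams, offsets)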
Consumer strategies and semantic model
-----------------------------
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingKafkaScala {

    //send a diagnostic line (ip : pid : thread : msg : object) to a socket server, to see where records are processed
    def sendInfo(msg: String, objStr: String) = {
        //ip of the current host
        val ip = java.net.InetAddress.getLocalHost.getHostAddress
        //pid of the current process
        val rr = java.lang.management.ManagementFactory.getRuntimeMXBean()
        val pid = rr.getName().split("@")(0)
        //name of the current thread
        val tname = Thread.currentThread().getName
        //object id
        val sock = new java.net.Socket("s101", 8888)
        val out = sock.getOutputStream
        val m = ip + "\t:" + pid + "\t:" + tname + "\t:" + msg + "\t:" + objStr + "\r\n"
        out.write(m.getBytes)
        out.flush()
        out.close()
    }

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("kafka")
        //conf.setMaster("spark://s101:7077")
        conf.setMaster("local[8]")

        val ssc = new StreamingContext(conf, Seconds(5))

        //Kafka consumer parameters
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "s102:9092,s103:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "g1",
            "auto.offset.reset" -> "latest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        //location strategy: pin the partitions of t1 to host s102
        val map = scala.collection.mutable.Map[TopicPartition, String]()
        map.put(new TopicPartition("t1", 0), "s102")
        map.put(new TopicPartition("t1", 1), "s102")
        map.put(new TopicPartition("t1", 2), "s102")
        map.put(new TopicPartition("t1", 3), "s102")
        val locStra = LocationStrategies.PreferFixed(map)
        val consit = LocationStrategies.PreferConsistent

        val topics = Array("t1")

        //collection of topic partitions to consume
        val tps = scala.collection.mutable.ArrayBuffer[TopicPartition]()
        tps += new TopicPartition("t1", 0)
        //tps += new TopicPartition("t2", 1)
        //tps += new TopicPartition("t3", 2)

        //starting offsets per partition
        val offsets = scala.collection.mutable.Map[TopicPartition, Long]()
        offsets.put(new TopicPartition("t1", 0), 3)
        //offsets.put(new TopicPartition("t2", 1), 3)
        //offsets.put(new TopicPartition("t3", 2), 0)

        //consumer strategy: consume exactly the given partitions from the given offsets
        val conss = ConsumerStrategies.Assign[String, String](tps, kafkaParams, offsets)

        //create the Kafka direct stream
        val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            locStra,
            conss
        )

        val ds2 = stream.map(record => {
            val t = Thread.currentThread().getName
            val key = record.key()
            val value = record.value()
            val offset = record.offset()
            val par = record.partition()
            val topic = record.topic()
            val tt = ("k:" + key, "v:" + value, "o:" + offset, "p:" + par, "t:" + topic, "T : " + t)
            //xxx(tt)                                 //placeholder: process the record here
            //sendInfo(tt.toString(), this.toString)
            tt
        })
        ds2.print()

        ssc.start()
        ssc.awaitTermination()
    }
}
Kafka consumption semantics
-------------------
1. at most once
    Each record is consumed at most once.
    Commit the offset first, then process the record:
        commit(offset)
        xxx(tt)          //process the record
    If processing fails after the offset has been committed, the record is lost.
2. at least once
    Each record is consumed at least once.
    Process the record first, then commit the offset:
        xxx(tt)          //process the record
        commit(offset)
    If the commit fails after processing, the record is processed again on restart.
3. exactly once
    Each record is processed exactly once.
    Store the results and the offsets atomically in the same transaction (e.g. in MySQL),
    and on restart use ConsumerStrategies.Assign to resume from the offsets stored there.
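For reference, a minimal at-least-once sketch with the 0-10 direct-stream API, committing offsets back to Kafka only after the batch output has been handled (the processing step is left as a placeholder; stream is the direct stream created above):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD(rdd => {
    //offset ranges covered by this batch
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    //... process rdd and write the results out (placeholder) ...
    //commit only after the output has succeeded => at least once
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})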