Spark Streaming + Kafka
Spark Streaming + Kafka
• What is Kafka, and what are its characteristics?
• What does Spark Streaming + Kafka buy us?
– decoupling
– buffering


Characteristics of a message queue: the producer-consumer pattern
• Reliability guarantees
– the queue itself does not lose data
– the consumer does not lose data: "at least once", "exactly once"

broker: n. an agent, a middleman; vt. to arrange (something) as an intermediary; vi. to negotiate as a (power) broker






Kafka deployment: brokers on node2, node3, node4, coordinated by ZooKeeper.
Start ZooKeeper on all three nodes:
/opt/sxt/zookeeper-3.4.6/bin/zkServer.sh start
Configure Kafka:
tar -zxvf kafka_2.10-0.8.2.2.tgz -C /opt/sxt/
cd /opt/sxt/kafka_2.10-0.8.2.2/config/
vi server.properties
broker.id=0 ## 0 on node2, 1 on node3, 2 on node4
log.dirs=/kafka-logs
zookeeper.connect=node2:2181,node3:2181,node4:2181
## scp the installation to node3 and node4, then adjust broker.id on each node
Try starting a single broker from the kafka bin directory on node2:
./kafka-server-start.sh ../config/server.properties
## Write a script that starts Kafka in the background
[root@node2 shells]# pwd
/root/shells
[root@node2 shells]# cat start-kafka.sh
cd /opt/sxt/kafka_2.10-0.8.2.2
nohup bin/kafka-server-start.sh config/server.properties >kafka.log 2>&1 &
## scp the script to node3 and node4
## run /root/shells/start-kafka.sh on node2, node3 and node4 (batch execute)
## the three-broker Kafka cluster is now running.

How to create a topic, produce data to it, and consume it: node2, node3 and node4 are the brokers; node1 acts as the producer and node5 as the consumer. scp the Kafka installation to node1 and node5 so they can act as clients.
Create the topic:
./kafka-topics.sh --zookeeper node2:2181,node3,node4 --create --topic t0425 --partitions 3 --replication-factor 3
[root@node1 bin]# ./kafka-console-producer.sh --topic t0425 --broker-list node2:9092,node3:9092,node4:9092
[2019-09-28 10:35:33,341] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
hello ## produce data
world
hello
world
a
b
c
d
e
[root@node5 bin]# ./kafka-console-consumer.sh --zookeeper node2,node3:2181,node4 --topic t0425
world ## consume data
hello
world
a
b
c
d ## ....
## List topics
[root@node5 bin]# ./kafka-topics.sh --zookeeper node2:2181,node4,node5 --list
t0425
[root@node2 bin]# cd /opt/sxt/zookeeper-3.4.6/bin/
[root@node2 bin]# ./zkCli.sh ## inspect the topic and the data ZooKeeper keeps for it, on node2
[zk: localhost:2181(CONNECTED) 1] ls /
[zk: localhost:2181(CONNECTED) 8] ls /brokers/topics/t0425/partitions/0/state
[zk: localhost:2181(CONNECTED) 9] get /brokers/topics/t0425/partitions/0/state
{"controller_epoch":7,"leader":1,"version":1,"leader_epoch":0,"isr":[1,2,0]}
## "isr":[1,2,0] is the in-sync replica set, used to check data completeness
[zk: localhost:2181(CONNECTED) 16] get /consumers/console-consumer-66598/offsets/t0425/0
11 ## the consumer's offset in this partition, i.e. how many messages it has consumed from it
[zk: localhost:2181(CONNECTED) 17] get /consumers/console-consumer-66598/offsets/t0425/1
5
[zk: localhost:2181(CONNECTED) 18] get /consumers/console-consumer-66598/offsets/t0425/2
0
## If you produce data again roughly 10 minutes later, another partition (e.g. partition 1) starts to grow; the console producer sticks with one partition for a while and then hashes to a different one
## How to delete a topic
[root@node5 bin]# ./kafka-topics.sh --zookeeper node2:2181,node4,node5 --delete --topic t0425
[root@node5 bin]# ./kafka-topics.sh --zookeeper node2:2181,node4,node5 --list
t0425 - marked for deletion
[root@node2 bin]# cd /kafka-logs/
[root@node2 kafka-logs]# rm -rf ./t0425*
[root@node3 ~]# cd /kafka-logs/
[root@node3 kafka-logs]# rm -rf ./t0425*
[root@node4 ~]# cd /kafka-logs/
[root@node4 kafka-logs]# rm -rf ./t0425*
[root@node2 bin]# ./zkCli.sh
[zk: localhost:2181(CONNECTED) 3] rmr /brokers/topics/t0425
[zk: localhost:2181(CONNECTED) 5] rmr /admin/delete_topics/t0425
## After about a week the topic is removed automatically; until then it can still be used.
## Describe topics
[root@node5 bin]# ./kafka-topics.sh --zookeeper node2:2181,node4,node5 --describe
Topic:t0425 PartitionCount:3 ReplicationFactor:3 Configs:
Topic: t0425 Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Topic: t0425 Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
Topic: t0425 Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
[root@node5 bin]# ./kafka-topics.sh --zookeeper node2:2181,node4,node5 --describe --topic t0425 ## describe the specified topic
Topic:t0425 PartitionCount:3 ReplicationFactor:3 Configs:
Topic: t0425 Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Topic: t0425 Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
Topic: t0425 Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
## Replicas: 1,2,0 are the brokers holding replicas of the partition; Isr: 1,2,0 is the in-sync replica set used to check data completeness
## Kafka's leader-balancing mechanism: as shown above, each partition has one leader. If the leader of partition 0 goes down, Kafka elects a new leader and keeps accepting producer data and serving consumer requests. When the original broker comes back, Kafka automatically hands partition 0 back to it.
[root@node1 bin]# ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list node2:9092 --topic t0426 --time -1 ## check the latest offset (message count) of each partition
t0426:0:0
t0426:1:0
t0426:2:0
Spark Streaming + Kafka integration has two modes: Receiver mode and Direct mode. Official docs: http://kafka.apache.org/documentation/#producerapi and, for the 0.8 version used here, http://kafka.apache.org/082/documentation.html

package com.bjsxt.sparkstreaming;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
/**
* In Receiver mode the parallelism is determined by spark.streaming.blockInterval.
* @author root
*
*/
public class SparkStreamingOnKafkaReceiver {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("SparkStreamingOnKafkaReceiver")
.setMaster("local[2]");
//enable the write-ahead log (WAL) mechanism
conf.set("spark.streaming.receiver.writeAheadLog.enable","true");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
jsc.checkpoint("./receivedata");
Map<String, Integer> topicConsumerConcurrency = new HashMap<String, Integer>();
/**
* Set the topic to read and the number of threads used to receive its data.
*/
topicConsumerConcurrency.put("t0426", 1);
/**
* The first parameter is the StreamingContext.
* The second parameter is the ZooKeeper connection info (offsets and other metadata are read from ZooKeeper while receiving Kafka data).
* The third parameter is the consumer group.
* The fourth parameter is the topics to consume and the number of threads reading each topic's partitions concurrently.
*
* Note:
* the five-argument overload of KafkaUtils.createStream also sets the receiver's storage level.
*/
// JavaPairReceiverInputDStream<String,String> lines = KafkaUtils.createStream(
// jsc,
// "node2:2181,node3:2181,node3:2181",
// "MyFirstConsumerGroup",
// topicConsumerConcurrency);
JavaPairReceiverInputDStream<String,String> lines = KafkaUtils.createStream(
jsc,
"node2:2181,node3:2181,node3:2181",
"MyFirstConsumerGroup",
topicConsumerConcurrency/*,
StorageLevel.MEMORY_AND_DISK()*/);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<Tuple2<String,String>, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
public Iterable<String> call(Tuple2<String,String> tuple) throws Exception {
System.out.println("key = " + tuple._1);
System.out.println("value = " + tuple._2);
return Arrays.asList(tuple._2.split("\t"));
}
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
/**
*
*/
private static final long serialVersionUID = 1L;
public Tuple2<String, Integer> call(String word) throws Exception {
return new Tuple2<String, Integer>(word, 1);
}
});
JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
//accumulate the values of identical keys (combined locally on the map side and again at the reducer)
/**
*
*/
private static final long serialVersionUID = 1L;
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
});
wordsCount.print(100);
jsc.start();
jsc.awaitTermination();
jsc.close();
}
}
package com.bjsxt.sparkstreaming;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import kafka.serializer.StringEncoder;
/**
* Produces test data into Kafka.
* @author root
*
*/
public class SparkStreamingDataManuallyProducerForKafka extends Thread{
static String[] channelNames = new String[]{
"Spark","Scala","Kafka","Flink","Hadoop","Storm",
"Hive","Impala","HBase","ML"
};
static String[] actionNames = new String[]{"View", "Register"};
private String topic; //the Kafka topic the data is sent to
private Producer<String, String> producerForKafka;
private static String dateToday;
private static Random random;
public SparkStreamingDataManuallyProducerForKafka(String topic){
dateToday = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
this.topic = topic;
random = new Random();
Properties properties = new Properties();
properties.put("metadata.broker.list","node2:9092,node3:9092,node4:9092");
//encoder for the message key
properties.put("key.serializer.class", StringEncoder.class.getName());
//encoder for the message value
properties.put("serializer.class", StringEncoder.class.getName());
producerForKafka = new Producer<String, String>(new ProducerConfig(properties)) ;
}
@Override
public void run() {
int counter = 0;
while(true){
counter++;
String userLog = userlogs();
// System.out.println("product:"+userLog+" ");
// each log is sent twice: once without a key and once keyed with "key-" + counter
producerForKafka.send(new KeyedMessage<String, String>(topic,userLog));
producerForKafka.send(new KeyedMessage<String, String>(topic,"key-" + counter,userLog));
//pause for 2 seconds after every two logs
if(0 == counter%2){
// counter = 0;
try {
Thread.sleep(2000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
public static void main( String[] args ){
new SparkStreamingDataManuallyProducerForKafka("t0426").start();
}
//generate a random user-log line
private static String userlogs() {
StringBuffer userLogBuffer = new StringBuffer("");
int[] unregisteredUsers = new int[]{1, 2, 3, 4, 5, 6, 7, 8};
long timestamp = new Date().getTime();
Long userID = 0L;
long pageID = 0L;
//randomly generated user ID; about 1 log in 8 gets a null user ID to simulate unregistered users
if(unregisteredUsers[random.nextInt(8)] == 1) {
userID = null;
} else {
userID = (long) random.nextInt(2000);
}
//randomly generated page ID
pageID = random.nextInt(2000);
//randomly chosen channel
String channel = channelNames[random.nextInt(10)];
//randomly chosen action
String action = actionNames[random.nextInt(2)];
userLogBuffer.append(dateToday)
.append("\t")
.append(timestamp)
.append("\t")
.append(userID)
.append("\t")
.append(pageID)
.append("\t")
.append(channel)
.append("\t")
.append(action);
// .append("\n");
System.out.println(userLogBuffer.toString());
return userLogBuffer.toString();
}
}
supervise: v. to oversee; to manage; to direct; to be in charge of; to look after
Direct mode
package com.bjsxt.sparkstreaming.util;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.UUID;
/**
* This file-copying program simulates dynamically generating txt files of the same format in the data directory,
* providing an input stream for textFileStream in Spark Streaming.
* @author root
*
*/
public class CopyFile_data {
public static void main(String[] args) throws IOException, InterruptedException {
while(true){
Thread.sleep(5000);
String uuid = UUID.randomUUID().toString();
System.out.println(uuid);
copyFile(new File("words.txt"),new File(".\\data\\"+uuid+"----words.txt"));
}
}
public static void copyFile(File fromFile, File toFile) throws IOException {
FileInputStream ins = new FileInputStream(fromFile);
FileOutputStream out = new FileOutputStream(toFile);
byte[] b = new byte[1024*1024];
int n = 0;
while ((n = ins.read(b)) != -1) {
// write only the bytes actually read, not the whole buffer
out.write(b, 0, n);
}
ins.close();
out.close();
}
}
// Checkpoint-directory recovery: run the code below once, stop it, then run it again; the data is recovered from the checkpoint directory and "Creating new context" is no longer printed.
package com.bjsxt.sparkstreaming;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import scala.Tuple2;
/**
*
* Spark standalone or Mesos with cluster deploy mode only:
* add the --supervise option when submitting the application; if the Driver dies, a new Driver is started automatically.
*
*/
public class SparkStreamingOnHDFS {
public static void main(String[] args) {
final SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkStreamingOnHDFS");
// final String checkpointDirectory = "hdfs://node1:9000/spark/SparkStreaming/CheckPoint2017";
final String checkpointDirectory = "./checkpoint";
JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
@Override
public JavaStreamingContext create() {
return createContext(checkpointDirectory,conf);
}
};
/**
* Obtain the JavaStreamingContext: first try to recover it from the specified checkpoint directory;
* if nothing can be recovered there, create it through the factory.
*/
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDirectory, factory);
jsc.start();
jsc.awaitTermination();
jsc.close();
}
// @SuppressWarnings("deprecation")
private static JavaStreamingContext createContext(String checkpointDirectory,SparkConf conf) {
// If you do not see this printed, it means the StreamingContext has
// been loaded from the checkpoint.
System.out.println("Creating new context");
SparkConf sparkConf = conf;
// Create the context with a 5 second batch interval
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(5));
// ssc.sparkContext().setLogLevel("WARN");
/**
* The checkpoint saves:
* 1. the configuration
* 2. the DStream operation logic
* 3. the job execution progress
* 4. offsets
*/
ssc.checkpoint(checkpointDirectory);
/**
* Monitors a directory (here on HDFS) and only sees changes in the set of files; appended content is not picked up.
* Only files newly added to the directory are detected; removed files and modified file contents are not.
*/
// JavaDStream<String> lines = ssc.textFileStream("hdfs://node1:9000/spark/sparkstreaming");
JavaDStream<String> lines = ssc.textFileStream("./data");
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
});
JavaPairDStream<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s.trim(), 1);
}
});
JavaPairDStream<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
counts.print();
// counts.filter(new Function<Tuple2<String,Integer>, Boolean>() {
//
// /**
// *
// */
// private static final long serialVersionUID = 1L;
//
// @Override
// public Boolean call(Tuple2<String, Integer> v1) throws Exception {
// System.out.println("*************************");
// return true;
// }
// }).print();
return ssc;
}
}
package com.bjsxt.sparkstreaming;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import kafka.serializer.DefaultEncoder;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
/**
* Parallelism:
* 1. The lines DStream wraps RDDs whose partition count matches the partition count of the topic being read.
* 2. The data read from Kafka is wrapped in a DStream, and that DStream can be repartitioned with repartition(numPartitions); see the sketch after this class.
*
* @author root
*
*/
public class SparkStreamingOnKafkaDirected {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkStreamingOnKafkaDirected");
// conf.set("spark.streaming.backpressure.enabled", "false");
// conf.set("spark.streaming.kafka.maxRatePerPartition ", "100");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
/**
* The checkpoint is optional. Without it, offsets are not persisted (they only exist in memory); with it, a copy of the offsets is also kept in the checkpoint directory. It is usually set.
*/
jsc.checkpoint("./checkpoint");
Map<String, String> kafkaParameters = new HashMap<String, String>();
kafkaParameters.put("metadata.broker.list", "node2:9092,node3:9092,node4:9092");
// kafkaParameters.put("auto.offset.reset", "smallest");
HashSet<String> topics = new HashSet<String>();
topics.add("t0426");
JavaPairInputDStream<String,String> lines = KafkaUtils.createDirectStream(jsc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParameters,
topics);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<Tuple2<String,String>, String>() { //in Scala, thanks to SAM conversion, this can be written as val words = lines.flatMap { line => line.split(" ")}
/**
*
*/
private static final long serialVersionUID = 1L;
public Iterable<String> call(Tuple2<String,String> tuple) throws Exception {
return Arrays.asList(tuple._2.split("\t"));
}
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
/**
*
*/
private static final long serialVersionUID = 1L;
public Tuple2<String, Integer> call(String word) throws Exception {
return new Tuple2<String, Integer>(word, 1);
}
});
JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { //accumulate the values of identical keys (combined locally on the map side and again at the reducer)
/**
*
*/
private static final long serialVersionUID = 1L;
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
});
wordsCount.print();
jsc.start();
jsc.awaitTermination();
jsc.close();
}
}
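As referenced in the class comment above, the direct stream has exactly as many RDD partitions as the Kafka topic, so extra processing parallelism has to come from repartitioning. A minimal sketch, assuming the same lines DStream as in SparkStreamingOnKafkaDirected and an arbitrary target of 6 partitions:

// repartition the DStream read from Kafka before the heavier transformations
JavaPairDStream<String, String> repartitioned = lines.repartition(6);
JavaDStream<String> words = repartitioned.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
    private static final long serialVersionUID = 1L;
    public Iterable<String> call(Tuple2<String, String> tuple) throws Exception {
        return Arrays.asList(tuple._2.split("\t"));
    }
});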
How to set up Driver HA? Spark standalone or Mesos with cluster deploy mode only: add the --supervise option when submitting the application; if the Driver dies, a new Driver is started automatically. Next question: in Direct mode, how should the message offsets that Spark Streaming reads (and keeps, e.g. in ZooKeeper) be managed?
Direct mode with automatic offset management
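The notes do not include a separate class for this variant, so here is a minimal sketch (not the course's own code) of what "automatic" management looks like: the offsets live in the checkpoint directory, and on restart JavaStreamingContext.getOrCreate rebuilds the context and resumes from them. It combines the getOrCreate pattern from SparkStreamingOnHDFS with the direct stream from SparkStreamingOnKafkaDirected; the class name and checkpoint path are illustrative.

package com.bjsxt.sparkstreaming;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.KafkaUtils;
/**
 * Sketch only: Direct mode with offsets managed automatically through the checkpoint.
 */
public class SparkStreamingOnKafkaDirectAutoOffset {
    public static void main(String[] args) {
        // illustrative checkpoint path; the whole stream must be built inside the factory for recovery to work
        final String checkpointDirectory = "./checkpoint-direct";
        JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
            @Override
            public JavaStreamingContext create() {
                SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkStreamingOnKafkaDirectAutoOffset");
                JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
                jsc.checkpoint(checkpointDirectory);
                Map<String, String> kafkaParams = new HashMap<String, String>();
                kafkaParams.put("metadata.broker.list", "node2:9092,node3:9092,node4:9092");
                HashSet<String> topics = new HashSet<String>();
                topics.add("t0426");
                JavaPairInputDStream<String, String> lines = KafkaUtils.createDirectStream(
                        jsc, String.class, String.class,
                        StringDecoder.class, StringDecoder.class,
                        kafkaParams, topics);
                lines.print();
                return jsc;
            }
        };
        // recover the context (and the offsets it checkpointed) if the checkpoint exists, otherwise create it fresh
        JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDirectory, factory);
        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}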
Produce some test data (using the same SparkStreamingDataManuallyProducerForKafka producer shown above).
Custom offset management: the main class
package com.manage;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import com.google.common.collect.ImmutableMap;
import com.manage.getOffset.GetTopicOffsetFromKafkaBroker;
import com.manage.getOffset.GetTopicOffsetFromZookeeper;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import kafka.cluster.Broker;
import com.fasterxml.jackson.databind.ObjectMapper;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetRequest;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class UseZookeeperManageOffset {
/**
* Log via log4j; UseZookeeperManageOffset.class identifies the class producing the log messages.
*/
static final Logger logger = Logger.getLogger(UseZookeeperManageOffset.class);
public static void main(String[] args) {
/**
* Load the log4j configuration file so logging works.
*/
ProjectUtil.LoadLogConfig();
logger.info("project is starting...");
/**
* Get from the Kafka cluster the largest produced offset of each partition of the topic.
*/
Map<TopicAndPartition, Long> topicOffsets = GetTopicOffsetFromKafkaBroker.getTopicOffsets("node2:9092,node3:9092,node4:9092", "t0426");
/**
* Get from ZooKeeper the offset this consumer group has consumed up to for each partition of the topic.
*/
Map<TopicAndPartition, Long> consumerOffsets =
GetTopicOffsetFromZookeeper.getConsumerOffsets("node2:2181,node3:2181,node4:2181","ConsumerGroup","t0426");
/**
* Merge the two offset maps obtained above.
* The idea: if consumer offsets can be read from ZooKeeper, the ZooKeeper offsets take precedence.
* Otherwise, i.e. ZooKeeper has no offsets for this consumer group and topic, this is the group's
* first time consuming the topic, and the offsets are set to the latest positions in the topic.
*/
if(null!=consumerOffsets && consumerOffsets.size()>0){
topicOffsets.putAll(consumerOffsets);
}
/**
* Uncommenting the code below resets every partition of the current topic in topicOffsets to 0, i.e. consumption starts from the beginning.
*/
// for(Map.Entry<TopicAndPartition, Long> item:topicOffsets.entrySet()){
// item.setValue(0l);
// }
/**
* Build the Spark Streaming program so it consumes messages starting from these offsets.
*/
JavaStreamingContext jsc = SparkStreamingDirect.getStreamingContext(topicOffsets,"ConsumerGroup");
jsc.start();
jsc.awaitTermination();
jsc.close();
}
}
package com.manage;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
public class ProjectUtil {
/**
* Logging configured via log4j.
*/
static final Logger logger = Logger.getLogger(UseZookeeperManageOffset.class);
/**
* Load the configured log4j.properties. By default it is read from src; if log4j.properties is placed elsewhere it has to be loaded manually.
*/
public static void LoadLogConfig() {
PropertyConfigurator.configure("./resource/log4j.properties");
// PropertyConfigurator.configure("d:/eclipse4.7WS/SparkStreaming_Kafka_Manage/resource/log4j.properties");
}
/**
* Load the configuration file.
* The directory containing config.properties must be marked as a resource directory.
* @return
*/
public static Properties loadProperties() {
Properties props = new Properties();
InputStream inputStream = Thread.currentThread().getContextClassLoader().getResourceAsStream("config.properties");
if(null != inputStream) {
try {
props.load(inputStream);
} catch (IOException e) {
logger.error(String.format("Config.properties file not found in the classpath"));
}
}
return props;
}
public static void main(String[] args) {
Properties props = loadProperties();
String value = props.getProperty("hello");
System.out.println(value);
}
}
package com.manage.getOffset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import com.google.common.collect.ImmutableMap;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.cluster.Broker;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetRequest;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;
/**
* Kafka must be running before this test.
* @author root
*
*/
public class GetTopicOffsetFromKafkaBroker {
public static void main(String[] args) {
Map<TopicAndPartition, Long> topicOffsets = getTopicOffsets("node2:9092,node3:9092,node4:9092", "t0426");
Set<Entry<TopicAndPartition, Long>> entrySet = topicOffsets.entrySet();
for(Entry<TopicAndPartition, Long> entry : entrySet) {
TopicAndPartition topicAndPartition = entry.getKey();
Long offset = entry.getValue();
String topic = topicAndPartition.topic();
int partition = topicAndPartition.partition();
System.out.println("topic = "+topic+",partition = "+partition+",offset = "+offset);
}
}
/**
* Get from the Kafka cluster the offset at which producers are currently writing each partition of the topic.
* @param KafkaBrokerServer
* @param topic
* @return
*/
public static Map<TopicAndPartition,Long> getTopicOffsets(String KafkaBrokerServer, String topic){
Map<TopicAndPartition,Long> retVals = new HashMap<TopicAndPartition,Long>();
for(String broker:KafkaBrokerServer.split(",")){
SimpleConsumer simpleConsumer = new SimpleConsumer(broker.split(":")[0],Integer.valueOf(broker.split(":")[1]), 64*10000,1024,"consumer");
TopicMetadataRequest topicMetadataRequest = new TopicMetadataRequest(Arrays.asList(topic));
TopicMetadataResponse topicMetadataResponse = simpleConsumer.send(topicMetadataRequest);
for (TopicMetadata metadata : topicMetadataResponse.topicsMetadata()) {
for (PartitionMetadata part : metadata.partitionsMetadata()) {
Broker leader = part.leader();
if (leader != null) {
TopicAndPartition topicAndPartition = new TopicAndPartition(topic, part.partitionId());
PartitionOffsetRequestInfo partitionOffsetRequestInfo = new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 10000);
OffsetRequest offsetRequest = new OffsetRequest(ImmutableMap.of(topicAndPartition, partitionOffsetRequestInfo), kafka.api.OffsetRequest.CurrentVersion(), simpleConsumer.clientId());
OffsetResponse offsetResponse = simpleConsumer.getOffsetsBefore(offsetRequest);
if (!offsetResponse.hasError()) {
long[] offsets = offsetResponse.offsets(topic, part.partitionId());
retVals.put(topicAndPartition, offsets[0]);
}
}
}
}
simpleConsumer.close();
}
return retVals;
}
}
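UseZookeeperManageOffset also imports com.manage.getOffset.GetTopicOffsetFromZookeeper, whose source is not reproduced in these notes. Below is a minimal sketch of what that helper presumably looks like: it reads back, with Curator, the offsets that SparkStreamingDirect (next) writes under /consumers/<groupID>/offsets/<topic>/<partition>. The package and method signature follow the import and call in UseZookeeperManageOffset; the body is an assumption, not the original code.

package com.manage.getOffset;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import kafka.common.TopicAndPartition;
public class GetTopicOffsetFromZookeeper {
    public static Map<TopicAndPartition, Long> getConsumerOffsets(String zkServers, String groupID, String topic) {
        Map<TopicAndPartition, Long> retVals = new HashMap<TopicAndPartition, Long>();
        CuratorFramework curatorFramework = CuratorFrameworkFactory.builder()
                .connectString(zkServers).connectionTimeoutMs(1000)
                .sessionTimeoutMs(10000).retryPolicy(new RetryUntilElapsed(1000, 1000)).build();
        curatorFramework.start();
        try {
            String nodePath = "/consumers/" + groupID + "/offsets/" + topic;
            if (curatorFramework.checkExists().forPath(nodePath) != null) {
                // one child znode per partition, each holding the JSON-encoded untilOffset written by SparkStreamingDirect
                List<String> partitions = curatorFramework.getChildren().forPath(nodePath);
                for (String partition : partitions) {
                    byte[] data = curatorFramework.getData().forPath(nodePath + "/" + partition);
                    Long offset = Long.valueOf(new String(data));
                    retVals.put(new TopicAndPartition(topic, Integer.parseInt(partition)), offset);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            curatorFramework.close();
        }
        return retVals;
    }
}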
package com.manage;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import com.fasterxml.jackson.databind.ObjectMapper;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class SparkStreamingDirect {
public static JavaStreamingContext getStreamingContext(Map<TopicAndPartition, Long> topicOffsets,String groupID){
// local == local[1]
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingOnKafkaDirect");
conf.set("spark.streaming.kafka.maxRatePerPartition", "10"); // 每个分区每批次读取10条
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
// jsc.checkpoint("/checkpoint");
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list","node2:9092,node3:9092,node4:9092");
// kafkaParams.put("group.id","MyFirstConsumerGroup");
for(Map.Entry<TopicAndPartition,Long> entry:topicOffsets.entrySet()){
System.out.println(entry.getKey().topic()+"\t"+entry.getKey().partition()+"\t"+entry.getValue());
}
JavaInputDStream<String> message = KafkaUtils.createDirectStream(
jsc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
String.class,
kafkaParams,
topicOffsets,
new Function<MessageAndMetadata<String,String>,String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
public String call(MessageAndMetadata<String, String> v1)throws Exception {
return v1.message();
}
}
);
final AtomicReference<OffsetRange[]> offsetRanges = new AtomicReference<>();
JavaDStream<String> lines = message.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public JavaRDD<String> call(JavaRDD<String> rdd) throws Exception {
OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
offsetRanges.set(offsets);
return rdd;
}
}
);
message.foreachRDD(new VoidFunction<JavaRDD<String>>(){
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public void call(JavaRDD<String> t) throws Exception {
ObjectMapper objectMapper = new ObjectMapper();
CuratorFramework curatorFramework = CuratorFrameworkFactory.builder()
.connectString("node2:2181,node3:2181,node4:2181").connectionTimeoutMs(1000)
.sessionTimeoutMs(10000).retryPolicy(new RetryUntilElapsed(1000, 1000)).build();
curatorFramework.start();
for (OffsetRange offsetRange : offsetRanges.get()) {
String topic = offsetRange.topic();
int partition = offsetRange.partition();
long fromOffset = offsetRange.fromOffset();
long untilOffset = offsetRange.untilOffset();
final byte[] offsetBytes = objectMapper.writeValueAsBytes(offsetRange.untilOffset());
String nodePath = "/consumers/"+groupID+"/offsets/" + offsetRange.topic()+ "/" + offsetRange.partition();
System.out.println("nodePath = "+nodePath);
System.out.println("topic="+topic + ",partition + "+partition + ",fromOffset = "+fromOffset+",untilOffset="+untilOffset);
if(curatorFramework.checkExists().forPath(nodePath)!=null){
curatorFramework.setData().forPath(nodePath,offsetBytes);
}else{
curatorFramework.create().creatingParentsIfNeeded().forPath(nodePath, offsetBytes);
}
}
curatorFramework.close();
}
});
lines.print();
return jsc;
}
}
Manage the offsets yourself, e.g. by saving them to ZooKeeper or MySQL. Then, when the Spark Streaming logic changes and the code is modified, consumption can resume from the positions you saved yourself.
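For the MySQL variant mentioned above, here is a minimal sketch of an offset store that could replace the Curator writes in SparkStreamingDirect's foreachRDD. The JDBC URL, credentials and the stream_offsets table layout are assumptions, not part of the course material.

package com.manage;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.streaming.kafka.OffsetRange;
public class MysqlOffsetStore {
    // Persist the untilOffset of every OffsetRange after a batch has been processed.
    // Assumed table: stream_offsets(group_id, topic, partition_id, until_offset) with a unique key on (group_id, topic, partition_id).
    public static void saveOffsets(String groupID, OffsetRange[] offsetRanges) throws Exception {
        // older MySQL drivers may need Class.forName("com.mysql.jdbc.Driver") first; URL and credentials are assumed
        Connection conn = DriverManager.getConnection("jdbc:mysql://node1:3306/streaming", "root", "123456");
        String sql = "REPLACE INTO stream_offsets(group_id, topic, partition_id, until_offset) VALUES(?,?,?,?)";
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            for (OffsetRange range : offsetRanges) {
                ps.setString(1, groupID);
                ps.setString(2, range.topic());
                ps.setInt(3, range.partition());
                ps.setLong(4, range.untilOffset());
                ps.executeUpdate();
            }
        } finally {
            ps.close();
            conn.close();
        }
    }
}

On startup, the matching read would SELECT until_offset per partition into a Map<TopicAndPartition, Long>, in the same way GetTopicOffsetFromZookeeper does for ZooKeeper.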


