[Zuo Yang Deep Dive] | kafka_exporter v1.9.0 Source Code Series | A Complete Analysis of Its Architecture and Implementation
https://github.com/danielqsj/kafka_exporter/releases/tag/v1.9.0
Kafka is the de facto standard for distributed message queues, and observability is central to keeping a cluster running stably. Kafka Exporter is the most widely used Kafka metrics collector in the Prometheus ecosystem, covering cluster-, topic-, and consumer-group-level metrics. This article dissects Kafka Exporter's architecture and implementation from the source code, helping developers understand its core logic so they can extend it or troubleshoot it quickly.
1. What Kafka Exporter Does
Kafka Exporter's goal is to collect Kafka cluster metrics following Prometheus conventions and expose them over HTTP/HTTPS. Its core capabilities:
- Cluster level: broker count, broker info, and other basic metrics
- Topic level: partition count, offsets, replica count, ISR, under-replication
- Consumer-group level: offsets, consumer lag, member count
- Authentication: SASL/PLAIN, SCRAM, Kerberos, AWS IAM, TLS
- Compatibility: ZooKeeper mode (legacy Kafka) and the modern consumer-group API
2. Overall Architecture
The architecture follows three design principles: modular, interface-driven, and configuration-driven. It breaks down into 5 core modules:
┌─────────────────────────────────────────────────────────────┐
│                  CLI flag parsing layer                      │
│   (kingpin/flag) parses flags → builds the kafkaOpts struct  │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                  Kafka client layer                          │
│   (sarama) connections, authentication, metadata management  │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                  Metrics collection core                     │
│   (Exporter struct) implements prometheus.Collector          │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                  Concurrency control layer                   │
│   (worker pool / mutexes) bounds load on the cluster         │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                  HTTP server layer                           │
│   (net/http) /metrics endpoint, health check, HTTPS          │
└─────────────────────────────────────────────────────────────┘
- Modular: functionality is split across the 5 modules above, each with a single responsibility
- Interface-driven: the core Exporter struct implements Prometheus's Collector interface, so it plugs directly into the Prometheus ecosystem
- Configuration-driven: every behavior is controlled by command-line flags, with no hard-coding, so it adapts to any environment
- Performance-first: metadata caching, a worker pool, and shared results for non-concurrent scrapes make it suitable for large clusters
- Compatibility: multiple Kafka protocol versions, multiple authentication mechanisms, and both ZooKeeper and modern consumer-group modes
3. Reading the Core Modules
3.1 Imports and Constants
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L45
package main

import (
	"context"   // context handling, used when generating AWS IAM tokens
	"crypto/tls"  // TLS encryption
	"crypto/x509" // X.509 certificate parsing
	"flag"      // basic command-line flag parsing
	"fmt"       // formatted output
	"log"       // basic logging
	"net/http"  // HTTP server
	"os"        // OS interaction
	"regexp"    // regular expressions (topic/consumer-group filtering)
	"strconv"   // string/number conversion
	"strings"   // string handling
	"sync"      // synchronization primitives (locks, WaitGroup)
	"time"      // time handling

	"github.com/IBM/sarama"                                    // core Kafka client library for Go
	kingpin "github.com/alecthomas/kingpin/v2"                 // enhanced command-line flag parsing
	"github.com/aws/aws-msk-iam-sasl-signer-go/signer"         // AWS MSK IAM authentication
	"github.com/krallistic/kazoo-go"                           // ZooKeeper client (legacy consumer groups)
	"github.com/pkg/errors"                                    // error wrapping (adds context)
	"github.com/prometheus/client_golang/prometheus"           // Prometheus metrics core library
	"github.com/prometheus/client_golang/prometheus/promhttp"  // Prometheus HTTP handler
	plog "github.com/prometheus/common/promlog"                // Prometheus logging config
	plogflag "github.com/prometheus/common/promlog/flag"       // Prometheus logging flags
	versionCollector "github.com/prometheus/client_golang/prometheus/collectors/version" // version metrics collector
	"github.com/prometheus/common/version"                     // version info
	"github.com/rcrowley/go-metrics"                           // metrics library (disabled here)
	"k8s.io/klog/v2"                                           // K8s-style leveled logging
)

// Constants: metric namespace and client ID
const (
	namespace = "kafka"          // Prometheus metric prefix (e.g. kafka_brokers)
	clientID  = "kafka_exporter" // Kafka client identifier
)

// Log-level constants
const (
	INFO  = 0 // normal logs
	DEBUG = 1 // debug logs
	TRACE = 2 // trace logs (most verbose)
)
Key points:
- sarama is the de facto standard Kafka client for Go; nearly every Kafka project written in Go builds on it
- kingpin is more capable than the standard flag package, supporting defaults, type validation, and subcommands
- klog offers leveled logging (V(1), V(2)), useful for controlling verbosity per environment
- namespace implements the Prometheus metric-naming convention and keeps metric names unique (e.g. kafka_brokers)
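As a hedged illustration of the naming convention (not code from the exporter itself), the namespace prefix works the way Prometheus's BuildFQName helper does: the non-empty parts of namespace, subsystem, and metric name are joined with underscores.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName mirrors the behavior of prometheus.BuildFQName: it joins the
// non-empty parts of namespace, subsystem, and name with underscores.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := []string{}
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	fmt.Println(fqName("kafka", "", "brokers"))          // kafka_brokers
	fmt.Println(fqName("kafka", "consumergroup", "lag")) // kafka_consumergroup_lag
}
```

This is why every metric the exporter emits starts with `kafka_`.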
3.2 Prometheus Metric Descriptors
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L45
var (
	// Cluster-level metrics
	clusterBrokers    *prometheus.Desc // total number of Kafka brokers
	clusterBrokerInfo *prometheus.Desc // broker info (ID, address)

	// Topic-level metrics
	topicPartitions                    *prometheus.Desc // partition count per topic
	topicCurrentOffset                 *prometheus.Desc // newest offset per topic partition
	topicOldestOffset                  *prometheus.Desc // oldest offset per topic partition
	topicPartitionLeader               *prometheus.Desc // leader broker ID per partition
	topicPartitionReplicas             *prometheus.Desc // replica count per partition
	topicPartitionInSyncReplicas       *prometheus.Desc // in-sync replica (ISR) count per partition
	topicPartitionUsesPreferredReplica *prometheus.Desc // whether the preferred replica is in use (1=yes, 0=no)
	topicUnderReplicatedPartition      *prometheus.Desc // whether the partition is under-replicated (1=yes, 0=no)

	// Consumer-group-level metrics
	consumergroupCurrentOffset    *prometheus.Desc // current offset per consumer group
	consumergroupCurrentOffsetSum *prometheus.Desc // sum of offsets across a topic's partitions
	consumergroupLag              *prometheus.Desc // consumer-group lag
	consumergroupLagSum           *prometheus.Desc // sum of lag across a topic's partitions
	consumergroupLagZookeeper     *prometheus.Desc // consumer-group lag in ZooKeeper mode
	consumergroupMembers          *prometheus.Desc // consumer-group member count
)
Key points:
- prometheus.Desc is a metric's identity metadata: it records the metric's name, its purpose, and which labels it carries
- The label design follows Prometheus best practices:
  - Cluster level: no labels / basic labels
  - Topic level: topic and partition labels
  - Consumer-group level: consumergroup, topic, and partition labels
3.3 Core Struct Definitions
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L65
// Exporter: the core Kafka metrics exporter struct (implements prometheus.Collector)
type Exporter struct {
	client                  sarama.Client              // Kafka client (sarama core)
	topicFilter             *regexp.Regexp             // topic include filter
	topicExclude            *regexp.Regexp             // topic exclude filter
	groupFilter             *regexp.Regexp             // consumer-group include filter
	groupExclude            *regexp.Regexp             // consumer-group exclude filter
	mu                      sync.Mutex                 // general-purpose mutex (protects shared state)
	useZooKeeperLag         bool                       // fetch consumer-group lag from ZooKeeper
	zookeeperClient         *kazoo.Kazoo               // ZooKeeper client
	nextMetadataRefresh     time.Time                  // time of the next metadata refresh
	metadataRefreshInterval time.Duration              // metadata refresh interval
	offsetShowAll           bool                       // show offsets for all consumer groups (including disconnected ones)
	topicWorkers            int                        // number of goroutines collecting topic metrics
	allowConcurrent         bool                       // allow concurrent collection (disable on large clusters)
	sgMutex                 sync.Mutex                 // mutex dedicated to scrape coordination
	sgWaitCh                chan struct{}              // channel that waiting scrapes block on
	sgChans                 []chan<- prometheus.Metric // metric channels of the waiting scrapes
	consumerGroupFetchAll   bool                       // fetch all consumer groups in one call (Kafka 2.0+)
}
// kafkaOpts: Kafka connection configuration (aggregates all command-line flags)
type kafkaOpts struct {
	uri                      []string // Kafka broker address list
	useSASL                  bool     // enable SASL authentication
	useSASLHandshake         bool     // SASL handshake (disable for non-Kafka proxies)
	saslUsername             string   // SASL username
	saslPassword             string   // SASL password
	saslMechanism            string   // SASL mechanism (scram-sha256/512, gssapi, awsiam, plain)
	saslDisablePAFXFast      bool     // disable PA_FX_FAST for Kerberos
	saslAwsRegion            string   // AWS MSK region
	useTLS                   bool     // enable TLS for the Kafka client
	tlsServerName            string   // TLS server name (certificate verification)
	tlsCAFile                string   // TLS CA file
	tlsCertFile              string   // TLS client certificate
	tlsKeyFile               string   // TLS client private key
	serverUseTLS             bool     // enable TLS for the exporter's own HTTP server
	serverMutualAuthEnabled  bool     // enable mutual TLS on the exporter
	serverTlsCAFile          string   // exporter TLS CA file
	serverTlsCertFile        string   // exporter TLS certificate
	serverTlsKeyFile         string   // exporter TLS private key
	tlsInsecureSkipTLSVerify bool     // skip TLS certificate verification (testing only)
	kafkaVersion             string   // Kafka version (protocol compatibility)
	useZooKeeperLag          bool     // fetch lag from ZooKeeper
	uriZookeeper             []string // ZooKeeper address list
	labels                   string   // custom labels (e.g. cluster name)
	metadataRefreshInterval  string   // metadata refresh interval (as a string)
	serviceName              string   // Kerberos service name
	kerberosConfigPath       string   // Kerberos config path
	realm                    string   // Kerberos realm
	keyTabPath               string   // Kerberos keytab file
	kerberosAuthType         string   // Kerberos auth type (keytab/user)
	offsetShowAll            bool     // show offsets for all consumer groups
	topicWorkers             int      // number of topic collection goroutines
	allowConcurrent          bool     // allow concurrent collection
	allowAutoTopicCreation   bool     // allow automatic topic creation
	verbosityLogLevel        int      // log verbosity level
}
// MSKAccessTokenProvider: token provider for AWS MSK IAM auth (implements sarama.AccessTokenProvider)
type MSKAccessTokenProvider struct {
	region string // AWS region
}
Key points:
- Exporter implements the prometheus.Collector interface (the Describe and Collect methods), the core contract for Prometheus collectors
- kafkaOpts is the classic "configuration aggregation" pattern: every command-line flag lands in one struct, avoiding parameter explosion
- MSKAccessTokenProvider adapts sarama's AccessTokenProvider interface so AWS IAM authentication tokens can be generated on the fly
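To make the Collector contract concrete, here is a minimal, self-contained sketch of the Describe/Collect shape that Exporter implements. The Desc, Metric, and Collector types below are local stand-ins for illustration, not the real prometheus package:

```go
package main

import "fmt"

// Local stand-ins for prometheus.Desc / prometheus.Metric, just to show the shape.
type Desc struct{ Name string }
type Metric struct {
	Desc  *Desc
	Value float64
}

// Collector mirrors the two-method prometheus.Collector contract.
type Collector interface {
	Describe(ch chan<- *Desc)
	Collect(ch chan<- Metric)
}

// brokerCollector plays the role of Exporter: one descriptor, one gauge value.
type brokerCollector struct {
	brokers *Desc
	count   func() int // pluggable data source, like sarama's client.Brokers()
}

func (c *brokerCollector) Describe(ch chan<- *Desc) { ch <- c.brokers }
func (c *brokerCollector) Collect(ch chan<- Metric) {
	ch <- Metric{Desc: c.brokers, Value: float64(c.count())}
}

// gather drains a collector the way the Prometheus registry does on each scrape.
func gather(c Collector) []Metric {
	ch := make(chan Metric)
	done := make(chan []Metric)
	go func() {
		var out []Metric
		for m := range ch {
			out = append(out, m)
		}
		done <- out
	}()
	c.Collect(ch)
	close(ch)
	return <-done
}

func main() {
	c := &brokerCollector{brokers: &Desc{Name: "kafka_brokers"}, count: func() int { return 3 }}
	for _, m := range gather(c) {
		fmt.Printf("%s %v\n", m.Desc.Name, m.Value) // kafka_brokers 3
	}
}
```

The real Exporter follows exactly this channel-based flow, just with many descriptors and a Kafka client behind the count.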
3.4 Certificate Checks
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L133
// CanReadCertAndKey: checks that the certificate and key files are readable and exist as a pair
func CanReadCertAndKey(certPath, keyPath string) (bool, error) {
	certReadable := canReadFile(certPath) // check the certificate file
	keyReadable := canReadFile(keyPath)   // check the private-key file
	if !certReadable && !keyReadable {
		return false, nil // neither exists: return false without an error
	}
	if !certReadable {
		return false, fmt.Errorf("error reading %s, certificate and key must be supplied as a pair", certPath)
	}
	if !keyReadable {
		return false, fmt.Errorf("error reading %s, certificate and key must be supplied as a pair", keyPath)
	}
	return true, nil
}
// canReadFile: checks that a file exists and is readable
func canReadFile(path string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close() // always close the handle to avoid leaking it
	return true
}
Key points:
- The certificate check underpins TLS authentication, ensuring client/server certificates exist as a pair
- defer f.Close() is idiomatic Go: the resource is released even if the function returns early
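A quick self-contained exercise of the pair-checking logic, with the two functions reproduced from the source above and driven against temporary files:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// canReadFile / CanReadCertAndKey reproduced from the exporter source above.
func canReadFile(path string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close()
	return true
}

func CanReadCertAndKey(certPath, keyPath string) (bool, error) {
	certReadable := canReadFile(certPath)
	keyReadable := canReadFile(keyPath)
	if !certReadable && !keyReadable {
		return false, nil // neither exists: not an error, just "no client cert configured"
	}
	if !certReadable {
		return false, fmt.Errorf("error reading %s, certificate and key must be supplied as a pair", certPath)
	}
	if !keyReadable {
		return false, fmt.Errorf("error reading %s, certificate and key must be supplied as a pair", keyPath)
	}
	return true, nil
}

func main() {
	dir, _ := os.MkdirTemp("", "certs")
	defer os.RemoveAll(dir)
	cert := filepath.Join(dir, "client.pem")
	key := filepath.Join(dir, "client.key")

	ok, err := CanReadCertAndKey(cert, key) // neither file exists
	fmt.Println(ok, err)                    // false <nil>

	_ = os.WriteFile(cert, []byte("dummy"), 0o600)
	_, err = CanReadCertAndKey(cert, key) // cert without key: an error
	fmt.Println(err != nil)               // true

	_ = os.WriteFile(key, []byte("dummy"), 0o600)
	ok, _ = CanReadCertAndKey(cert, key) // both present
	fmt.Println(ok)                      // true
}
```

The three-way outcome (absent pair, broken pair, complete pair) is what lets mutual TLS remain optional without silently half-configuring it.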
3.5 Exporter Initialization (the Core)
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L166
// NewExporter: creates and initializes an Exporter instance
func NewExporter(opts kafkaOpts, topicFilter string, topicExclude string, groupFilter string, groupExclude string) (*Exporter, error) {
	var zookeeperClient *kazoo.Kazoo // ZooKeeper client (initialized on demand)
	config := sarama.NewConfig()     // create the sarama config
	config.ClientID = clientID       // set the Kafka client ID
	// Parse the Kafka version (protocol compatibility)
	kafkaVersion, err := sarama.ParseKafkaVersion(opts.kafkaVersion)
	if err != nil {
		return nil, err
	}
	config.Version = kafkaVersion
	// ========== SASL authentication ==========
	if opts.useSASL {
		opts.saslMechanism = strings.ToLower(opts.saslMechanism) // normalize to lowercase (accept mixed-case input)
		saslPassword := opts.saslPassword
		if saslPassword == "" {
			saslPassword = os.Getenv("SASL_USER_PASSWORD") // fall back to the environment (keeps the secret off the command line)
		}
		// Configure per SASL mechanism
		switch opts.saslMechanism {
		case "scram-sha512": // SCRAM-SHA512
			config.Net.SASL.SCRAMClientGeneratorFunc = func() sarama.SCRAMClient { return &XDGSCRAMClient{HashGeneratorFcn: SHA512} }
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeSCRAMSHA512)
		case "scram-sha256": // SCRAM-SHA256
			config.Net.SASL.SCRAMClientGeneratorFunc = func() sarama.SCRAMClient { return &XDGSCRAMClient{HashGeneratorFcn: SHA256} }
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeSCRAMSHA256)
		case "gssapi": // Kerberos
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeGSSAPI)
			config.Net.SASL.GSSAPI.ServiceName = opts.serviceName
			config.Net.SASL.GSSAPI.KerberosConfigPath = opts.kerberosConfigPath
			config.Net.SASL.GSSAPI.Realm = opts.realm
			config.Net.SASL.GSSAPI.Username = opts.saslUsername
			// Kerberos auth type (keytab or user/password)
			if opts.kerberosAuthType == "keytabAuth" {
				config.Net.SASL.GSSAPI.AuthType = sarama.KRB5_KEYTAB_AUTH
				config.Net.SASL.GSSAPI.KeyTabPath = opts.keyTabPath
			} else {
				config.Net.SASL.GSSAPI.AuthType = sarama.KRB5_USER_AUTH
				config.Net.SASL.GSSAPI.Password = saslPassword
			}
			if opts.saslDisablePAFXFast {
				config.Net.SASL.GSSAPI.DisablePAFXFAST = true
			}
		case "awsiam": // AWS MSK IAM authentication
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeOAuth)
			config.Net.SASL.TokenProvider = &MSKAccessTokenProvider{region: opts.saslAwsRegion}
		case "plain": // PLAIN (basic auth)
		default:
			return nil, fmt.Errorf(
				`invalid sasl mechanism "%s": can only be "scram-sha256", "scram-sha512", "gssapi", "awsiam" or "plain"`,
				opts.saslMechanism,
			)
		}
		// Base SASL settings
		config.Net.SASL.Enable = true
		config.Net.SASL.Handshake = opts.useSASLHandshake
		if opts.saslUsername != "" {
			config.Net.SASL.User = opts.saslUsername
		}
		if saslPassword != "" {
			config.Net.SASL.Password = saslPassword
		}
	}
	// ========== TLS ==========
	if opts.useTLS {
		config.Net.TLS.Enable = true
		config.Net.TLS.Config = &tls.Config{
			ServerName:         opts.tlsServerName,            // server name used for certificate verification
			InsecureSkipVerify: opts.tlsInsecureSkipTLSVerify, // skip certificate verification (testing only)
		}
		// Load the CA certificate (custom CA)
		if opts.tlsCAFile != "" {
			if ca, err := os.ReadFile(opts.tlsCAFile); err == nil {
				config.Net.TLS.Config.RootCAs = x509.NewCertPool()
				config.Net.TLS.Config.RootCAs.AppendCertsFromPEM(ca)
			} else {
				return nil, err
			}
		}
		// Load the client certificate (mutual TLS)
		canReadCertAndKey, err := CanReadCertAndKey(opts.tlsCertFile, opts.tlsKeyFile)
		if err != nil {
			return nil, errors.Wrap(err, "error reading cert and key")
		}
		if canReadCertAndKey {
			cert, err := tls.LoadX509KeyPair(opts.tlsCertFile, opts.tlsKeyFile)
			if err == nil {
				config.Net.TLS.Config.Certificates = []tls.Certificate{cert}
			} else {
				return nil, err
			}
		}
	}
	// ========== ZooKeeper (legacy consumer-group lag) ==========
	if opts.useZooKeeperLag {
		klog.V(DEBUG).Infoln("Using zookeeper lag, so connecting to zookeeper")
		zookeeperClient, err = kazoo.NewKazoo(opts.uriZookeeper, nil)
		if err != nil {
			return nil, errors.Wrap(err, "error connecting to zookeeper")
		}
	}
	// ========== Metadata refresh ==========
	interval, err := time.ParseDuration(opts.metadataRefreshInterval)
	if err != nil {
		return nil, errors.Wrap(err, "Cannot parse metadata refresh interval")
	}
	config.Metadata.RefreshFrequency = interval // sarama metadata refresh frequency
	config.Metadata.AllowAutoTopicCreation = opts.allowAutoTopicCreation // keep auto topic creation disabled in production
	// ========== Create the Kafka client ==========
	client, err := sarama.NewClient(opts.uri, config)
	if err != nil {
		return nil, errors.Wrap(err, "Error Init Kafka Client")
	}
	klog.V(TRACE).Infoln("Done Init Clients")
	// ========== Build the Exporter ==========
	return &Exporter{
		client:                  client,
		topicFilter:             regexp.MustCompile(topicFilter), // precompile the regexes for performance
		topicExclude:            regexp.MustCompile(topicExclude),
		groupFilter:             regexp.MustCompile(groupFilter),
		groupExclude:            regexp.MustCompile(groupExclude),
		useZooKeeperLag:         opts.useZooKeeperLag,
		zookeeperClient:         zookeeperClient,
		nextMetadataRefresh:     time.Now(), // the first refresh is due immediately
		metadataRefreshInterval: interval,
		offsetShowAll:           opts.offsetShowAll,
		topicWorkers:            opts.topicWorkers,
		allowConcurrent:         opts.allowConcurrent,
		sgMutex:                 sync.Mutex{},
		sgWaitCh:                nil,
		sgChans:                 []chan<- prometheus.Metric{},
		consumerGroupFetchAll:   config.Version.IsAtLeast(sarama.V2_0_0_0), // Kafka 2.0+ can fetch all consumer groups
	}, nil
}
// fetchOffsetVersion: picks the OffsetFetchRequest version matching the broker's Kafka version
func (e *Exporter) fetchOffsetVersion() int16 {
	version := e.client.Config().Version
	if e.client.Config().Version.IsAtLeast(sarama.V2_0_0_0) {
		return 4 // Kafka 2.0+ uses version 4
	} else if version.IsAtLeast(sarama.V0_10_2_0) {
		return 2 // Kafka 0.10.2+ uses version 2
	} else if version.IsAtLeast(sarama.V0_8_2_2) {
		return 1 // Kafka 0.8.2+ uses version 1
	}
	return 0 // older versions use the default
}
Key points:
- Secret handling: if --sasl.password is empty, the SASL_USER_PASSWORD environment variable is used as a fallback, keeping the secret out of the process arguments (in the spirit of the 12-factor methodology)
- Regex precompilation: regexp.MustCompile compiles the filter expressions once at startup instead of on every scrape (a performance optimization)
- Error wrapping: errors.Wrap attaches context (e.g. "error connecting to zookeeper") to ease troubleshooting
- Version compatibility: protocol differences across Kafka versions are bridged by fetchOffsetVersion, keeping the exporter compatible across releases
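The version-gating pattern in fetchOffsetVersion can be sketched independently of sarama. Below, version is a simplified stand-in for sarama's KafkaVersion type (an assumption for illustration, not the real API), mapping a broker version to the highest OffsetFetchRequest protocol version it understands:

```go
package main

import "fmt"

// version is a simplified stand-in for sarama.KafkaVersion: major, minor, patch.
type version [3]int

// atLeast reports whether v >= other, comparing major/minor/patch in order.
func (v version) atLeast(other version) bool {
	for i := 0; i < 3; i++ {
		if v[i] != other[i] {
			return v[i] > other[i]
		}
	}
	return true
}

// fetchOffsetVersion mirrors the exporter's mapping from broker version
// to OffsetFetchRequest protocol version.
func fetchOffsetVersion(v version) int16 {
	switch {
	case v.atLeast(version{2, 0, 0}):
		return 4
	case v.atLeast(version{0, 10, 2}):
		return 2
	case v.atLeast(version{0, 8, 2}):
		return 1
	default:
		return 0
	}
}

func main() {
	fmt.Println(fetchOffsetVersion(version{2, 6, 0}))  // 4
	fmt.Println(fetchOffsetVersion(version{0, 11, 0})) // 2
	fmt.Println(fetchOffsetVersion(version{0, 8, 2}))  // 1
}
```

Ordering the cases from newest to oldest is what makes the switch correct: the first matching gate wins.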
3.6 Implementing the Prometheus Collector Interface
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L320
// Describe: emits every metric descriptor (part of the Prometheus Collector contract)
func (e *Exporter) Describe(ch chan<- *prometheus.Desc) {
	ch <- clusterBrokers
	ch <- topicCurrentOffset
	ch <- topicOldestOffset
	ch <- topicPartitions
	ch <- topicPartitionLeader
	ch <- topicPartitionReplicas
	ch <- topicPartitionInSyncReplicas
	ch <- topicPartitionUsesPreferredReplica
	ch <- topicUnderReplicatedPartition
	ch <- consumergroupCurrentOffset
	ch <- consumergroupCurrentOffsetSum
	ch <- consumergroupLag
	ch <- consumergroupLagZookeeper
	ch <- consumergroupLagSum
}
// Collect: the scrape entry point (Prometheus calls this to gather metric values)
func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
	if e.allowConcurrent {
		e.collect(ch) // concurrent mode (disable on large clusters)
		return
	}
	// Non-concurrent mode: all scrape requests share one collection run (avoids duplicate collection)
	e.sgMutex.Lock()
	e.sgChans = append(e.sgChans, ch)
	if len(e.sgChans) == 1 {
		e.sgWaitCh = make(chan struct{})
		go e.collectChans(e.sgWaitCh) // start the single collection run
	} else {
		klog.V(TRACE).Info("concurrent calls detected, waiting for first to finish")
	}
	waiter := e.sgWaitCh
	e.sgMutex.Unlock()
	<-waiter // block until the run finishes
}
// collectChans: fans the shared collection result out to all waiting scrapes (non-concurrent mode)
func (e *Exporter) collectChans(quit chan struct{}) {
	original := make(chan prometheus.Metric)
	container := make([]prometheus.Metric, 0, 100) // buffer the metrics so they can be replayed
	// Receive metrics asynchronously
	go func() {
		for metric := range original {
			container = append(container, metric)
		}
	}()
	// Run the collection
	e.collect(original)
	close(original)
	// Fan the metrics out to every waiting channel
	e.sgMutex.Lock()
	for _, ch := range e.sgChans {
		for _, metric := range container {
			ch <- metric
		}
	}
	e.sgChans = e.sgChans[:0] // reset the channel list
	close(quit)               // wake all waiting scrapes
	e.sgMutex.Unlock()
}
// collect: the actual collection logic
func (e *Exporter) collect(ch chan<- prometheus.Metric) {
	var wg = sync.WaitGroup{} // goroutine wait group
	// ========== 1. Broker metrics ==========
	ch <- prometheus.MustNewConstMetric(
		clusterBrokers, prometheus.GaugeValue, float64(len(e.client.Brokers())),
	)
	// Per-broker info
	for _, b := range e.client.Brokers() {
		ch <- prometheus.MustNewConstMetric(
			clusterBrokerInfo, prometheus.GaugeValue, 1, strconv.Itoa(int(b.ID())), b.Addr(),
		)
	}
	// ========== 2. Metadata refresh ==========
	offset := make(map[string]map[int32]int64) // cache of the newest offset per topic partition
	now := time.Now()
	if now.After(e.nextMetadataRefresh) {
		klog.V(DEBUG).Info("Refreshing client metadata")
		if err := e.client.RefreshMetadata(); err != nil {
			klog.Errorf("Cannot refresh topics, using cached data: %v", err)
		}
		e.nextMetadataRefresh = now.Add(e.metadataRefreshInterval) // schedule the next refresh
	}
	// ========== 3. Topic list ==========
	topics, err := e.client.Topics()
	if err != nil {
		klog.Errorf("Cannot get topics: %v", err)
		return
	}
	// ========== 4. Collect topic metrics concurrently ==========
	topicChannel := make(chan string) // channel distributing topics to the workers
	// getTopicMetrics: collects the metrics for a single topic
	getTopicMetrics := func(topic string) {
		defer wg.Done() // signal completion to the wait group
		// Filter the topic (include + exclude)
		if !e.topicFilter.MatchString(topic) || e.topicExclude.MatchString(topic) {
			return
		}
		// Fetch the topic's partitions
		partitions, err := e.client.Partitions(topic)
		if err != nil {
			klog.Errorf("Cannot get partitions of topic %s: %v", topic, err)
			return
		}
		// Partition count
		ch <- prometheus.MustNewConstMetric(
			topicPartitions, prometheus.GaugeValue, float64(len(partitions)), topic,
		)
		// Initialize the offset cache
		e.mu.Lock()
		offset[topic] = make(map[int32]int64, len(partitions))
		e.mu.Unlock()
		// Collect per-partition metrics
		for _, partition := range partitions {
			// Partition leader broker
			broker, err := e.client.Leader(topic, partition)
			if err != nil {
				klog.Errorf("Cannot get leader of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionLeader, prometheus.GaugeValue, float64(broker.ID()), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// Newest offset (OffsetNewest)
			currentOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
			if err != nil {
				klog.Errorf("Cannot get current offset of topic %s partition %d: %v", topic, partition, err)
			} else {
				e.mu.Lock()
				offset[topic][partition] = currentOffset // cache the offset
				e.mu.Unlock()
				ch <- prometheus.MustNewConstMetric(
					topicCurrentOffset, prometheus.GaugeValue, float64(currentOffset), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// Oldest offset (OffsetOldest)
			oldestOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetOldest)
			if err != nil {
				klog.Errorf("Cannot get oldest offset of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicOldestOffset, prometheus.GaugeValue, float64(oldestOffset), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// Replica count
			replicas, err := e.client.Replicas(topic, partition)
			if err != nil {
				klog.Errorf("Cannot get replicas of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionReplicas, prometheus.GaugeValue, float64(len(replicas)), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// In-sync replica (ISR) count
			inSyncReplicas, err := e.client.InSyncReplicas(topic, partition)
			if err != nil {
				klog.Errorf("Cannot get in-sync replicas of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionInSyncReplicas, prometheus.GaugeValue, float64(len(inSyncReplicas)), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// Is the preferred replica in use?
			if broker != nil && replicas != nil && len(replicas) > 0 && broker.ID() == replicas[0] {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionUsesPreferredReplica, prometheus.GaugeValue, float64(1), topic, strconv.FormatInt(int64(partition), 10),
				)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionUsesPreferredReplica, prometheus.GaugeValue, float64(0), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// Under-replicated? (ISR count < replica count)
			if replicas != nil && inSyncReplicas != nil && len(inSyncReplicas) < len(replicas) {
				ch <- prometheus.MustNewConstMetric(
					topicUnderReplicatedPartition, prometheus.GaugeValue, float64(1), topic, strconv.FormatInt(int64(partition), 10),
				)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicUnderReplicatedPartition, prometheus.GaugeValue, float64(0), topic, strconv.FormatInt(int64(partition), 10),
				)
			}
			// Consumer-group lag in ZooKeeper mode
			if e.useZooKeeperLag {
				ConsumerGroups, err := e.zookeeperClient.Consumergroups()
				if err != nil {
					klog.Errorf("Cannot get consumer group %v", err)
				}
				for _, group := range ConsumerGroups {
					offset, _ := group.FetchOffset(topic, partition)
					if offset > 0 {
						consumerGroupLag := currentOffset - offset
						ch <- prometheus.MustNewConstMetric(
							consumergroupLagZookeeper, prometheus.GaugeValue, float64(consumerGroupLag), group.Name, topic, strconv.FormatInt(int64(partition), 10),
						)
					}
				}
			}
		}
	}
	// loopTopics: worker loop draining the topic channel
	loopTopics := func() {
		ok := true
		for ok {
			topic, open := <-topicChannel
			ok = open
			if open {
				getTopicMetrics(topic)
			}
		}
	}
	// Bound the number of workers (avoid goroutine explosion)
	minx := func(x int, y int) int {
		if x < y {
			return x
		} else {
			return y
		}
	}
	N := len(topics)
	if N > 1 {
		N = minx(N/2, e.topicWorkers) // workers = min(topic count / 2, configured worker count)
	}
	// Start the topic workers
	for w := 1; w <= N; w++ {
		go loopTopics()
	}
	// Distribute the topics to the workers
	for _, topic := range topics {
		if e.topicFilter.MatchString(topic) && !e.topicExclude.MatchString(topic) {
			wg.Add(1)
			topicChannel <- topic
		}
	}
	close(topicChannel) // close the channel so the workers exit
	wg.Wait()           // wait for all topic collection to finish
	// ========== 5. Consumer-group metrics ==========
	getConsumerGroupMetrics := func(broker *sarama.Broker) {
		defer wg.Done()
		// Connect to the broker (group collection requires a direct broker connection)
		if err := broker.Open(e.client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
			klog.Errorf("Cannot connect to broker %d: %v", broker.ID(), err)
			return
		}
		defer broker.Close() // always close the connection
		// List all consumer groups
		groups, err := broker.ListGroups(&sarama.ListGroupsRequest{})
		if err != nil {
			klog.Errorf("Cannot get consumer group: %v", err)
			return
		}
		// Filter the groups
		groupIds := make([]string, 0)
		for groupId := range groups.Groups {
			if e.groupFilter.MatchString(groupId) && !e.groupExclude.MatchString(groupId) {
				groupIds = append(groupIds, groupId)
			}
		}
		// Describe the groups (members, assigned topics/partitions)
		describeGroups, err := broker.DescribeGroups(&sarama.DescribeGroupsRequest{Groups: groupIds})
		if err != nil {
			klog.Errorf("Cannot get describe groups: %v", err)
			return
		}
		// Collect metrics per group
		for _, group := range describeGroups.Groups {
			if group.Err != 0 {
				klog.Errorf("Cannot describe for the group %s with error code %d", group.GroupId, group.Err)
				continue
			}
			// Build the OffsetFetchRequest (fetches the group's offsets)
			offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: e.fetchOffsetVersion()}
			if e.offsetShowAll {
				// Include every topic/partition (even ones the group never consumed)
				for topic, partitions := range offset {
					for partition := range partitions {
						offsetFetchRequest.AddPartition(topic, partition)
					}
				}
			} else {
				// Only the topics/partitions assigned to the group's members
				for _, member := range group.Members {
					if len(member.MemberAssignment) == 0 {
						klog.Warningf("MemberAssignment is empty for group member: %v in group: %v", member.MemberId, group.GroupId)
						continue
					}
					// Parse the member's assigned topics/partitions
					assignment, err := member.GetMemberAssignment()
					if err != nil {
						klog.Errorf("Cannot get GetMemberAssignment of group member %v : %v", member, err)
						continue
					}
					for topic, partions := range assignment.Topics {
						for _, partition := range partions {
							offsetFetchRequest.AddPartition(topic, partition)
						}
					}
				}
			}
			// Group member count
			ch <- prometheus.MustNewConstMetric(
				consumergroupMembers, prometheus.GaugeValue, float64(len(group.Members)), group.GroupId,
			)
			// Fetch the group's offsets
			offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)
			if err != nil {
				klog.Errorf("Cannot get offset of group %s: %v", group.GroupId, err)
				continue
			}
			// Iterate the group's topics/partitions
			for topic, partitions := range offsetFetchResponse.Blocks {
				// Skip topics the group never consumed
				topicConsumed := false
				for _, offsetFetchResponseBlock := range partitions {
					if offsetFetchResponseBlock.Offset != -1 { // Kafka returns -1 when there is no committed offset
						topicConsumed = true
						break
					}
				}
				if !topicConsumed {
					continue
				}
				// Accumulate the per-topic sums
				var currentOffsetSum int64
				var lagSum int64
				for partition, offsetFetchResponseBlock := range partitions {
					err := offsetFetchResponseBlock.Err
					if err != sarama.ErrNoError {
						klog.Errorf("Error for partition %d :%v", partition, err.Error())
						continue
					}
					// The group's current offset
					currentOffset := offsetFetchResponseBlock.Offset
					currentOffsetSum += currentOffset
					ch <- prometheus.MustNewConstMetric(
						consumergroupCurrentOffset, prometheus.GaugeValue, float64(currentOffset), group.GroupId, topic, strconv.FormatInt(int64(partition), 10),
					)
					// Compute the lag (newest offset - group offset)
					e.mu.Lock()
					currentPartitionOffset, currentPartitionOffsetError := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
					if currentPartitionOffsetError != nil {
						klog.Errorf("Cannot get current offset of topic %s partition %d: %v", topic, partition, currentPartitionOffsetError)
					} else {
						var lag int64
						if offsetFetchResponseBlock.Offset == -1 {
							lag = -1 // no committed offset, so lag is -1
						} else {
							if offset, ok := offset[topic][partition]; ok {
								if currentPartitionOffset == -1 {
									currentPartitionOffset = offset
								}
							}
							lag = currentPartitionOffset - offsetFetchResponseBlock.Offset
							lagSum += lag
						}
						// Emit the lag metric
						ch <- prometheus.MustNewConstMetric(
							consumergroupLag, prometheus.GaugeValue, float64(lag), group.GroupId, topic, strconv.FormatInt(int64(partition), 10),
						)
					}
					e.mu.Unlock()
				}
				// Emit the per-topic sums
				ch <- prometheus.MustNewConstMetric(
					consumergroupCurrentOffsetSum, prometheus.GaugeValue, float64(currentOffsetSum), group.GroupId, topic,
				)
				ch <- prometheus.MustNewConstMetric(
					consumergroupLagSum, prometheus.GaugeValue, float64(lagSum), group.GroupId, topic,
				)
			}
		}
	}
	// Collect consumer-group metrics (iterating over the brokers)
	klog.V(DEBUG).Info("Fetching consumer group metrics")
	if len(e.client.Brokers()) > 0 {
		// Deduplicate broker addresses (avoid collecting the same groups twice)
		uniqueBrokerAddresses := make(map[string]bool)
		var servers []string
		for _, broker := range e.client.Brokers() {
			normalizedAddress := strings.ToLower(broker.Addr())
			if !uniqueBrokerAddresses[normalizedAddress] {
				uniqueBrokerAddresses[normalizedAddress] = true
				servers = append(servers, broker.Addr())
			}
		}
		klog.Info(servers)
		// Iterate the brokers and collect the groups
		for _, broker := range e.client.Brokers() {
			for _, server := range servers {
				if server == broker.Addr() {
					wg.Add(1)
					go getConsumerGroupMetrics(broker)
				}
			}
		}
		wg.Wait()
	} else {
		klog.Errorln("No valid broker, cannot get consumer group metrics")
	}
}
Key points:
- Collect/Describe are the two methods of the Prometheus Collector interface: Describe exposes the metric metadata, Collect gathers the values
- Concurrency control:
  - Non-concurrent mode: all Prometheus scrapes share a single collection run (no duplicate requests to Kafka, less cluster load)
  - Concurrent mode: every scrape collects independently (fine for small clusters)
- Worker pool: topic collection runs through a bounded pool (topicWorkers), preventing goroutine explosion on large clusters
- Consumer-group lag: lag = newest offset - group offset, the core measure of Kafka consumer delay
- Broker deduplication: broker addresses are deduplicated before group collection, since multiple brokers in a cluster return the same group information
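The lag arithmetic from the bullets above, including the -1 "no committed offset" sentinel, can be isolated into a tiny helper (a sketch for illustration, not a function that exists in the exporter):

```go
package main

import "fmt"

// consumerLag computes lag = newest offset - committed offset.
// Kafka reports -1 when a group has no committed offset for a partition,
// and the exporter propagates that case as a lag of -1.
func consumerLag(newestOffset, committedOffset int64) int64 {
	if committedOffset == -1 {
		return -1
	}
	return newestOffset - committedOffset
}

func main() {
	fmt.Println(consumerLag(1500, 1200)) // 300: the group is 300 messages behind
	fmt.Println(consumerLag(1500, 1500)) // 0: fully caught up
	fmt.Println(consumerLag(1500, -1))   // -1: nothing committed yet
}
```

Keeping -1 as a distinct value (rather than treating it as 0) matters for alerting: "never consumed" and "fully caught up" are very different states.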
3.7 init and main
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L716 (location of init())
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L757 (location of main())
// init: package initialization (runs before main)
func init() {
	metrics.UseNilMetrics = true // disable go-metrics (avoid the overhead)
	prometheus.MustRegister(versionCollector.NewCollector("kafka_exporter")) // register the version metric
}
// Flag helper functions (keep flag and kingpin in sync)
func toFlagString(name string, help string, value string) *string {
flag.CommandLine.String(name, value, help) // 兼容原生flag
return kingpin.Flag(name, help).Default(value).String()
}
func toFlagBool(name string, help string, value bool, valueString string) *bool {
flag.CommandLine.Bool(name, value, help)
return kingpin.Flag(name, help).Default(valueString).Bool()
}
func toFlagStringsVar(name string, help string, value string, target *[]string) {
flag.CommandLine.String(name, value, help)
kingpin.Flag(name, help).Default(value).StringsVar(target)
}
func toFlagStringVar(name string, help string, value string, target *string) {
flag.CommandLine.String(name, value, help)
kingpin.Flag(name, help).Default(value).StringVar(target)
}
func toFlagBoolVar(name string, help string, value bool, valueString string, target *bool) {
flag.CommandLine.Bool(name, value, help)
kingpin.Flag(name, help).Default(valueString).BoolVar(target)
}
func toFlagIntVar(name string, help string, value int, valueString string, target *int) {
flag.CommandLine.Int(name, value, help)
kingpin.Flag(name, help).Default(valueString).IntVar(target)
}
// main: program entry point
func main() {
	var (
		listenAddress = toFlagString("web.listen-address", "Address to listen on for web interface and telemetry.", ":9308")
		metricsPath   = toFlagString("web.telemetry-path", "Path under which to expose metrics.", "/metrics")
		topicFilter   = toFlagString("topic.filter", "Regex that determines which topics to collect.", ".*")
		topicExclude  = toFlagString("topic.exclude", "Regex that determines which topics to exclude.", "^$")
		groupFilter   = toFlagString("group.filter", "Regex that determines which consumer groups to collect.", ".*")
		groupExclude  = toFlagString("group.exclude", "Regex that determines which consumer groups to exclude.", "^$")
		logSarama     = toFlagBool("log.enable-sarama", "Turn on Sarama logging, default is false.", false, "false")
		opts          = kafkaOpts{} // the aggregated Kafka configuration
	)
	// Bind the command-line flags to the opts struct
toFlagStringsVar("kafka.server", "Address (host:port) of Kafka server.", "kafka:9092", &opts.uri)
toFlagBoolVar("sasl.enabled", "Connect using SASL/PLAIN, default is false.", false, "false", &opts.useSASL)
toFlagBoolVar("sasl.handshake", "Only set this to false if using a non-Kafka SASL proxy, default is true.", true, "true", &opts.useSASLHandshake)
toFlagStringVar("sasl.username", "SASL user name.", "", &opts.saslUsername)
toFlagStringVar("sasl.password", "SASL user password.", "", &opts.saslPassword)
toFlagStringVar("sasl.aws-region", "The AWS region for IAM SASL authentication", os.Getenv("AWS_REGION"), &opts.saslAwsRegion)
toFlagStringVar("sasl.mechanism", "SASL SCRAM SHA algorithm: sha256 or sha512 or SASL mechanism: gssapi or awsiam", "", &opts.saslMechanism)
toFlagStringVar("sasl.service-name", "Service name when using kerberos Auth", "", &opts.serviceName)
toFlagStringVar("sasl.kerberos-config-path", "Kerberos config path", "", &opts.kerberosConfigPath)
toFlagStringVar("sasl.realm", "Kerberos realm", "", &opts.realm)
toFlagStringVar("sasl.kerberos-auth-type", "Kerberos auth type. Either 'keytabAuth' or 'userAuth'", "", &opts.kerberosAuthType)
toFlagStringVar("sasl.keytab-path", "Kerberos keytab file path", "", &opts.keyTabPath)
toFlagBoolVar("sasl.disable-PA-FX-FAST", "Configure the Kerberos client to not use PA_FX_FAST, default is false.", false, "false", &opts.saslDisablePAFXFast)
toFlagBoolVar("tls.enabled", "Connect to Kafka using TLS, default is false.", false, "false", &opts.useTLS)
toFlagStringVar("tls.server-name", "Used to verify the hostname on the returned certificates unless tls.insecure-skip-tls-verify is given. The kafka server's name should be given.", "", &opts.tlsServerName)
toFlagStringVar("tls.ca-file", "The optional certificate authority file for Kafka TLS client authentication.", "", &opts.tlsCAFile)
toFlagStringVar("tls.cert-file", "The optional certificate file for Kafka client authentication.", "", &opts.tlsCertFile)
toFlagStringVar("tls.key-file", "The optional key file for Kafka client authentication.", "", &opts.tlsKeyFile)
toFlagBoolVar("server.tls.enabled", "Enable TLS for web server, default is false.", false, "false", &opts.serverUseTLS)
toFlagBoolVar("server.tls.mutual-auth-enabled", "Enable TLS client mutual authentication, default is false.", false, "false", &opts.serverMutualAuthEnabled)
toFlagStringVar("server.tls.ca-file", "The certificate authority file for the web server.", "", &opts.serverTlsCAFile)
toFlagStringVar("server.tls.cert-file", "The certificate file for the web server.", "", &opts.serverTlsCertFile)
toFlagStringVar("server.tls.key-file", "The key file for the web server.", "", &opts.serverTlsKeyFile)
toFlagBoolVar("tls.insecure-skip-tls-verify", "If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure. Default is false", false, "false", &opts.tlsInsecureSkipTLSVerify)
toFlagStringVar("kafka.version", "Kafka broker version", sarama.V2_0_0_0.String(), &opts.kafkaVersion)
toFlagBoolVar("use.consumelag.zookeeper", "if you need to use a group from zookeeper, default is false", false, "false", &opts.useZooKeeperLag)
toFlagStringsVar("zookeeper.server", "Address (hosts) of zookeeper server.", "localhost:2181", &opts.uriZookeeper)
toFlagStringVar("kafka.labels", "Kafka cluster name", "", &opts.labels)
toFlagStringVar("refresh.metadata", "Metadata refresh interval", "30s", &opts.metadataRefreshInterval)
toFlagBoolVar("offset.show-all", "Whether show the offset/lag for all consumer group, otherwise, only show connected consumer groups, default is true", true, "true", &opts.offsetShowAll)
toFlagBoolVar("concurrent.enable", "If true, all scrapes will trigger kafka operations otherwise, they will share results. WARN: This should be disabled on large clusters. Default is false", false, "false", &opts.allowConcurrent)
toFlagIntVar("topic.workers", "Number of topic workers", 100, "100", &opts.topicWorkers)
toFlagBoolVar("kafka.allow-auto-topic-creation", "If true, the broker may auto-create topics that we requested which do not already exist, default is false.", false, "false", &opts.allowAutoTopicCreation)
toFlagIntVar("verbosity", "Verbosity log level", 0, "0", &opts.verbosityLogLevel)
plConfig := plog.Config{}
plogflag.AddFlags(kingpin.CommandLine, &plConfig) // bind the Prometheus log configuration
kingpin.Version(version.Print("kafka_exporter")) // version information
kingpin.HelpFlag.Short('h') // short flag -h
kingpin.Parse() // parse command-line arguments
// Parse custom labels (e.g. cluster=kafka-prod)
labels := make(map[string]string)
if opts.labels != "" {
for _, label := range strings.Split(opts.labels, ",") {
splitted := strings.Split(label, "=")
if len(splitted) >= 2 {
labels[splitted[0]] = splitted[1]
}
}
}
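One subtlety in the parsing above: `strings.Split(label, "=")` followed by `splitted[1]` silently drops everything after a second `=` in the value. A self-contained sketch using `strings.SplitN` keeps the full value; `parseLabels` is a hypothetical helper for illustration, not a function in the exporter:

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabels parses "k1=v1,k2=v2" into a map. SplitN with a limit of 2
// preserves values that themselves contain '=' (e.g. base64 padding),
// which the plain Split variant in the excerpt would truncate.
func parseLabels(s string) map[string]string {
	labels := make(map[string]string)
	if s == "" {
		return labels
	}
	for _, kv := range strings.Split(s, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 {
			labels[parts[0]] = parts[1]
		}
	}
	return labels
}

func main() {
	m := parseLabels("cluster=kafka-prod,env=prod,token=abc=def")
	fmt.Println(m["cluster"]) // kafka-prod
	fmt.Println(m["token"])   // abc=def
}
```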
// Start the exporter
setup(*listenAddress, *metricsPath, *topicFilter, *topicExclude, *groupFilter, *groupExclude, *logSarama, opts, labels)
}
// setup initializes the exporter and starts the HTTP service
func setup(
listenAddress string,
metricsPath string,
topicFilter string,
topicExclude string,
groupFilter string,
groupExclude string,
logSarama bool,
opts kafkaOpts,
labels map[string]string,
) {
// Initialize klog
klog.InitFlags(flag.CommandLine)
if err := flag.Set("logtostderr", "true"); err != nil {
klog.Errorf("Error on setting logtostderr to true: %v", err)
}
err := flag.Set("v", strconv.Itoa(opts.verbosityLogLevel))
if err != nil {
klog.Errorf("Error on setting v to %v: %v", strconv.Itoa(opts.verbosityLogLevel), err)
}
defer klog.Flush() // make sure buffered logs are flushed on exit
// Print startup information
klog.V(INFO).Infoln("Starting kafka_exporter", version.Info())
klog.V(DEBUG).Infoln("Build context", version.BuildContext())
// ========== Initialize Prometheus metric descriptors ==========
clusterBrokers = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "brokers"),
"Number of Brokers in the Kafka Cluster.",
nil, labels,
)
clusterBrokerInfo = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "broker_info"),
"Information about the Kafka Broker.",
[]string{"id", "address"}, labels,
)
topicPartitions = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partitions"),
"Number of partitions for this Topic",
[]string{"topic"}, labels,
)
topicCurrentOffset = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_current_offset"),
"Current Offset of a Broker at Topic/Partition",
[]string{"topic", "partition"}, labels,
)
topicOldestOffset = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_oldest_offset"),
"Oldest Offset of a Broker at Topic/Partition",
[]string{"topic", "partition"}, labels,
)
topicPartitionLeader = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_leader"),
"Leader Broker ID of this Topic/Partition",
[]string{"topic", "partition"}, labels,
)
topicPartitionReplicas = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_replicas"),
"Number of Replicas for this Topic/Partition",
[]string{"topic", "partition"}, labels,
)
topicPartitionInSyncReplicas = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_in_sync_replica"),
"Number of In-Sync Replicas for this Topic/Partition",
[]string{"topic", "partition"}, labels,
)
topicPartitionUsesPreferredReplica = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_leader_is_preferred"),
"1 if Topic/Partition is using the Preferred Broker",
[]string{"topic", "partition"}, labels,
)
topicUnderReplicatedPartition = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "topic", "partition_under_replicated_partition"),
"1 if Topic/Partition is under Replicated",
[]string{"topic", "partition"}, labels,
)
consumergroupCurrentOffset = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "consumergroup", "current_offset"),
"Current Offset of a ConsumerGroup at Topic/Partition",
[]string{"consumergroup", "topic", "partition"}, labels,
)
consumergroupCurrentOffsetSum = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "consumergroup", "current_offset_sum"),
"Current Offset of a ConsumerGroup at Topic for all partitions",
[]string{"consumergroup", "topic"}, labels,
)
consumergroupLag = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "consumergroup", "lag"),
"Current Approximate Lag of a ConsumerGroup at Topic/Partition",
[]string{"consumergroup", "topic", "partition"}, labels,
)
consumergroupLagZookeeper = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "consumergroupzookeeper", "lag_zookeeper"),
"Current Approximate Lag(zookeeper) of a ConsumerGroup at Topic/Partition",
[]string{"consumergroup", "topic", "partition"}, nil,
)
consumergroupLagSum = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "consumergroup", "lag_sum"),
"Current Approximate Lag of a ConsumerGroup at Topic for all partitions",
[]string{"consumergroup", "topic"}, labels,
)
consumergroupMembers = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "consumergroup", "members"),
"Amount of members in a consumer group",
[]string{"consumergroup"}, labels,
)
// Enable sarama logging (for debugging)
if logSarama {
sarama.Logger = log.New(os.Stdout, "[sarama] ", log.LstdFlags)
}
// Create the Exporter instance
exporter, err := NewExporter(opts, topicFilter, topicExclude, groupFilter, groupExclude)
if err != nil {
klog.Fatalln(err)
}
defer exporter.client.Close() // ensure the Kafka client is closed
prometheus.MustRegister(exporter) // register the Exporter with Prometheus
// ========== Configure HTTP routes ==========
http.Handle(metricsPath, promhttp.Handler()) // metrics endpoint
// Landing page (links to the metrics endpoint)
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
_, err := w.Write([]byte(`<html>
<head><title>Kafka Exporter</title></head>
<body>
<h1>Kafka Exporter</h1>
<p><a href='` + metricsPath + `'>Metrics</a></p>
</body>
</html>`))
if err != nil {
klog.Error("Error handle / request", err)
}
})
// Health-check endpoint
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
_, err := w.Write([]byte("ok"))
if err != nil {
klog.Error("Error handle /healthz request", err)
}
})
// ========== Start the HTTP/HTTPS server ==========
if opts.serverUseTLS {
klog.V(INFO).Infoln("Listening on HTTPS", listenAddress)
// Validate the server certificate and key
_, err := CanReadCertAndKey(opts.serverTlsCertFile, opts.serverTlsKeyFile)
if err != nil {
klog.Error("error reading server cert and key")
}
// Configure mutual TLS authentication
clientAuthType := tls.NoClientCert
if opts.serverMutualAuthEnabled {
clientAuthType = tls.RequireAndVerifyClientCert
}
// Load the CA certificate
certPool := x509.NewCertPool()
if opts.serverTlsCAFile != "" {
if caCert, err := os.ReadFile(opts.serverTlsCAFile); err == nil {
certPool.AppendCertsFromPEM(caCert)
} else {
klog.Error("error reading server ca")
}
}
// TLS configuration (security best practices)
tlsConfig := &tls.Config{
ClientCAs: certPool,
ClientAuth: clientAuthType,
MinVersion: tls.VersionTLS12, // disable legacy TLS versions
CurvePreferences: []tls.CurveID{tls.CurveP521, tls.CurveP384, tls.CurveP256}, // secure curves
PreferServerCipherSuites: true,
CipherSuites: []uint16{ // secure cipher suites
tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
tls.TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,
tls.TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,
tls.TLS_RSA_WITH_AES_256_GCM_SHA384,
tls.TLS_RSA_WITH_AES_256_CBC_SHA,
tls.TLS_RSA_WITH_AES_128_CBC_SHA256,
},
}
// Start the HTTPS server
server := &http.Server{
Addr: listenAddress,
TLSConfig: tlsConfig,
}
klog.Fatal(server.ListenAndServeTLS(opts.serverTlsCertFile, opts.serverTlsKeyFile))
} else {
// Start the HTTP server
klog.V(INFO).Infoln("Listening on HTTP", listenAddress)
klog.Fatal(http.ListenAndServe(listenAddress, nil))
}
}
Key points:
- Flag handling: compatible with both the native flag package and kingpin, accommodating different usage habits
- Metric naming: prometheus.BuildFQName produces canonical metric names (e.g. kafka_topic_partitions)
- Health check: /healthz is the standard cloud-native endpoint, suitable for Kubernetes liveness probes
- TLS best practices: legacy TLS versions disabled, secure cipher suites, optional mutual authentication (essential in production)
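The naming rule itself is simple: prometheus.BuildFQName joins the non-empty parts of namespace, subsystem, and name with underscores. A tiny stdlib re-implementation (illustrative only, not the library's code) makes the resulting metric names explicit:

```go
package main

import (
	"fmt"
	"strings"
)

// buildFQName mirrors the behavior of prometheus.BuildFQName: the non-empty
// parts of (namespace, subsystem, name) are joined with underscores.
func buildFQName(namespace, subsystem, name string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	fmt.Println(buildFQName("kafka", "", "brokers"))          // kafka_brokers
	fmt.Println(buildFQName("kafka", "topic", "partitions"))  // kafka_topic_partitions
	fmt.Println(buildFQName("kafka", "consumergroup", "lag")) // kafka_consumergroup_lag
}
```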
四、Core Design Summary
4.1、Core Flow: Metric Collection
The Collect method is the heart of metric collection; it gathers all metrics in the order Broker → Topic → Consumer Group:
4.1.1、Broker Metrics
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L387
// Collect the total broker count
ch <- prometheus.MustNewConstMetric(
clusterBrokers, prometheus.GaugeValue, float64(len(e.client.Brokers())),
)
// Collect each broker's ID and address
for _, b := range e.client.Brokers() {
ch <- prometheus.MustNewConstMetric(
clusterBrokerInfo, prometheus.GaugeValue, 1, strconv.Itoa(int(b.ID())), b.Addr(),
)
}
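These channel sends work because Exporter implements Prometheus's Collector interface (Describe/Collect), so every scrape of /metrics invokes Collect. A minimal stdlib sketch of that contract, with strings and float64 values standing in for the real *Desc and Metric types:

```go
package main

import "fmt"

// Minimal stand-in for the prometheus.Collector contract: Describe announces
// metric descriptors at registration time, Collect emits fresh samples on
// every scrape. The real Exporter walks brokers, topics, and consumer groups
// inside Collect.
type collector interface {
	Describe(ch chan<- string)
	Collect(ch chan<- float64)
}

type exporter struct{ brokerCount int }

func (e *exporter) Describe(ch chan<- string) { ch <- "kafka_brokers" }
func (e *exporter) Collect(ch chan<- float64) { ch <- float64(e.brokerCount) }

// Compile-time check that exporter satisfies the interface.
var _ collector = (*exporter)(nil)

func main() {
	e := &exporter{brokerCount: 3}
	samples := make(chan float64, 1)
	e.Collect(samples)
	fmt.Println(<-samples) // 3
}
```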
4.1.2、Topic Metrics (Worker-Pool Optimization)
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L418
Topic collection is performance-sensitive, so Kafka Exporter uses a worker pool to cap concurrency and avoid goroutine explosion:
// Topic dispatch channel
topicChannel := make(chan string)
// Per-topic collection logic
getTopicMetrics := func(topic string) {
defer wg.Done()
// Filter topics (include + exclude regexes)
if !e.topicFilter.MatchString(topic) || e.topicExclude.MatchString(topic) {
return
}
// Fetch the topic's partition list
partitions, err := e.client.Partitions(topic)
// Collect the partition count
ch <- prometheus.MustNewConstMetric(
topicPartitions, prometheus.GaugeValue, float64(len(partitions)), topic,
)
// Iterate partitions to collect offsets, replicas, ISR, under-replication, etc.
for _, partition := range partitions {
// Newest offset
currentOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
// Oldest offset
oldestOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetOldest)
// Replica count
replicas, err := e.client.Replicas(topic, partition)
// In-sync replica count
inSyncReplicas, err := e.client.InSyncReplicas(topic, partition)
// Under-replicated check (ISR count < replica count)
if len(inSyncReplicas) < len(replicas) {
ch <- prometheus.MustNewConstMetric(
topicUnderReplicatedPartition, prometheus.GaugeValue, 1, topic, strconv.FormatInt(int64(partition), 10),
)
}
// other topic metrics...
}
}
// Worker pool: start N goroutines to process topics
N := minx(len(topics)/2, e.topicWorkers) // worker count = min(topic count / 2, configured workers)
for w := 1; w <= N; w++ {
go loopTopics() // drain the topic channel in a loop
}
// Dispatch topics to the workers
for _, topic := range topics {
wg.Add(1)
topicChannel <- topic
}
Two optimizations worth noting:
- Precompiled regexes: topicFilter/topicExclude are compiled once when the Exporter is initialized, avoiding recompilation on every scrape
- Bounded concurrency: the topic.workers flag caps the number of goroutines (default 100) and can be tuned to the cluster size
4.1.3、Consumer Group Metrics (Core: Lag Calculation)
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L565
Consumer group lag is the core Kafka monitoring metric. The formula is: Lag = newest partition offset - consumer group committed offset
getConsumerGroupMetrics := func(broker *sarama.Broker) {
defer wg.Done()
// Connect to the broker (consumer group info requires a direct broker connection)
if err := broker.Open(e.client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
return
}
defer broker.Close()
// List all consumer groups and filter them
groups, err := broker.ListGroups(&sarama.ListGroupsRequest{})
groupIds := make([]string, 0)
for groupId := range groups.Groups {
if e.groupFilter.MatchString(groupId) && !e.groupExclude.MatchString(groupId) {
groupIds = append(groupIds, groupId)
}
}
// Describe the filtered groups (this call is elided in the excerpt; restored so the loop below compiles)
describeGroups, err := broker.DescribeGroups(&sarama.DescribeGroupsRequest{Groups: groupIds})
// Iterate consumer groups and collect metrics
for _, group := range describeGroups.Groups {
// Build an OffsetFetchRequest to fetch the group's committed offsets
offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: e.fetchOffsetVersion()}
// Fetch the group's offset response
offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)
// Iterate the group's topics/partitions and compute lag
for topic, partitions := range offsetFetchResponse.Blocks {
var lagSum int64
for partition, offsetBlock := range partitions {
// The group's committed offset
currentOffset := offsetBlock.Offset
// Newest offset of the topic partition
currentPartitionOffset, _ := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
// Compute the lag
lag := currentPartitionOffset - currentOffset
lagSum += lag
// Expose the per-partition lag metric
ch <- prometheus.MustNewConstMetric(
consumergroupLag, prometheus.GaugeValue, float64(lag), group.GroupId, topic, strconv.FormatInt(int64(partition), 10),
)
}
// Expose the topic-level lag sum
ch <- prometheus.MustNewConstMetric(
consumergroupLagSum, prometheus.GaugeValue, float64(lagSum), group.GroupId, topic,
)
}
}
}
Compatibility:
- ZooKeeper mode: supports legacy Kafka (0.8.x) by reading consumer group offsets from ZooKeeper
- Native mode: uses the ListGroups/DescribeGroups/FetchOffset APIs of Kafka 2.0+, with better performance
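A worked example of the lag formula: with the newest partition offset at 1000 and the group's committed offset at 950, the lag is 50. The clamp to zero below is a defensive assumption of this sketch (the two offsets are fetched at slightly different moments, so a small negative difference can appear), not necessarily the exporter's exact behavior:

```go
package main

import "fmt"

// lag applies the formula: newest partition offset minus the consumer
// group's committed offset. Negative results are clamped to zero as a
// defensive choice for this sketch.
func lag(newestOffset, committedOffset int64) int64 {
	l := newestOffset - committedOffset
	if l < 0 {
		return 0
	}
	return l
}

func main() {
	fmt.Println(lag(1000, 950))  // 50
	fmt.Println(lag(1000, 1005)) // 0 (committed ahead of a stale newest-offset read)
}
```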
4.1.4、Authentication: Multi-Scenario Support
Kafka Exporter supports multiple authentication methods, all configured through sarama.Config:
4.1.4.1、SASL/PLAIN Authentication
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L219
config.Net.SASL.Enable = true
config.Net.SASL.User = opts.saslUsername
config.Net.SASL.Password = opts.saslPassword
config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypePlaintext)
4.1.4.2、AWS MSK IAM Authentication
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L209
config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeOAuth)
config.Net.SASL.TokenProvider = &MSKAccessTokenProvider{region: opts.saslAwsRegion}
4.1.4.3、TLS Authentication
https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L232
config.Net.TLS.Enable = true
config.Net.TLS.Config = &tls.Config{
ServerName: opts.tlsServerName,
InsecureSkipVerify: opts.tlsInsecureSkipTLSVerify, // must stay false in production
}
// Load the CA certificate
if opts.tlsCAFile != "" {
if ca, err := os.ReadFile(opts.tlsCAFile); err == nil {
config.Net.TLS.Config.RootCAs = x509.NewCertPool()
config.Net.TLS.Config.RootCAs.AppendCertsFromPEM(ca)
} else {
klog.Fatalln(err)
}
}
That's all for now; the remaining modules will follow when time allows....