【左扬精讲】| kafka_exporter v1.9.0 源码专题 | 架构设计与实现原理全解析

https://github.com/danielqsj/kafka_exporter/releases/tag/v1.9.0

        Kafka 作为分布式消息队列的事实标准,其可观测性是保障集群稳定运行的核心。Kafka Exporter 作为 Prometheus 生态中最主流的 Kafka 指标采集工具,能够全面采集集群、Topic、消费组等核心指标。本文将从源码角度深度剖析 Kafka Exporter 的架构设计与实现原理,帮助开发者理解其核心逻辑,快速上手二次开发或问题排查。 

一、Kafka Exporter 核心定位 

Kafka Exporter 的核心目标是遵循 Prometheus 规范采集 Kafka 集群指标,并通过 HTTP/HTTPS 暴露指标接口,核心能力包括:

    • 集群层:Broker 数量、Broker 信息等基础指标
    • Topic 层:分区数、Offset、副本数、ISR、欠复制等指标
    • 消费组层:Offset、消费延迟(Lag)、成员数等核心指标
    • 多认证支持:SASL/PLAIN、SCRAM、Kerberos、AWS IAM、TLS
    • 兼容适配:支持 ZooKeeper 模式(老版本 Kafka)和新版本消费组 API

二、代码整体架构设计

Kafka Exporter 的架构遵循“模块化、接口化、配置驱动”的设计原则,整体可分为 5 大核心模块,架构图如下:

┌─────────────────────────────────────────────────────────────┐
│                     命令行参数解析层                        │
│ (kingpin/flag) 处理配置 → 生成kafkaOpts配置结构体            │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                     Kafka客户端层                           │
│ (sarama) 封装Kafka连接、认证、元数据管理                    │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                     指标采集核心层                          │
│ (Exporter结构体) 实现Prometheus Collector接口,采集所有指标  │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                     并发控制层                              │
│ (协程池/互斥锁) 控制采集并发,避免集群压力                  │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                     HTTP服务层                              │
│ (net/http) 暴露/metrics接口、健康检查、HTTPS支持            │
└─────────────────────────────────────────────────────────────┘

架构设计要点:

    • 模块化:参数解析、Kafka 客户端、指标采集、并发控制、HTTP 服务五大模块职责清晰、边界明确
    • 接口化:核心 Exporter 结构体实现 Prometheus 的Collector接口,天然兼容 Prometheus 生态
    • 配置驱动:所有行为通过命令行参数控制,无硬编码,适配不同环境
    • 性能优先:元数据缓存、协程池、非并发模式下共享采集结果等优化,适配大集群
    • 兼容适配:支持 Kafka 多版本协议、多种认证方式、ZooKeeper / 新版本消费组双模式

三、代码核心模块源码解读

3.1、包导入与常量定义

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L45

package main

import (
	"context"          // 上下文管理,用于AWS IAM token生成
	"crypto/tls"       // TLS加密
	"crypto/x509"      // X.509证书解析
	"flag"             // 命令行参数解析(基础)
	"fmt"              // 格式化输出
	"log"              // 基础日志
	"net/http"         // HTTP服务
	"os"               // 操作系统交互
	"regexp"           // 正则表达式(过滤topic/消费组)
	"strconv"          // 字符串/数值转换
	"strings"          // 字符串处理
	"sync"             // 同步原语(锁、WaitGroup)
	"time"             // 时间处理

	"github.com/IBM/sarama"               // Kafka Go客户端核心库
	kingpin "github.com/alecthomas/kingpin/v2" // 增强型命令行参数解析
	"github.com/aws/aws-msk-iam-sasl-signer-go/signer" // AWS MSK IAM认证
	"github.com/krallistic/kazoo-go"      // ZooKeeper客户端(兼容老版本消费组)
	"github.com/pkg/errors"               // 错误包装(增强错误信息)
	"github.com/prometheus/client_golang/prometheus" // Prometheus指标核心库
	"github.com/prometheus/client_golang/prometheus/promhttp" // Prometheus HTTP handler
	plog "github.com/prometheus/common/promlog" // Prometheus日志配置
	plogflag "github.com/prometheus/common/promlog/flag" // Prometheus日志命令行参数

	versionCollector "github.com/prometheus/client_golang/prometheus/collectors/version" // 版本指标采集器
	"github.com/prometheus/common/version" // 版本信息
	"github.com/rcrowley/go-metrics"       // 指标库(此处禁用)
	"k8s.io/klog/v2" // K8s风格日志(支持分级日志)
)

// 常量定义:指标命名空间、客户端ID
const (
	namespace = "kafka"       // Prometheus指标前缀(如kafka_brokers)
	clientID  = "kafka_exporter" // Kafka客户端标识
)

// 日志级别常量
const (
	INFO  = 0 // 普通日志
	DEBUG = 1 // 调试日志
	TRACE = 2 // 追踪日志(最详细)
)

关键讲解:

      • sarama是 Kafka Go 客户端的事实标准,几乎所有 Kafka Go 项目都基于它
      • kingpin比原生flag更强大,支持参数默认值、类型校验、子命令等
      • klog支持分级日志(V(1)、V(2)),适合不同环境的日志输出控制
      • namespace是 Prometheus 指标的命名规范,确保指标唯一性
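namespace 如何落到最终指标名上,可以用一个标准库小例子示意。下面的 buildFQName 是对 prometheus.BuildFQName 拼接规则的简化复刻(仅作示意,非官方实现):namespace、subsystem、name 以下划线连接,空段跳过。

```go
package main

import (
	"fmt"
	"strings"
)

// buildFQName 简化复刻 prometheus.BuildFQName 的拼接规则:
// namespace、subsystem、name 用 "_" 连接,空段跳过
func buildFQName(namespace, subsystem, name string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	fmt.Println(buildFQName("kafka", "", "brokers"))         // kafka_brokers
	fmt.Println(buildFQName("kafka", "topic", "partitions")) // kafka_topic_partitions
}
```

这也解释了为什么所有指标都以 kafka_ 开头:namespace 常量在描述符创建时被统一拼进指标名。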

3.2、Prometheus 指标描述符定义

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L45

var (
	// 集群级别指标
	clusterBrokers                     *prometheus.Desc // Kafka broker总数
	clusterBrokerInfo                  *prometheus.Desc // Broker信息(ID、地址)
	// Topic级别指标
	topicPartitions                    *prometheus.Desc // Topic分区数
	topicCurrentOffset                 *prometheus.Desc // Topic分区最新offset
	topicOldestOffset                  *prometheus.Desc // Topic分区最旧offset
	topicPartitionLeader               *prometheus.Desc // Topic分区leader broker ID
	topicPartitionReplicas             *prometheus.Desc // Topic分区副本数
	topicPartitionInSyncReplicas       *prometheus.Desc // Topic分区ISR(同步副本)数
	topicPartitionUsesPreferredReplica *prometheus.Desc // 是否使用首选副本(1=是,0=否)
	topicUnderReplicatedPartition      *prometheus.Desc // 是否欠复制(1=是,0=否)
	// 消费组级别指标
	consumergroupCurrentOffset         *prometheus.Desc // 消费组当前offset
	consumergroupCurrentOffsetSum      *prometheus.Desc // 消费组某topic所有分区offset总和
	consumergroupLag                   *prometheus.Desc // 消费组lag(消费延迟)
	consumergroupLagSum                *prometheus.Desc // 消费组某topic所有分区lag总和
	consumergroupLagZookeeper          *prometheus.Desc // ZooKeeper模式下的消费组lag
	consumergroupMembers               *prometheus.Desc // 消费组成员数
) 

关键讲解:

      • prometheus.Desc是 Prometheus 指标的元数据描述,声明指标的名称、帮助文本以及携带的标签集合
      • 指标标签设计遵循 Prometheus 最佳实践:
        • 集群级别:无标签或仅基础标签(如 broker 的 id、address)
        • Topic 级别:topic、partition标签
        • 消费组级别:consumergroup、topic、partition标签
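为直观理解三层标签设计,下面给出这些描述符最终暴露在 /metrics 上的样例行(topic 名、消费组名均为假设值,仅作示意):

```
kafka_brokers 3
kafka_topic_partitions{topic="orders"} 12
kafka_topic_partition_current_offset{topic="orders",partition="0"} 1024
kafka_consumergroup_lag{consumergroup="cg-billing",topic="orders",partition="0"} 42
```

可以看到标签随指标层级逐级细化:集群级无标签,Topic 级带 topic/partition,消费组级再叠加 consumergroup。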

3.3、核心结构体定义

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L65

// Exporter:Kafka指标导出器核心结构体(实现prometheus.Collector接口)
type Exporter struct {
	client                  sarama.Client       // Kafka客户端(sarama核心)
	topicFilter             *regexp.Regexp      // Topic包含过滤正则
	topicExclude            *regexp.Regexp      // Topic排除过滤正则
	groupFilter             *regexp.Regexp      // 消费组包含过滤正则
	groupExclude            *regexp.Regexp      // 消费组排除过滤正则
	mu                      sync.Mutex          // 通用互斥锁(保护共享变量)
	useZooKeeperLag         bool                // 是否使用ZooKeeper获取消费组lag
	zookeeperClient         *kazoo.Kazoo        // ZooKeeper客户端
	nextMetadataRefresh     time.Time           // 下次元数据刷新时间
	metadataRefreshInterval time.Duration       // 元数据刷新间隔
	offsetShowAll           bool                // 是否显示所有消费组offset(包括未连接的)
	topicWorkers            int                 // 采集topic指标的协程数
	allowConcurrent         bool                // 是否允许并发采集(大集群禁用)
	sgMutex                 sync.Mutex          // 并发控制专用锁
	sgWaitCh                chan struct{}       // 并发等待通道
	sgChans                 []chan<- prometheus.Metric // 并发采集的指标通道
	consumerGroupFetchAll   bool                // 是否支持全量消费组抓取(Kafka 2.0+)
}

// kafkaOpts:Kafka连接配置结构体(整合所有命令行参数)
type kafkaOpts struct {
	uri                      []string // Kafka broker地址列表
	useSASL                  bool     // 是否启用SASL认证
	useSASLHandshake         bool     // SASL握手(非Kafka代理需关闭)
	saslUsername             string   // SASL用户名
	saslPassword             string   // SASL密码
	saslMechanism            string   // SASL机制(scram-sha256/512、gssapi、awsiam、plain)
	saslDisablePAFXFast      bool     // Kerberos禁用PA_FX_FAST
	saslAwsRegion            string   // AWS MSK区域
	useTLS                   bool     // Kafka客户端启用TLS
	tlsServerName            string   // TLS服务器名(证书校验)
	tlsCAFile                string   // TLS CA文件
	tlsCertFile              string   // TLS客户端证书
	tlsKeyFile               string   // TLS客户端私钥
	serverUseTLS             bool     // Exporter自身HTTP服务启用TLS
	serverMutualAuthEnabled  bool     // Exporter启用TLS双向认证
	serverTlsCAFile          string   // Exporter TLS CA文件
	serverTlsCertFile        string   // Exporter TLS证书
	serverTlsKeyFile         string   // Exporter TLS私钥
	tlsInsecureSkipTLSVerify bool     // 跳过TLS证书校验(测试用)
	kafkaVersion             string   // Kafka版本(兼容不同协议)
	useZooKeeperLag          bool     // 使用ZooKeeper获取lag
	uriZookeeper             []string // ZooKeeper地址列表
	labels                   string   // 自定义标签(如集群名)
	metadataRefreshInterval  string   // 元数据刷新间隔(字符串格式)
	serviceName              string   // Kerberos服务名
	kerberosConfigPath       string   // Kerberos配置路径
	realm                    string   // Kerberos域
	keyTabPath               string   // Kerberos keytab文件
	kerberosAuthType         string   // Kerberos认证类型(keytab/user)
	offsetShowAll            bool     // 显示所有消费组offset
	topicWorkers             int      // Topic采集协程数
	allowConcurrent          bool     // 允许并发采集
	allowAutoTopicCreation   bool     // 允许自动创建topic
	verbosityLogLevel        int      // 日志级别
}

// MSKAccessTokenProvider:AWS MSK IAM认证的Token提供者(实现sarama.AccessTokenProvider接口)
type MSKAccessTokenProvider struct {
	region string // AWS区域
}

关键讲解:

      • Exporter 实现了prometheus.Collector接口(Describe+Collect方法),是 Prometheus 采集器的核心规范
      • kafkaOpts是典型的 "配置聚合" 模式,将所有命令行参数整合为结构体,避免函数参数爆炸
      • MSKAccessTokenProvider适配 sarama 的AccessTokenProvider接口,实现 AWS IAM 动态生成认证 Token

3.4、证书校验

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L133

// CanReadCertAndKey:校验证书和私钥文件是否可读(成对存在)
func CanReadCertAndKey(certPath, keyPath string) (bool, error) {
	certReadable := canReadFile(certPath) // 检查证书文件
	keyReadable := canReadFile(keyPath)   // 检查私钥文件

	if !certReadable && !keyReadable {
		return false, nil // 都不存在,返回false(无需报错)
	}
	if !certReadable {
		return false, fmt.Errorf("error reading %s, certificate and key must be supplied as a pair", certPath)
	}
	if !keyReadable {
		return false, fmt.Errorf("error reading %s, certificate and key must be supplied as a pair", keyPath)
	}
	return true, nil
}

// canReadFile:检查文件是否存在且可读
func canReadFile(path string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close() // 必须关闭文件句柄,避免泄漏
	return true
}  

关键讲解:

      • 证书校验是 TLS 认证的基础,确保客户端 / 服务端证书成对存在
      • defer f.Close()是 Go 的经典用法,确保资源释放(即使函数出错)

3.5、Exporter 初始化(核心)

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L166

// NewExporter:创建并初始化Exporter实例
func NewExporter(opts kafkaOpts, topicFilter string, topicExclude string, groupFilter string, groupExclude string) (*Exporter, error) {
	var zookeeperClient *kazoo.Kazoo // ZooKeeper客户端(按需初始化)
	config := sarama.NewConfig()     // 创建sarama配置
	config.ClientID = clientID       // 设置Kafka客户端ID

	// 解析Kafka版本(兼容不同协议)
	kafkaVersion, err := sarama.ParseKafkaVersion(opts.kafkaVersion)
	if err != nil {
		return nil, err
	}
	config.Version = kafkaVersion

	// ========== SASL认证配置 ==========
	if opts.useSASL {
		opts.saslMechanism = strings.ToLower(opts.saslMechanism) // 统一转为小写(兼容大小写输入)
		saslPassword := opts.saslPassword
		if saslPassword == "" {
			saslPassword = os.Getenv("SASL_USER_PASSWORD") // 命令行未提供密码时,回退读取环境变量(避免密码暴露在进程参数中)
		}

		// 根据SASL机制配置
		switch opts.saslMechanism {
		case "scram-sha512": // SCRAM-SHA512
			config.Net.SASL.SCRAMClientGeneratorFunc = func() sarama.SCRAMClient { return &XDGSCRAMClient{HashGeneratorFcn: SHA512} }
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeSCRAMSHA512)
		case "scram-sha256": // SCRAM-SHA256
			config.Net.SASL.SCRAMClientGeneratorFunc = func() sarama.SCRAMClient { return &XDGSCRAMClient{HashGeneratorFcn: SHA256} }
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeSCRAMSHA256)
		case "gssapi": // Kerberos
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeGSSAPI)
			config.Net.SASL.GSSAPI.ServiceName = opts.serviceName
			config.Net.SASL.GSSAPI.KerberosConfigPath = opts.kerberosConfigPath
			config.Net.SASL.GSSAPI.Realm = opts.realm
			config.Net.SASL.GSSAPI.Username = opts.saslUsername
			// Kerberos认证类型(keytab/用户密码)
			if opts.kerberosAuthType == "keytabAuth" {
				config.Net.SASL.GSSAPI.AuthType = sarama.KRB5_KEYTAB_AUTH
				config.Net.SASL.GSSAPI.KeyTabPath = opts.keyTabPath
			} else {
				config.Net.SASL.GSSAPI.AuthType = sarama.KRB5_USER_AUTH
				config.Net.SASL.GSSAPI.Password = saslPassword
			}
			if opts.saslDisablePAFXFast {
				config.Net.SASL.GSSAPI.DisablePAFXFAST = true
			}
		case "awsiam": // AWS MSK IAM认证
			config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeOAuth)
			config.Net.SASL.TokenProvider = &MSKAccessTokenProvider{region: opts.saslAwsRegion}
		case "plain": // PLAIN(基础认证)
		default:
			return nil, fmt.Errorf(
				`invalid sasl mechanism "%s": can only be "scram-sha256", "scram-sha512", "gssapi", "awsiam" or "plain"`,
				opts.saslMechanism,
			)
		}

		// 启用SASL基础配置
		config.Net.SASL.Enable = true
		config.Net.SASL.Handshake = opts.useSASLHandshake
		if opts.saslUsername != "" {
			config.Net.SASL.User = opts.saslUsername
		}
		if saslPassword != "" {
			config.Net.SASL.Password = saslPassword
		}
	}

	// ========== TLS配置 ==========
	if opts.useTLS {
		config.Net.TLS.Enable = true
		config.Net.TLS.Config = &tls.Config{
			ServerName:         opts.tlsServerName,         // 证书校验的服务器名
			InsecureSkipVerify: opts.tlsInsecureSkipTLSVerify, // 跳过证书校验(测试用)
		}

		// 加载CA证书(自定义CA)
		if opts.tlsCAFile != "" {
			if ca, err := os.ReadFile(opts.tlsCAFile); err == nil {
				config.Net.TLS.Config.RootCAs = x509.NewCertPool()
				config.Net.TLS.Config.RootCAs.AppendCertsFromPEM(ca)
			} else {
				return nil, err
			}
		}

		// 加载客户端证书(双向认证)
		canReadCertAndKey, err := CanReadCertAndKey(opts.tlsCertFile, opts.tlsKeyFile)
		if err != nil {
			return nil, errors.Wrap(err, "error reading cert and key")
		}
		if canReadCertAndKey {
			cert, err := tls.LoadX509KeyPair(opts.tlsCertFile, opts.tlsKeyFile)
			if err == nil {
				config.Net.TLS.Config.Certificates = []tls.Certificate{cert}
			} else {
				return nil, err
			}
		}
	}

	// ========== ZooKeeper配置(兼容老版本消费组) ==========
	if opts.useZooKeeperLag {
		klog.V(DEBUG).Infoln("Using zookeeper lag, so connecting to zookeeper")
		zookeeperClient, err = kazoo.NewKazoo(opts.uriZookeeper, nil)
		if err != nil {
			return nil, errors.Wrap(err, "error connecting to zookeeper")
		}
	}

	// ========== 元数据刷新配置 ==========
	interval, err := time.ParseDuration(opts.metadataRefreshInterval)
	if err != nil {
		return nil, errors.Wrap(err, "Cannot parse metadata refresh interval")
	}
	config.Metadata.RefreshFrequency = interval // sarama元数据刷新频率
	config.Metadata.AllowAutoTopicCreation = opts.allowAutoTopicCreation // 是否允许自动创建topic(生产环境建议关闭)

	// ========== 创建Kafka客户端 ==========
	client, err := sarama.NewClient(opts.uri, config)
	if err != nil {
		return nil, errors.Wrap(err, "Error Init Kafka Client")
	}

	klog.V(TRACE).Infoln("Done Init Clients")
	// ========== 初始化Exporter ==========
	return &Exporter{
		client:                  client,
		topicFilter:             regexp.MustCompile(topicFilter), // 编译正则(预编译提升性能)
		topicExclude:            regexp.MustCompile(topicExclude),
		groupFilter:             regexp.MustCompile(groupFilter),
		groupExclude:            regexp.MustCompile(groupExclude),
		useZooKeeperLag:         opts.useZooKeeperLag,
		zookeeperClient:         zookeeperClient,
		nextMetadataRefresh:     time.Now(), // 下次刷新时间(初始为当前时间)
		metadataRefreshInterval: interval,
		offsetShowAll:           opts.offsetShowAll,
		topicWorkers:            opts.topicWorkers,
		allowConcurrent:         opts.allowConcurrent,
		sgMutex:                 sync.Mutex{},
		sgWaitCh:                nil,
		sgChans:                 []chan<- prometheus.Metric{},
		consumerGroupFetchAll:   config.Version.IsAtLeast(sarama.V2_0_0_0), // Kafka 2.0+支持全量消费组抓取
	}, nil
}

// fetchOffsetVersion:根据Kafka版本选择OffsetFetchRequest版本(兼容不同协议)
func (e *Exporter) fetchOffsetVersion() int16 {
	version := e.client.Config().Version
	if e.client.Config().Version.IsAtLeast(sarama.V2_0_0_0) {
		return 4 // Kafka 2.0+使用版本4
	} else if version.IsAtLeast(sarama.V0_10_2_0) {
		return 2 // Kafka 0.10.2+使用版本2
	} else if version.IsAtLeast(sarama.V0_8_2_2) {
		return 1 // Kafka 0.8.2+使用版本1
	}
	return 0 // 老版本使用默认
}  

关键讲解:

      • 配置回退:命令行未提供 SASL 密码时回退读取环境变量SASL_USER_PASSWORD,避免密码出现在进程参数中,符合 12-Factor 应用规范
      • 正则预编译:regexp.MustCompile预编译正则表达式,避免每次采集时重复编译(性能优化)
      • 错误包装:errors.Wrap添加上下文信息,便于问题定位(如 "error connecting to zookeeper")
      • 版本兼容:Kafka 不同版本的协议差异通过fetchOffsetVersion适配,确保跨版本兼容性

3.6、Prometheus Collector 接口实现

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L320

// Describe:暴露所有指标描述符(Prometheus规范)
func (e *Exporter) Describe(ch chan<- *prometheus.Desc) {
	ch <- clusterBrokers
	ch <- topicCurrentOffset
	ch <- topicOldestOffset
	ch <- topicPartitions
	ch <- topicPartitionLeader
	ch <- topicPartitionReplicas
	ch <- topicPartitionInSyncReplicas
	ch <- topicPartitionUsesPreferredReplica
	ch <- topicUnderReplicatedPartition
	ch <- consumergroupCurrentOffset
	ch <- consumergroupCurrentOffsetSum
	ch <- consumergroupLag
	ch <- consumergroupLagZookeeper
	ch <- consumergroupLagSum
}

// Collect:核心采集逻辑(Prometheus调用此方法抓取指标)
func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
	if e.allowConcurrent {
		e.collect(ch) // 并发模式(大集群禁用)
		return
	}
	// 非并发模式:所有采集请求共享结果(避免重复采集)
	e.sgMutex.Lock()
	e.sgChans = append(e.sgChans, ch)
	if len(e.sgChans) == 1 {
		e.sgWaitCh = make(chan struct{})
		go e.collectChans(e.sgWaitCh) // 启动单例采集
	} else {
		klog.V(TRACE).Info("concurrent calls detected, waiting for first to finish")
	}
	waiter := e.sgWaitCh
	e.sgMutex.Unlock()
	<-waiter // 等待采集完成
}

// collectChans:非并发模式的采集结果分发
func (e *Exporter) collectChans(quit chan struct{}) {
	original := make(chan prometheus.Metric)
	container := make([]prometheus.Metric, 0, 100) // 缓存指标(避免重复采集)
	// 异步接收指标
	go func() {
		for metric := range original {
			container = append(container, metric)
		}
	}()
	// 执行采集
	e.collect(original)
	close(original)
	// 分发指标到所有等待的通道
	e.sgMutex.Lock()
	for _, ch := range e.sgChans {
		for _, metric := range container {
			ch <- metric
		}
	}
	e.sgChans = e.sgChans[:0] // 重置通道列表
	close(quit)               // 通知所有等待的请求
	e.sgMutex.Unlock()
}

// collect:核心采集逻辑(真正的指标采集)
func (e *Exporter) collect(ch chan<- prometheus.Metric) {
	var wg = sync.WaitGroup{} // 协程等待组

	// ========== 1. 采集Broker指标 ==========
	ch <- prometheus.MustNewConstMetric(
		clusterBrokers, prometheus.GaugeValue, float64(len(e.client.Brokers())),
	)
	// 采集每个Broker的信息
	for _, b := range e.client.Brokers() {
		ch <- prometheus.MustNewConstMetric(
			clusterBrokerInfo, prometheus.GaugeValue, 1, strconv.Itoa(int(b.ID())), b.Addr(),
		)
	}

	// ========== 2. 元数据刷新 ==========
	offset := make(map[string]map[int32]int64) // 缓存topic-partition的最新offset
	now := time.Now()
	if now.After(e.nextMetadataRefresh) {
		klog.V(DEBUG).Info("Refreshing client metadata")
		if err := e.client.RefreshMetadata(); err != nil {
			klog.Errorf("Cannot refresh topics, using cached data: %v", err)
		}
		e.nextMetadataRefresh = now.Add(e.metadataRefreshInterval) // 更新下次刷新时间
	}

	// ========== 3. 获取Topic列表 ==========
	topics, err := e.client.Topics()
	if err != nil {
		klog.Errorf("Cannot get topics: %v", err)
		return
	}

	// ========== 4. 并发采集Topic指标 ==========
	topicChannel := make(chan string) // Topic分发通道

	// getTopicMetrics:单个Topic的指标采集逻辑
	getTopicMetrics := func(topic string) {
		defer wg.Done() // 协程完成通知
		// 过滤Topic(包含+排除)
		if !e.topicFilter.MatchString(topic) || e.topicExclude.MatchString(topic) {
			return
		}

		// 获取Topic分区列表
		partitions, err := e.client.Partitions(topic)
		if err != nil {
			klog.Errorf("Cannot get partitions of topic %s: %v", topic, err)
			return
		}
		// 采集Topic分区数
		ch <- prometheus.MustNewConstMetric(
			topicPartitions, prometheus.GaugeValue, float64(len(partitions)), topic,
		)
		// 初始化offset缓存
		e.mu.Lock()
		offset[topic] = make(map[int32]int64, len(partitions))
		e.mu.Unlock()

		// 遍历分区采集指标
		for _, partition := range partitions {
			// 分区Leader Broker
			broker, err := e.client.Leader(topic, partition)
			if err != nil {
				klog.Errorf("Cannot get leader of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionLeader, prometheus.GaugeValue, float64(broker.ID()), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// 分区最新Offset(OffsetNewest)
			currentOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
			if err != nil {
				klog.Errorf("Cannot get current offset of topic %s partition %d: %v", topic, partition, err)
			} else {
				e.mu.Lock()
				offset[topic][partition] = currentOffset // 缓存offset
				e.mu.Unlock()
				ch <- prometheus.MustNewConstMetric(
					topicCurrentOffset, prometheus.GaugeValue, float64(currentOffset), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// 分区最旧Offset(OffsetOldest)
			oldestOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetOldest)
			if err != nil {
				klog.Errorf("Cannot get oldest offset of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicOldestOffset, prometheus.GaugeValue, float64(oldestOffset), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// 分区副本数
			replicas, err := e.client.Replicas(topic, partition)
			if err != nil {
				klog.Errorf("Cannot get replicas of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionReplicas, prometheus.GaugeValue, float64(len(replicas)), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// 分区ISR(同步副本)数
			inSyncReplicas, err := e.client.InSyncReplicas(topic, partition)
			if err != nil {
				klog.Errorf("Cannot get in-sync replicas of topic %s partition %d: %v", topic, partition, err)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionInSyncReplicas, prometheus.GaugeValue, float64(len(inSyncReplicas)), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// 是否使用首选副本
			if broker != nil && replicas != nil && len(replicas) > 0 && broker.ID() == replicas[0] {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionUsesPreferredReplica, prometheus.GaugeValue, float64(1), topic, strconv.FormatInt(int64(partition), 10),
				)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicPartitionUsesPreferredReplica, prometheus.GaugeValue, float64(0), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// 是否欠复制(ISR数 < 副本数)
			if replicas != nil && inSyncReplicas != nil && len(inSyncReplicas) < len(replicas) {
				ch <- prometheus.MustNewConstMetric(
					topicUnderReplicatedPartition, prometheus.GaugeValue, float64(1), topic, strconv.FormatInt(int64(partition), 10),
				)
			} else {
				ch <- prometheus.MustNewConstMetric(
					topicUnderReplicatedPartition, prometheus.GaugeValue, float64(0), topic, strconv.FormatInt(int64(partition), 10),
				)
			}

			// ZooKeeper模式下的消费组Lag
			if e.useZooKeeperLag {
				ConsumerGroups, err := e.zookeeperClient.Consumergroups()
				if err != nil {
					klog.Errorf("Cannot get consumer group %v", err)
				}
				for _, group := range ConsumerGroups {
					offset, _ := group.FetchOffset(topic, partition)
					if offset > 0 {
						consumerGroupLag := currentOffset - offset
						ch <- prometheus.MustNewConstMetric(
							consumergroupLagZookeeper, prometheus.GaugeValue, float64(consumerGroupLag), group.Name, topic, strconv.FormatInt(int64(partition), 10),
						)
					}
				}
			}
		}
	}

	// loopTopics:协程循环处理Topic通道
	loopTopics := func() {
		ok := true
		for ok {
			topic, open := <-topicChannel
			ok = open
			if open {
				getTopicMetrics(topic)
			}
		}
	}

	// 协程数控制(避免协程爆炸)
	minx := func(x int, y int) int {
		if x < y {
			return x
		} else {
			return y
		}
	}
	N := len(topics)
	if N > 1 {
		N = minx(N/2, e.topicWorkers) // 协程数 = min(Topic数/2, 配置的worker数)
	}

	// 启动Topic采集协程
	for w := 1; w <= N; w++ {
		go loopTopics()
	}

	// 分发Topic到协程通道
	for _, topic := range topics {
		if e.topicFilter.MatchString(topic) && !e.topicExclude.MatchString(topic) {
			wg.Add(1)
			topicChannel <- topic
		}
	}
	close(topicChannel) // 关闭通道,协程退出
	wg.Wait()           // 等待所有Topic采集完成

	// ========== 5. 采集消费组指标 ==========
	getConsumerGroupMetrics := func(broker *sarama.Broker) {
		defer wg.Done()
		// 连接Broker(采集消费组需要直接连接Broker)
		if err := broker.Open(e.client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
			klog.Errorf("Cannot connect to broker %d: %v", broker.ID(), err)
			return
		}
		defer broker.Close() // 确保关闭连接

		// 列出所有消费组
		groups, err := broker.ListGroups(&sarama.ListGroupsRequest{})
		if err != nil {
			klog.Errorf("Cannot get consumer group: %v", err)
			return
		}
		// 过滤消费组
		groupIds := make([]string, 0)
		for groupId := range groups.Groups {
			if e.groupFilter.MatchString(groupId) && !e.groupExclude.MatchString(groupId) {
				groupIds = append(groupIds, groupId)
			}
		}

		// 描述消费组(获取成员、分配的topic/分区)
		describeGroups, err := broker.DescribeGroups(&sarama.DescribeGroupsRequest{Groups: groupIds})
		if err != nil {
			klog.Errorf("Cannot get describe groups: %v", err)
			return
		}

		// 遍历消费组采集指标
		for _, group := range describeGroups.Groups {
			if group.Err != 0 {
				klog.Errorf("Cannot describe for the group %s with error code %d", group.GroupId, group.Err)
				continue
			}

			// 构建OffsetFetchRequest(获取消费组offset)
			offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: e.fetchOffsetVersion()}
			if e.offsetShowAll {
				// 显示所有Topic/分区的offset(包括未消费的)
				for topic, partitions := range offset {
					for partition := range partitions {
						offsetFetchRequest.AddPartition(topic, partition)
					}
				}
			} else {
				// 仅显示消费组已分配的Topic/分区
				for _, member := range group.Members {
					if len(member.MemberAssignment) == 0 {
						klog.Warningf("MemberAssignment is empty for group member: %v in group: %v", member.MemberId, group.GroupId)
						continue
					}
					// 解析成员分配的Topic/分区
					assignment, err := member.GetMemberAssignment()
					if err != nil {
						klog.Errorf("Cannot get GetMemberAssignment of group member %v : %v", member, err)
						continue
					}
					for topic, partions := range assignment.Topics {
						for _, partition := range partions {
							offsetFetchRequest.AddPartition(topic, partition)
						}
					}
				}
			}

			// 采集消费组成员数
			ch <- prometheus.MustNewConstMetric(
				consumergroupMembers, prometheus.GaugeValue, float64(len(group.Members)), group.GroupId,
			)

			// 获取消费组offset
			offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)
			if err != nil {
				klog.Errorf("Cannot get offset of group %s: %v", group.GroupId, err)
				continue
			}

			// 遍历消费组的Topic/分区
			for topic, partitions := range offsetFetchResponse.Blocks {
				// 过滤未消费的Topic
				topicConsumed := false
				for _, offsetFetchResponseBlock := range partitions {
					if offsetFetchResponseBlock.Offset != -1 { // Kafka返回-1表示无offset
						topicConsumed = true
						break
					}
				}
				if !topicConsumed {
					continue
				}

				// 计算总和指标
				var currentOffsetSum int64
				var lagSum int64
				for partition, offsetFetchResponseBlock := range partitions {
					err := offsetFetchResponseBlock.Err
					if err != sarama.ErrNoError {
						klog.Errorf("Error for  partition %d :%v", partition, err.Error())
						continue
					}

					// 消费组当前offset
					currentOffset := offsetFetchResponseBlock.Offset
					currentOffsetSum += currentOffset
					ch <- prometheus.MustNewConstMetric(
						consumergroupCurrentOffset, prometheus.GaugeValue, float64(currentOffset), group.GroupId, topic, strconv.FormatInt(int64(partition), 10),
					)

					// 计算消费lag(最新offset - 消费组offset)
					e.mu.Lock()
					currentPartitionOffset, currentPartitionOffsetError := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
					if currentPartitionOffsetError != nil {
						klog.Errorf("Cannot get current offset of topic %s partition %d: %v", topic, partition, currentPartitionOffsetError)
					} else {
						var lag int64
						if offsetFetchResponseBlock.Offset == -1 {
							lag = -1 // 无offset时lag为-1
						} else {
							if offset, ok := offset[topic][partition]; ok {
								if currentPartitionOffset == -1 {
									currentPartitionOffset = offset
								}
							}
							lag = currentPartitionOffset - offsetFetchResponseBlock.Offset
							lagSum += lag
						}
						// 采集lag指标
						ch <- prometheus.MustNewConstMetric(
							consumergroupLag, prometheus.GaugeValue, float64(lag), group.GroupId, topic, strconv.FormatInt(int64(partition), 10),
						)
					}
					e.mu.Unlock()
				}

				// 采集总和指标
				ch <- prometheus.MustNewConstMetric(
					consumergroupCurrentOffsetSum, prometheus.GaugeValue, float64(currentOffsetSum), group.GroupId, topic,
				)
				ch <- prometheus.MustNewConstMetric(
					consumergroupLagSum, prometheus.GaugeValue, float64(lagSum), group.GroupId, topic,
				)
			}
		}
	}

	// 采集消费组指标(遍历Broker)
	klog.V(DEBUG).Info("Fetching consumer group metrics")
	if len(e.client.Brokers()) > 0 {
		// 去重Broker地址(避免重复采集)
		uniqueBrokerAddresses := make(map[string]bool)
		var servers []string
		for _, broker := range e.client.Brokers() {
			normalizedAddress := strings.ToLower(broker.Addr())
			if !uniqueBrokerAddresses[normalizedAddress] {
				uniqueBrokerAddresses[normalizedAddress] = true
				servers = append(servers, broker.Addr())
			}
		}
		klog.Info(servers)
		// 遍历Broker采集消费组
		for _, broker := range e.client.Brokers() {
			for _, server := range servers {
				if server == broker.Addr() {
					wg.Add(1)
					go getConsumerGroupMetrics(broker)
				}
			}
		}
		wg.Wait()
	} else {
		klog.Errorln("No valid broker, cannot get consumer group metrics")
	}
}

关键讲解:

      • Collect/Describe:Prometheus Collector 接口的两个核心方法,Describe 暴露指标元数据,Collect 采集指标值
      • 并发控制:
        • 非并发模式:所有 Prometheus 抓取请求共享一次采集结果(避免重复请求 Kafka,减轻集群压力)
        • 并发模式:每个请求独立采集(适合小集群)
      • 协程池:Topic 采集使用协程池(topicWorkers),避免协程爆炸(大集群关键优化)
      • 消费组 Lag 计算:lag = 最新offset - 消费组offset,是 Kafka 消费延迟的核心指标
      • 去重 Broker:消费组采集时去重 Broker 地址,避免重复采集(Kafka 集群中多个 Broker 返回相同消费组信息)

3.7、初始化与 main 函数

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L716 init() 函数位置

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L757 main() 函数位置

// init:包初始化(优先于main执行)
func init() {
	metrics.UseNilMetrics = true // 禁用go-metrics(避免资源占用)
	prometheus.MustRegister(versionCollector.NewCollector("kafka_exporter")) // 注册版本指标
}

// 命令行参数工具函数(兼容flag和kingpin)
func toFlagString(name string, help string, value string) *string {
	flag.CommandLine.String(name, value, help) // 兼容原生flag
	return kingpin.Flag(name, help).Default(value).String()
}

func toFlagBool(name string, help string, value bool, valueString string) *bool {
	flag.CommandLine.Bool(name, value, help)
	return kingpin.Flag(name, help).Default(valueString).Bool()
}

func toFlagStringsVar(name string, help string, value string, target *[]string) {
	flag.CommandLine.String(name, value, help)
	kingpin.Flag(name, help).Default(value).StringsVar(target)
}

func toFlagStringVar(name string, help string, value string, target *string) {
	flag.CommandLine.String(name, value, help)
	kingpin.Flag(name, help).Default(value).StringVar(target)
}

func toFlagBoolVar(name string, help string, value bool, valueString string, target *bool) {
	flag.CommandLine.Bool(name, value, help)
	kingpin.Flag(name, help).Default(valueString).BoolVar(target)
}

func toFlagIntVar(name string, help string, value int, valueString string, target *int) {
	flag.CommandLine.Int(name, value, help)
	kingpin.Flag(name, help).Default(valueString).IntVar(target)
}

// main:程序入口
func main() {
	var (
		listenAddress = toFlagString("web.listen-address", "Address to listen on for web interface and telemetry.", ":9308")
		metricsPath   = toFlagString("web.telemetry-path", "Path under which to expose metrics.", "/metrics")
		topicFilter   = toFlagString("topic.filter", "Regex that determines which topics to collect.", ".*")
		topicExclude  = toFlagString("topic.exclude", "Regex that determines which topics to exclude.", "^$")
		groupFilter   = toFlagString("group.filter", "Regex that determines which consumer groups to collect.", ".*")
		groupExclude  = toFlagString("group.exclude", "Regex that determines which consumer groups to exclude.", "^$")
		logSarama     = toFlagBool("log.enable-sarama", "Turn on Sarama logging, default is false.", false, "false")

		opts = kafkaOpts{} // 初始化Kafka配置
	)

	// 绑定命令行参数到opts结构体
	toFlagStringsVar("kafka.server", "Address (host:port) of Kafka server.", "kafka:9092", &opts.uri)
	toFlagBoolVar("sasl.enabled", "Connect using SASL/PLAIN, default is false.", false, "false", &opts.useSASL)
	toFlagBoolVar("sasl.handshake", "Only set this to false if using a non-Kafka SASL proxy, default is true.", true, "true", &opts.useSASLHandshake)
	toFlagStringVar("sasl.username", "SASL user name.", "", &opts.saslUsername)
	toFlagStringVar("sasl.password", "SASL user password.", "", &opts.saslPassword)
	toFlagStringVar("sasl.aws-region", "The AWS region for IAM SASL authentication", os.Getenv("AWS_REGION"), &opts.saslAwsRegion)
	toFlagStringVar("sasl.mechanism", "SASL SCRAM SHA algorithm: sha256 or sha512 or SASL mechanism: gssapi or awsiam", "", &opts.saslMechanism)
	toFlagStringVar("sasl.service-name", "Service name when using kerberos Auth", "", &opts.serviceName)
	toFlagStringVar("sasl.kerberos-config-path", "Kerberos config path", "", &opts.kerberosConfigPath)
	toFlagStringVar("sasl.realm", "Kerberos realm", "", &opts.realm)
	toFlagStringVar("sasl.kerberos-auth-type", "Kerberos auth type. Either 'keytabAuth' or 'userAuth'", "", &opts.kerberosAuthType)
	toFlagStringVar("sasl.keytab-path", "Kerberos keytab file path", "", &opts.keyTabPath)
	toFlagBoolVar("sasl.disable-PA-FX-FAST", "Configure the Kerberos client to not use PA_FX_FAST, default is false.", false, "false", &opts.saslDisablePAFXFast)
	toFlagBoolVar("tls.enabled", "Connect to Kafka using TLS, default is false.", false, "false", &opts.useTLS)
	toFlagStringVar("tls.server-name", "Used to verify the hostname on the returned certificates unless tls.insecure-skip-tls-verify is given. The kafka server's name should be given.", "", &opts.tlsServerName)
	toFlagStringVar("tls.ca-file", "The optional certificate authority file for Kafka TLS client authentication.", "", &opts.tlsCAFile)
	toFlagStringVar("tls.cert-file", "The optional certificate file for Kafka client authentication.", "", &opts.tlsCertFile)
	toFlagStringVar("tls.key-file", "The optional key file for Kafka client authentication.", "", &opts.tlsKeyFile)
	toFlagBoolVar("server.tls.enabled", "Enable TLS for web server, default is false.", false, "false", &opts.serverUseTLS)
	toFlagBoolVar("server.tls.mutual-auth-enabled", "Enable TLS client mutual authentication, default is false.", false, "false", &opts.serverMutualAuthEnabled)
	toFlagStringVar("server.tls.ca-file", "The certificate authority file for the web server.", "", &opts.serverTlsCAFile)
	toFlagStringVar("server.tls.cert-file", "The certificate file for the web server.", "", &opts.serverTlsCertFile)
	toFlagStringVar("server.tls.key-file", "The key file for the web server.", "", &opts.serverTlsKeyFile)
	toFlagBoolVar("tls.insecure-skip-tls-verify", "If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure. Default is false", false, "false", &opts.tlsInsecureSkipTLSVerify)
	toFlagStringVar("kafka.version", "Kafka broker version", sarama.V2_0_0_0.String(), &opts.kafkaVersion)
	toFlagBoolVar("use.consumelag.zookeeper", "if you need to use a group from zookeeper, default is false", false, "false", &opts.useZooKeeperLag)
	toFlagStringsVar("zookeeper.server", "Address (hosts) of zookeeper server.", "localhost:2181", &opts.uriZookeeper)
	toFlagStringVar("kafka.labels", "Kafka cluster name", "", &opts.labels)
	toFlagStringVar("refresh.metadata", "Metadata refresh interval", "30s", &opts.metadataRefreshInterval)
	toFlagBoolVar("offset.show-all", "Whether show the offset/lag for all consumer group, otherwise, only show connected consumer groups, default is true", true, "true", &opts.offsetShowAll)
	toFlagBoolVar("concurrent.enable", "If true, all scrapes will trigger kafka operations otherwise, they will share results. WARN: This should be disabled on large clusters. Default is false", false, "false", &opts.allowConcurrent)
	toFlagIntVar("topic.workers", "Number of topic workers", 100, "100", &opts.topicWorkers)
	toFlagBoolVar("kafka.allow-auto-topic-creation", "If true, the broker may auto-create topics that we requested which do not already exist, default is false.", false, "false", &opts.allowAutoTopicCreation)
	toFlagIntVar("verbosity", "Verbosity log level", 0, "0", &opts.verbosityLogLevel)

	plConfig := plog.Config{}
	plogflag.AddFlags(kingpin.CommandLine, &plConfig) // 绑定Prometheus日志配置
	kingpin.Version(version.Print("kafka_exporter"))  // 版本信息
	kingpin.HelpFlag.Short('h')                       // 短参数-h
	kingpin.Parse()                                   // 解析命令行参数

	// 解析自定义标签(如cluster=kafka-prod)
	labels := make(map[string]string)
	if opts.labels != "" {
		for _, label := range strings.Split(opts.labels, ",") {
			splitted := strings.Split(label, "=")
			if len(splitted) >= 2 {
				labels[splitted[0]] = splitted[1]
			}
		}
	}

	// 启动Exporter
	setup(*listenAddress, *metricsPath, *topicFilter, *topicExclude, *groupFilter, *groupExclude, *logSarama, opts, labels)
}

// setup:初始化并启动HTTP服务
func setup(
	listenAddress string,
	metricsPath string,
	topicFilter string,
	topicExclude string,
	groupFilter string,
	groupExclude string,
	logSarama bool,
	opts kafkaOpts,
	labels map[string]string,
) {
	// 初始化klog
	klog.InitFlags(flag.CommandLine)
	if err := flag.Set("logtostderr", "true"); err != nil {
		klog.Errorf("Error on setting logtostderr to true: %v", err)
	}
	err := flag.Set("v", strconv.Itoa(opts.verbosityLogLevel))
	if err != nil {
		klog.Errorf("Error on setting v to %v: %v", strconv.Itoa(opts.verbosityLogLevel), err)
	}
	defer klog.Flush() // 确保日志刷出

	// 打印启动信息
	klog.V(INFO).Infoln("Starting kafka_exporter", version.Info())
	klog.V(DEBUG).Infoln("Build context", version.BuildContext())

	// ========== 初始化Prometheus指标描述符 ==========
	clusterBrokers = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "", "brokers"),
		"Number of Brokers in the Kafka Cluster.",
		nil, labels,
	)
	clusterBrokerInfo = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "", "broker_info"),
		"Information about the Kafka Broker.",
		[]string{"id", "address"}, labels,
	)
	topicPartitions = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partitions"),
		"Number of partitions for this Topic",
		[]string{"topic"}, labels,
	)
	topicCurrentOffset = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_current_offset"),
		"Current Offset of a Broker at Topic/Partition",
		[]string{"topic", "partition"}, labels,
	)
	topicOldestOffset = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_oldest_offset"),
		"Oldest Offset of a Broker at Topic/Partition",
		[]string{"topic", "partition"}, labels,
	)
	topicPartitionLeader = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_leader"),
		"Leader Broker ID of this Topic/Partition",
		[]string{"topic", "partition"}, labels,
	)
	topicPartitionReplicas = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_replicas"),
		"Number of Replicas for this Topic/Partition",
		[]string{"topic", "partition"}, labels,
	)
	topicPartitionInSyncReplicas = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_in_sync_replica"),
		"Number of In-Sync Replicas for this Topic/Partition",
		[]string{"topic", "partition"}, labels,
	)
	topicPartitionUsesPreferredReplica = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_leader_is_preferred"),
		"1 if Topic/Partition is using the Preferred Broker",
		[]string{"topic", "partition"}, labels,
	)
	topicUnderReplicatedPartition = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "topic", "partition_under_replicated_partition"),
		"1 if Topic/Partition is under Replicated",
		[]string{"topic", "partition"}, labels,
	)
	consumergroupCurrentOffset = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "consumergroup", "current_offset"),
		"Current Offset of a ConsumerGroup at Topic/Partition",
		[]string{"consumergroup", "topic", "partition"}, labels,
	)
	consumergroupCurrentOffsetSum = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "consumergroup", "current_offset_sum"),
		"Current Offset of a ConsumerGroup at Topic for all partitions",
		[]string{"consumergroup", "topic"}, labels,
	)
	consumergroupLag = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "consumergroup", "lag"),
		"Current Approximate Lag of a ConsumerGroup at Topic/Partition",
		[]string{"consumergroup", "topic", "partition"}, labels,
	)
	consumergroupLagZookeeper = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "consumergroupzookeeper", "lag_zookeeper"),
		"Current Approximate Lag(zookeeper) of a ConsumerGroup at Topic/Partition",
		[]string{"consumergroup", "topic", "partition"}, nil,
	)
	consumergroupLagSum = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "consumergroup", "lag_sum"),
		"Current Approximate Lag of a ConsumerGroup at Topic for all partitions",
		[]string{"consumergroup", "topic"}, labels,
	)
	consumergroupMembers = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "consumergroup", "members"),
		"Amount of members in a consumer group",
		[]string{"consumergroup"}, labels,
	)

	// 启用Sarama日志(调试用)
	if logSarama {
		sarama.Logger = log.New(os.Stdout, "[sarama] ", log.LstdFlags)
	}

	// 创建Exporter实例
	exporter, err := NewExporter(opts, topicFilter, topicExclude, groupFilter, groupExclude)
	if err != nil {
		klog.Fatalln(err)
	}
	defer exporter.client.Close() // 确保关闭Kafka客户端
	prometheus.MustRegister(exporter) // 注册Exporter到Prometheus

	// ========== 配置HTTP路由 ==========
	http.Handle(metricsPath, promhttp.Handler()) // 指标暴露接口
	// 首页(引导到指标页面)
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		_, err := w.Write([]byte(`<html>
	        <head><title>Kafka Exporter</title></head>
	        <body>
	        <h1>Kafka Exporter</h1>
	        <p><a href='` + metricsPath + `'>Metrics</a></p>
	        </body>
	        </html>`))
		if err != nil {
			klog.Error("Error handle / request", err)
		}
	})
	// 健康检查接口
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		_, err := w.Write([]byte("ok"))
		if err != nil {
			klog.Error("Error handle /healthz request", err)
		}
	})

	// ========== 启动HTTP/HTTPS服务 ==========
	if opts.serverUseTLS {
		klog.V(INFO).Infoln("Listening on HTTPS", listenAddress)
		// 校验服务端证书
		_, err := CanReadCertAndKey(opts.serverTlsCertFile, opts.serverTlsKeyFile)
		if err != nil {
			klog.Error("error reading server cert and key")
		}
		// 配置TLS双向认证
		clientAuthType := tls.NoClientCert
		if opts.serverMutualAuthEnabled {
			clientAuthType = tls.RequireAndVerifyClientCert
		}
		// 加载CA证书
		certPool := x509.NewCertPool()
		if opts.serverTlsCAFile != "" {
			if caCert, err := os.ReadFile(opts.serverTlsCAFile); err == nil {
				certPool.AppendCertsFromPEM(caCert)
			} else {
				klog.Error("error reading server ca")
			}
		}
		// TLS配置(安全最佳实践)
		tlsConfig := &tls.Config{
			ClientCAs:                certPool,
			ClientAuth:               clientAuthType,
			MinVersion:               tls.VersionTLS12, // 禁用老版本TLS
			CurvePreferences:         []tls.CurveID{tls.CurveP521, tls.CurveP384, tls.CurveP256}, // 安全曲线
			PreferServerCipherSuites: true,
			CipherSuites: []uint16{ // 安全加密套件
				tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
				tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
				tls.TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,
				tls.TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,
				tls.TLS_RSA_WITH_AES_256_GCM_SHA384,
				tls.TLS_RSA_WITH_AES_256_CBC_SHA,
				tls.TLS_RSA_WITH_AES_128_CBC_SHA256,
			},
		}
		// 启动HTTPS服务
		server := &http.Server{
			Addr:      listenAddress,
			TLSConfig: tlsConfig,
		}
		klog.Fatal(server.ListenAndServeTLS(opts.serverTlsCertFile, opts.serverTlsKeyFile))
	} else {
		// 启动HTTP服务
		klog.V(INFO).Infoln("Listening on HTTP", listenAddress)
		klog.Fatal(http.ListenAndServe(listenAddress, nil))
	}
}

关键讲解:

      • 命令行参数适配:兼容原生flag和kingpin,兼顾不同使用习惯
      • 指标命名规范:prometheus.BuildFQName生成规范的指标名(如kafka_topic_partitions)
      • 健康检查接口:/healthz 是云原生应用的标准约定,可直接用作 k8s 存活探针
      • TLS 最佳实践:禁用老版本 TLS、使用安全加密套件、支持双向认证(生产环境必备)

四、核心设计思路总结

4.1、核心流程:指标采集实现

collect 方法是指标采集的核心,按“Broker → Topic → 消费组”的顺序采集所有指标:

4.1.1、Broker 指标采集

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L387

// 采集Broker总数
ch <- prometheus.MustNewConstMetric(
    clusterBrokers, prometheus.GaugeValue, float64(len(e.client.Brokers())),
)
// 采集每个Broker的ID和地址
for _, b := range e.client.Brokers() {
    ch <- prometheus.MustNewConstMetric(
        clusterBrokerInfo, prometheus.GaugeValue, 1, strconv.Itoa(int(b.ID())), b.Addr(),
    )
}

4.1.2、Topic 指标采集(协程池优化) 

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L418

Topic 采集是性能敏感点,Kafka Exporter 使用协程池控制并发数,避免协程爆炸:

// Topic分发通道
topicChannel := make(chan string)
// 单个Topic采集逻辑
getTopicMetrics := func(topic string) {
    defer wg.Done()
    // 过滤Topic(包含+排除正则)
    if !e.topicFilter.MatchString(topic) || e.topicExclude.MatchString(topic) {
        return
    }
    // 获取Topic分区列表
    partitions, err := e.client.Partitions(topic)
    // 采集分区数
    ch <- prometheus.MustNewConstMetric(
        topicPartitions, prometheus.GaugeValue, float64(len(partitions)), topic,
    )
    // 遍历分区采集Offset、副本数、ISR、欠复制等指标
    for _, partition := range partitions {
        // 最新Offset
        currentOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
        // 最旧Offset
        oldestOffset, err := e.client.GetOffset(topic, partition, sarama.OffsetOldest)
        // 副本数
        replicas, err := e.client.Replicas(topic, partition)
        // ISR数
        inSyncReplicas, err := e.client.InSyncReplicas(topic, partition)
        // 欠复制判断(ISR数 < 副本数)
        if len(inSyncReplicas) < len(replicas) {
            ch <- prometheus.MustNewConstMetric(
                topicUnderReplicatedPartition, prometheus.GaugeValue, 1, topic, strconv.FormatInt(int64(partition), 10),
            )
        }
        // 其他Topic指标采集...
    }
}

// 协程池控制:启动N个协程处理Topic
N := minx(len(topics)/2, e.topicWorkers) // 协程数 = min(Topic数/2, 配置的worker数)
for w := 1; w <= N; w++ {
    go loopTopics() // loopTopics:循环从topicChannel取出Topic并调用getTopicMetrics(定义见源码,此处省略)
}
// 分发Topic到协程
for _, topic := range topics {
    wg.Add(1)
    topicChannel <- topic
}

这里有个重要的优化点:

        • 正则预编译:topicFilter/topicExclude在 Exporter 初始化时预编译,避免每次采集重复编译
        • 协程数控制:通过topicWorkers参数限制协程数,默认 100,可根据集群大小调整

4.1.3、消费组指标采集(核心:Lag 计算)

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L565

消费组 Lag(消费延迟)是 Kafka 监控的核心指标,计算公式为:Lag = Topic分区最新Offset - 消费组当前Offset

getConsumerGroupMetrics := func(broker *sarama.Broker) {
    defer wg.Done()
    // 连接Broker(消费组信息需要直接连接Broker)
    if err := broker.Open(e.client.Config()); err != nil && err != sarama.ErrAlreadyConnected {
        return
    }
    defer broker.Close()

    // 列出所有消费组并过滤
    groups, err := broker.ListGroups(&sarama.ListGroupsRequest{})
    groupIds := make([]string, 0)
    for groupId := range groups.Groups {
        if e.groupFilter.MatchString(groupId) && !e.groupExclude.MatchString(groupId) {
            groupIds = append(groupIds, groupId)
        }
    }

    // 通过DescribeGroups获取消费组详情(衔接上一步过滤出的groupIds)
    describeGroups, err := broker.DescribeGroups(&sarama.DescribeGroupsRequest{Groups: groupIds})
    if err != nil {
        return
    }

    // 遍历消费组采集指标
    for _, group := range describeGroups.Groups {
        // 构建OffsetFetchRequest获取消费组Offset
        offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: e.fetchOffsetVersion()}
        // 获取消费组Offset响应
        offsetFetchResponse, err := broker.FetchOffset(&offsetFetchRequest)

        // 遍历消费组的Topic/分区计算Lag
        for topic, partitions := range offsetFetchResponse.Blocks {
            var lagSum int64
            for partition, offsetBlock := range partitions {
                // 消费组当前Offset
                currentOffset := offsetBlock.Offset
                // Topic分区最新Offset
                currentPartitionOffset, _ := e.client.GetOffset(topic, partition, sarama.OffsetNewest)
                // 计算Lag
                lag := currentPartitionOffset - currentOffset
                lagSum += lag

                // 暴露Lag指标
                ch <- prometheus.MustNewConstMetric(
                    consumergroupLag, prometheus.GaugeValue, float64(lag), group.GroupId, topic, strconv.FormatInt(int64(partition), 10),
                )
            }
            // 暴露Topic级别的Lag总和
            ch <- prometheus.MustNewConstMetric(
                consumergroupLagSum, prometheus.GaugeValue, float64(lagSum), group.GroupId, topic,
            )
        }
    }
}

兼容适配:

        • ZooKeeper 模式:兼容老版本 Kafka(0.8.x),从 ZooKeeper 读取消费组 Offset
        • 新版本模式:使用 Kafka 2.0 + 的ListGroups/DescribeGroups/FetchOffset API,性能更好

4.1.4、认证模块:多场景适配

Kafka Exporter 支持多种认证方式,核心通过 sarama.Config 配置实现:

4.1.4.1、SASL/PLAIN 认证

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L219

config.Net.SASL.Enable = true
config.Net.SASL.User = opts.saslUsername
config.Net.SASL.Password = opts.saslPassword
config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypePlaintext)

4.1.4.2、AWS MSK IAM 认证

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L209

config.Net.SASL.Mechanism = sarama.SASLMechanism(sarama.SASLTypeOAuth)
config.Net.SASL.TokenProvider = &MSKAccessTokenProvider{region: opts.saslAwsRegion}

4.1.4.3、TLS 认证

https://github.com/danielqsj/kafka_exporter/blob/master/kafka_exporter.go#L232

config.Net.TLS.Enable = true
config.Net.TLS.Config = &tls.Config{
    ServerName:         opts.tlsServerName,
    InsecureSkipVerify: opts.tlsInsecureSkipTLSVerify, // 生产环境禁用
}
// 加载CA证书(读取失败时不应静默忽略)
if opts.tlsCAFile != "" {
    if ca, err := os.ReadFile(opts.tlsCAFile); err == nil {
        config.Net.TLS.Config.RootCAs = x509.NewCertPool()
        config.Net.TLS.Config.RootCAs.AppendCertsFromPEM(ca)
    } else {
        klog.Fatalln(err)
    }
}

下班了,后面有空再说....

posted @ 2025-12-30 18:08  左扬