OpenTelemetry Observability in Practice: Unifying Metrics, Logs, and Traces

Introduction

"Observability" has been talked to death over the past couple of years, but the reality in many teams looks like this: Prometheus handles metrics, ELK handles logs, Jaeger handles traces, and the three systems live in separate silos. When something breaks, you end up hopping between three different UIs.

Last year we started rolling out OpenTelemetry (OTel) with the goal of unifying our data collection. After more than half a year of work, we finally connected the three pillars (metrics, logs, and traces).

This post shares our rollout experience: the architecture, the pitfalls we hit, and the end result.

Why OpenTelemetry

First, the starting point:



+-------------+     +-------------+     +-------------+
|  Prometheus |     |  ELK Stack  |     |   Jaeger    |
+------+------+     +------+------+     +------+------+
       |                   |                   |
       v                   v                   v
 Metrics SDKs          Log agents         Tracing SDKs
(various exporters) (Filebeat/Fluentd)  (Jaeger client)

The problems are obvious:

  1. Fragmented stack: three collection pipelines and three data formats
  2. Broken context: when an alert fires, there is no direct path to the matching logs and traces
  3. High maintenance cost: every language has to integrate three separate SDKs

This is exactly the problem OpenTelemetry solves, by unifying the collection standard:


+------------------+
|   OpenTelemetry  |
|    Collector     |
+--------+---------+
         |
   unified format
   (OTLP protocol)
         |
+--------+---------+
|     OTel SDK     |
| (Metrics + Logs  |
|  + Traces in one)|
+------------------+

Architecture

Our final architecture:


                          +-----------------+
                          |     Grafana     |
                          |  (unified UI)   |
                          +--------+--------+
                                   |
         +----------------+--------+-------+----------------+
         |                |                |                |
         v                v                v                v
   +-----------+    +-----------+    +-----------+    +-----------+
   | Prometheus|    |   Loki    |    |   Tempo   |    |  Jaeger   |
   | (metrics) |    |  (logs)   |    | (traces)  |    | (fallback)|
   +-----------+    +-----------+    +-----------+    +-----------+
         ^                ^                ^                ^
         |                |                |                |
         +----------------+--------+-------+----------------+
                                   |
                         +---------+---------+
                         |  OTel Collector   |
                         |  (gateway mode)   |
                         +---------+---------+
                                   ^
                                   | OTLP
                  +----------------+----------------+
                  |                |                |
            +-----+-----+    +-----+-----+    +-----+-----+
            | Service A |    | Service B |    | Service C |
            | (OTel SDK)|    | (OTel SDK)|    | (OTel SDK)|
            +-----------+    +-----------+    +-----------+


The core ideas:

  1. Applications integrate the OTel SDK and report data over the OTLP protocol (see the sketch after this list)
  2. The Collector acts as a gateway that receives, processes, and routes everything
  3. Storage backends are replaceable, with no vendor lock-in
  4. Grafana is the single pane of glass, with metrics, logs, and traces cross-linked
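
As a minimal sketch of ideas 1 and 2: because every OTel SDK understands the standard OTel environment variables, pointing a service at the Collector gateway usually needs no code changes at all. The fragment below is hypothetical (service name and image are made up); the variable names come from the OpenTelemetry specification.

# Hypothetical docker-compose fragment: point a service's SDK at the Collector
services:
  service-a:
    image: service-a:latest
    environment:
      - OTEL_SERVICE_NAME=service-a
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      - OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production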

Deploying the Collector

The OpenTelemetry Collector is the core component; it receives, processes, and exports the data.

Docker deployment

# docker-compose.yml
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.92.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector自身指标
      - "8889:8889"   # Prometheus exporter
    restart: unless-stopped

Collector configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  # Also accept Prometheus format (compatible with existing monitoring)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8888']

processors:
  # Batch to reduce network overhead
  batch:
    timeout: 5s
    send_batch_size: 1000
  
  # Memory limit to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  
  # Add common attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # Metrics -> Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  
  # Traces -> Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  
  # Logs -> Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      attributes:
        service.name: "service_name"
        level: "severity"
  
  # For debugging
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]

Instrumenting applications

Go services

package main

import (
    "context"
    "log"
    "net/http"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
    "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("user-service"),
        semconv.ServiceVersion("1.0.0"),
        attribute.String("environment", "production"),
    )

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(res),
        trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling rate
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func initMeter() (*metric.MeterProvider, error) {
    exporter, err := otlpmetricgrpc.New(
        context.Background(),
        otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
        otlpmetricgrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    mp := metric.NewMeterProvider(
        metric.WithReader(metric.NewPeriodicReader(exporter, metric.WithInterval(15*time.Second))),
    )
    otel.SetMeterProvider(mp)
    return mp, nil
}

func main() {
    tp, _ := initTracer()
    defer tp.Shutdown(context.Background())

    mp, _ := initMeter()
    defer mp.Shutdown(context.Background())

    tracer := otel.Tracer("user-service")
    meter := otel.Meter("user-service")

    // Create metric instruments
    requestCounter, _ := meter.Int64Counter("http_requests_total")
    requestDuration, _ := meter.Float64Histogram("http_request_duration_seconds")

    http.HandleFunc("/api/user", func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "GetUser")
        defer span.End()

        start := time.Now()

        // Business logic
        span.SetAttributes(attribute.String("user.id", r.URL.Query().Get("id")))

        // Simulate a database query as a child span
        _, dbSpan := tracer.Start(ctx, "DB.Query")
        time.Sleep(50 * time.Millisecond)
        dbSpan.End()

        // Record metrics; attributes must be wrapped in otelmetric.WithAttributes
        requestCounter.Add(ctx, 1,
            otelmetric.WithAttributes(attribute.String("method", r.Method)))
        requestDuration.Record(ctx, time.Since(start).Seconds(),
            otelmetric.WithAttributes(attribute.String("method", r.Method)))

        w.Write([]byte(`{"name": "test"}`))
    })

    log.Println("Server starting on :8080")
    http.ListenAndServe(":8080", nil)
}

Java services

For Java, the agent-based approach is more convenient because it requires no code changes:

# Download the agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.1.0/opentelemetry-javaagent.jar

# Add the agent flags at startup
java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
     -Dotel.traces.sampler=traceidratio \
     -Dotel.traces.sampler.arg=0.1 \
     -jar order-service.jar

Auto-instrumentation covers HTTP requests, database calls, Redis, Kafka, and more, out of the box.
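
If the services run in containers and you would rather not edit start commands, one option is to attach the agent through the JVM's JAVA_TOOL_OPTIONS variable together with the standard OTel environment variables. A hypothetical compose-style fragment (image name, mount, and paths are made up):

# Hypothetical fragment: attach the Java agent without touching the start command
services:
  order-service:
    image: order-service:latest
    volumes:
      - ./opentelemetry-javaagent.jar:/otel/opentelemetry-javaagent.jar
    environment:
      - JAVA_TOOL_OPTIONS=-javaagent:/otel/opentelemetry-javaagent.jar
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_TRACES_SAMPLER=traceidratio
      - OTEL_TRACES_SAMPLER_ARG=0.1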

Python services

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "payment-service"})

# Configure the tracer
trace_provider = TracerProvider(resource=resource)
trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# Configure the meter
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=15000
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("payment-service")
meter = metrics.get_meter("payment-service")

# Usage
@tracer.start_as_current_span("process_payment")
def process_payment(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Business logic...

Correlating Metrics, Logs, and Traces

This is the most valuable part of OTel: linking the three pillars together.

Injecting the trace ID into logs

import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

func LogWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
    span := trace.SpanFromContext(ctx)
    if span.SpanContext().IsValid() {
        return logger.With(
            zap.String("trace_id", span.SpanContext().TraceID().String()),
            zap.String("span_id", span.SpanContext().SpanID().String()),
        )
    }
    return logger
}

// Usage
func handleRequest(ctx context.Context) {
    logger := LogWithTrace(ctx, zap.L())
    logger.Info("Processing request", zap.String("user_id", "123"))
}

With the trace_id in each log line, Grafana can jump from a log entry straight to the corresponding trace.

Exemplar correlation

Prometheus 2.25+ supports exemplars, which link a metric sample to a trace ID:

// Record with the request context; when that context carries a sampled span,
// the SDK can attach its trace ID to this sample as an exemplar
requestDuration.Record(ctx, duration,
    otelmetric.WithAttributes(attribute.String("method", "GET")),
)

When a metric looks abnormal in Grafana, you can jump straight to a concrete trace.
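
For that jump to work, two more pieces need wiring besides the SDK. The sketch below shows the general shape, assuming the Collector's prometheus exporter supports the enable_open_metrics flag (exemplars are only carried in the OpenMetrics format) and that Grafana provisioning accepts exemplarTraceIdDestinations; the exemplar label name may differ by exporter version, and Prometheus itself must run with --enable-feature=exemplar-storage.

# Sketch: Collector side, expose exemplars over OpenMetrics
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    enable_open_metrics: true

# Sketch: Grafana side, tell the Prometheus data source where trace IDs point
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo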

Grafana configuration

Data source provisioning

# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  # explicit uids so the datasourceUid cross-references below resolve
  - name: Prometheus
    uid: prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Tempo
    uid: tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
        mappedTags: [{ key: 'service.name', value: 'service_name' }]
        mapTagNamesEnabled: true

  - name: Loki
    uid: loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'

The result

Once this is configured, troubleshooting looks like this:

  1. Prometheus alert: a service's P99 latency spikes
  2. Click the exemplar: jump to the trace of a specific slow request
  3. Inspect the trace in Tempo: a DB query is abnormally slow
  4. Jump from the trace to the logs: see the exact SQL and error message

The whole chain is connected end to end, and the efficiency gain is substantial.

Production experience

Sampling strategy

Collecting every trace is unrealistic, so set a sampling rate:

// Head-based sampling in the SDK: root spans are sampled at 10%,
// child spans follow their parent's decision
trace.NewTracerProvider(
    trace.WithSampler(trace.ParentBased(
        trace.TraceIDRatioBased(0.1), // 10% of normal requests
    )),
)

A smarter approach is tail-based sampling in the Collector, which can always keep errors and slow requests:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      # Randomly sample the rest
      - name: randomized
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

Resource limits

The Collector itself also needs monitoring and limits:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

extensions:
  health_check:
    endpoint: :13133
  zpages:
    endpoint: :55679  # debug pages

service:
  extensions: [health_check, zpages]  # extensions only run when enabled here

Multi-cluster management

We run three Kubernetes clusters, each with its own Collector. To manage them, I use 星空组网 to connect the three clusters' private networks so that a single Grafana can query data from all of them; running a separate Grafana per cluster would cost far too much to operate.
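
To tell the clusters apart in that single Grafana, each cluster's Collector can stamp its identity onto everything it forwards. A minimal sketch using the same resource processor as in the config above (the cluster name is a placeholder set per cluster):

# Per-cluster Collector: tag all telemetry with its source cluster
processors:
  resource/cluster:
    attributes:
      - key: k8s.cluster.name
        value: cluster-a        # placeholder, different in each cluster
        action: upsert

The named processor then goes into each pipeline's processors list, and the attribute becomes a label you can filter on in Grafana.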

Pitfalls

Pitfall 1: Collector memory blowing up

Right after going live, the Collector kept getting OOM-killed. The cause was the batch processor buffering too much data.

Fix: add a memory_limiter and shrink the batch size.
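
The numbers below sketch the direction of the change rather than exact values; tune limit_mib to the container's memory limit and keep memory_limiter first in every pipeline:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500            # stay below the container memory limit
    spike_limit_mib: 300
  batch:
    timeout: 2s
    send_batch_size: 512       # smaller batches, flushed more often
    send_batch_max_size: 1024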

Pitfall 2: inconsistent SDK versions

Different services were on different OTel SDK versions, which led to subtle differences in the data they emitted.

Fix: standardize on one SDK version, and use the Collector's transform processor to smooth over the remaining differences.
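
A sketch of the Collector half, assuming the mismatch is something simple like an attribute that older SDK versions never set; the OTTL statement is illustrative only:

processors:
  transform/compat:
    trace_statements:
      - context: resource
        statements:
          # backfill an attribute that older SDK versions did not emit
          - set(attributes["deployment.environment"], "production") where attributes["deployment.environment"] == nil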

Pitfall 3: excessive log volume

OTel log collection ships everything by default, and Loki couldn't keep up.

Fix: filter at the application layer and only ship ERROR and above, or use the filter processor in the Collector:

processors:
  filter:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "INFO"]

Summary

What OpenTelemetry changed for us:

  1. One standard: a single SDK covers all three pillars
  2. Correlated data: jump from metrics to traces to logs with one click
  3. Vendor neutrality: storage backends can be swapped at any time
  4. Active community: official support for all mainstream languages and frameworks

The rollout cost is not small, but the long-term payoff is clear. Especially when troubleshooting production issues, being able to quickly pinpoint the offending code is a very real efficiency gain.

For new projects, use OTel from the start. Legacy projects can migrate gradually: wire in the Collector first, then replace each service's SDK over time.
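
The Collector makes the gradual path easier because it can ingest legacy formats alongside OTLP while services are still on their old SDKs. A hedged sketch (these receivers exist in the contrib distribution; endpoints are left at their defaults):

# During migration: accept legacy trace formats alongside OTLP
receivers:
  otlp:
    protocols:
      grpc:
      http:
  jaeger:
    protocols:
      grpc:
      thrift_http:
  zipkin:

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]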

