第08章-运维监控与故障排除

第八章:运维监控与故障排除

8.1 引言

在生产环境中运行 GeoServer Cloud,有效的监控和快速的故障排除能力至关重要。本章将全面介绍 GeoServer Cloud 的运维监控体系,包括日志管理、指标监控、告警配置以及常见故障的诊断和解决方法。

一个健康的 GeoServer Cloud 系统需要持续的监控和及时的维护。通过本章的学习,您将掌握建立完善监控体系的方法,以及在问题发生时快速定位和解决问题的技能。

8.2 日志管理

8.2.1 日志配置

GeoServer Cloud 使用 Spring Boot 的日志框架,支持多种日志输出配置:

# application.yml 日志配置
logging:
  level:
    root: INFO
    org.geoserver: INFO
    org.geotools: WARN
    org.springframework.cloud: INFO
    org.springframework.security: INFO
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
    file: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
  file:
    name: /var/log/geoserver/application.log
    max-size: 100MB
    max-history: 30

8.2.2 日志级别说明

级别 说明 使用场景
ERROR 错误信息 需要立即关注的问题
WARN 警告信息 潜在问题,需要注意
INFO 一般信息 生产环境默认级别
DEBUG 调试信息 开发和问题排查
TRACE 详细追踪 深度调试

8.2.3 动态调整日志级别

运行时调整日志级别无需重启:

# 查看当前日志级别
curl http://localhost:8080/actuator/loggers/org.geoserver

# 设置日志级别为 DEBUG
curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"configuredLevel": "DEBUG"}' \
    http://localhost:8080/actuator/loggers/org.geoserver

# 恢复默认级别
curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"configuredLevel": null}' \
    http://localhost:8080/actuator/loggers/org.geoserver

8.2.4 结构化日志

配置 JSON 格式日志便于集中收集:

logging:
  pattern:
    console: >
      {"timestamp":"%d{ISO8601}","level":"%level","thread":"%thread",
       "logger":"%logger{36}","message":"%msg","exception":"%ex"}%n

8.2.5 集中日志收集

使用 ELK Stack

# filebeat.yml 配置
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata: ~
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "geoserver-cloud-%{+yyyy.MM.dd}"

使用 Loki + Grafana

# promtail 配置
scrape_configs:
  - job_name: geoserver-cloud
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: service
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
      - labels:
          level:

8.3 指标监控

8.3.1 暴露 Prometheus 指标

确保 Actuator 端点已配置:

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: when_authorized
  metrics:
    tags:
      application: ${spring.application.name}

8.3.2 关键监控指标

JVM 指标

指标 说明 告警阈值建议
jvm_memory_used_bytes 内存使用 > 80% 最大堆内存
jvm_gc_pause_seconds GC 暂停时间 > 1s
jvm_threads_live 活跃线程数 > 500

HTTP 指标

指标 说明 告警阈值建议
http_server_requests_seconds_count 请求计数 根据基线
http_server_requests_seconds_sum 响应时间总和 -
http_server_requests_seconds_max 最大响应时间 > 10s

数据库连接池

指标 说明 告警阈值建议
hikaricp_connections_active 活跃连接数 > 80% 最大连接
hikaricp_connections_pending 等待连接数 > 0(持续)
hikaricp_connections_timeout_total 超时计数 > 0

8.3.3 Prometheus 查询示例

# 请求速率(每秒请求数)
rate(http_server_requests_seconds_count{application="wms-service"}[5m])

# 响应时间 P95
histogram_quantile(0.95, 
  rate(http_server_requests_seconds_bucket{application="wms-service"}[5m])
)

# 错误率
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
sum(rate(http_server_requests_seconds_count[5m])) * 100

# JVM 堆内存使用率
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} * 100

# 数据库连接池使用率
hikaricp_connections_active / hikaricp_connections_max * 100

8.3.4 Grafana Dashboard

创建 GeoServer Cloud 监控面板:

{
  "dashboard": {
    "title": "GeoServer Cloud Monitoring",
    "panels": [
      {
        "title": "Service Availability",
        "type": "stat",
        "targets": [{
          "expr": "up{job=\"geoserver-cloud\"}"
        }]
      },
      {
        "title": "Request Rate by Service",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(http_server_requests_seconds_count[5m])) by (application)",
          "legendFormat": "{{application}}"
        }]
      },
      {
        "title": "Response Time P95",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application))",
          "legendFormat": "{{application}}"
        }]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[5m])) by (application)",
          "legendFormat": "{{application}}"
        }]
      },
      {
        "title": "JVM Memory Usage",
        "type": "graph",
        "targets": [{
          "expr": "jvm_memory_used_bytes{area=\"heap\"}",
          "legendFormat": "{{application}} - {{instance}}"
        }]
      },
      {
        "title": "Database Connections",
        "type": "graph",
        "targets": [{
          "expr": "hikaricp_connections_active",
          "legendFormat": "{{application}} - active"
        }, {
          "expr": "hikaricp_connections_max",
          "legendFormat": "{{application}} - max"
        }]
      }
    ]
  }
}

8.4 健康检查

8.4.1 健康端点配置

management:
  endpoint:
    health:
      show-details: when_authorized
      probes:
        enabled: true
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
    db:
      enabled: true
    rabbit:
      enabled: true

8.4.2 健康检查端点

端点 用途 说明
/actuator/health 整体健康状态 包含所有组件
/actuator/health/liveness 存活探针 容器是否需要重启
/actuator/health/readiness 就绪探针 是否可以接收流量

8.4.3 自定义健康检查

@Component
public class GeoServerHealthIndicator implements HealthIndicator {
    
    @Autowired
    private Catalog catalog;
    
    @Override
    public Health health() {
        try {
            // 检查目录是否可访问
            int workspaces = catalog.getWorkspaces().size();
            return Health.up()
                .withDetail("workspaces", workspaces)
                .build();
        } catch (Exception e) {
            return Health.down()
                .withException(e)
                .build();
        }
    }
}

8.4.4 Kubernetes 探针配置

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30

8.5 告警配置

8.5.1 Prometheus 告警规则

# prometheus-rules.yaml
groups:
  - name: geoserver-cloud
    rules:
      # 服务不可用
      - alert: ServiceDown
        expr: up{job="geoserver-cloud"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.application }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."
      
      # 高错误率
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
          / sum(rate(http_server_requests_seconds_count[5m])) by (application) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.application }}"
          description: "Error rate is {{ $value | printf \"%.2f\" }}%"
      
      # 响应时间过长
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application)
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.application }}"
          description: "P95 response time is {{ $value | printf \"%.2f\" }}s"
      
      # 内存使用过高
      - alert: HighMemoryUsage
        expr: |
          jvm_memory_used_bytes{area="heap"} 
          / jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.application }}"
          description: "Heap memory usage is {{ $value | printf \"%.0f\" }}%"
      
      # 数据库连接池耗尽
      - alert: DatabaseConnectionPoolExhausted
        expr: hikaricp_connections_pending > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool exhausted"
          description: "{{ $labels.application }} has {{ $value }} pending connections"

8.5.2 Alertmanager 配置

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  receiver: 'team-email'
  group_by: ['alertname', 'application']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'team-pagerduty'
    - match:
        severity: warning
      receiver: 'team-slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        
  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#geoserver-alerts'
        
  - name: 'team-pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

8.6 故障排除

8.6.1 服务启动失败

问题:服务无法启动,反复重启

诊断步骤:

# 1. 查看容器日志
docker logs gscloud-wms-1 --tail 200

# 2. 检查错误信息
docker logs gscloud-wms-1 2>&1 | grep -i "error\|exception\|failed"

# 3. 检查依赖服务
docker compose ps
docker compose exec wms nc -zv database 5432
docker compose exec wms nc -zv rabbitmq 5672

# 4. 检查配置
docker compose exec wms env | sort

常见原因和解决方案:

错误信息 可能原因 解决方案
Connection refused: database 数据库未就绪 等待数据库启动或检查网络
OutOfMemoryError 内存不足 增加 JVM 堆内存
Config server not available 配置服务未就绪 检查 Config 服务状态
Eureka client failed 服务发现失败 检查 Discovery 服务

8.6.2 性能问题

问题:响应时间过长

诊断步骤:

# 1. 检查系统资源
docker stats

# 2. 查看线程转储
curl http://localhost:8080/actuator/threaddump > threaddump.txt

# 3. 查看堆转储(谨慎使用)
curl -X POST http://localhost:8080/actuator/heapdump > heapdump.hprof

# 4. 检查数据库慢查询
docker compose exec database psql -U geoserver -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"

优化建议:

症状 可能原因 优化措施
CPU 使用率高 复杂渲染/大量请求 增加实例数/优化样式
内存使用率高 大量数据加载 增加内存/优化查询
数据库连接等待 连接池不足 增加连接池大小
GC 频繁 内存分配模式 调整 GC 参数

8.6.3 数据库连接问题

问题:无法连接数据库

# 1. 测试网络连通性
docker compose exec wms nc -zv database 5432

# 2. 测试数据库登录
docker compose exec database psql -U geoserver -d geoserver -c "SELECT 1"

# 3. 检查连接数
docker compose exec database psql -U geoserver -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'geoserver';"

# 4. 检查连接池状态
curl http://localhost:8080/actuator/metrics/hikaricp.connections.active
curl http://localhost:8080/actuator/metrics/hikaricp.connections.pending

8.6.4 事件总线问题

问题:配置变更不同步

# 1. 检查 RabbitMQ 连接
docker compose exec wms nc -zv rabbitmq 5672

# 2. 查看 RabbitMQ 队列
docker compose exec rabbitmq rabbitmqctl list_queues

# 3. 查看消费者
docker compose exec rabbitmq rabbitmqctl list_consumers

# 4. 手动触发配置刷新
curl -X POST http://localhost:8080/actuator/bus-refresh

8.6.5 常用诊断命令

# 服务状态检查
curl http://localhost:8080/actuator/health | jq

# 查看配置属性
curl http://localhost:8080/actuator/configprops | jq

# 查看环境变量
curl http://localhost:8080/actuator/env | jq

# 查看 Bean 列表
curl http://localhost:8080/actuator/beans | jq

# 查看 HTTP 追踪
curl http://localhost:8080/actuator/httptrace | jq

# 查看指标
curl http://localhost:8080/actuator/metrics | jq
curl http://localhost:8080/actuator/metrics/http.server.requests | jq

8.7 备份与恢复

8.7.1 数据库备份

#!/bin/bash
# backup-pgconfig.sh

BACKUP_DIR="/backup/geoserver"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/pgconfig_${DATE}.sql.gz"

# 创建备份
docker compose exec -T database pg_dump -U geoserver geoserver | gzip > ${BACKUP_FILE}

# 保留最近 30 天的备份
find ${BACKUP_DIR} -name "pgconfig_*.sql.gz" -mtime +30 -delete

echo "Backup completed: ${BACKUP_FILE}"

8.7.2 数据库恢复

#!/bin/bash
# restore-pgconfig.sh

BACKUP_FILE=$1

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file>"
    exit 1
fi

# 停止所有 GeoServer 服务
docker compose stop wms wfs wcs rest webui gwc

# 恢复数据库
gunzip -c ${BACKUP_FILE} | docker compose exec -T database psql -U geoserver geoserver

# 重启服务
docker compose start wms wfs wcs rest webui gwc

echo "Restore completed from: ${BACKUP_FILE}"

8.7.3 自动化备份(Kubernetes)

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgconfig-backup
  namespace: geoserver
spec:
  schedule: "0 2 * * *"  # 每天凌晨2点
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h postgresql -U geoserver geoserver | gzip > /backup/pgconfig_$(date +%Y%m%d).sql.gz
              envFrom:
                - secretRef:
                    name: postgresql-secret
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure

8.8 版本升级

8.8.1 升级前检查

# 1. 检查当前版本
docker compose exec wms java -jar app.jar --version

# 2. 查看发布说明
# https://github.com/geoserver/geoserver-cloud/releases

# 3. 备份数据
./backup-pgconfig.sh

# 4. 测试环境验证
docker compose -f compose-test.yml up -d

8.8.2 滚动升级

# 1. 更新镜像版本
sed -i 's/TAG=2.28.1.0/TAG=2.28.2.0/' .env

# 2. 拉取新镜像
docker compose pull

# 3. 滚动更新(逐个服务)
for service in wms wfs wcs rest webui gwc gateway; do
    docker compose up -d --no-deps $service
    sleep 30
    # 等待服务健康
    until docker compose exec $service curl -sf http://localhost:8080/actuator/health; do
        sleep 5
    done
    echo "$service upgraded successfully"
done

8.8.3 版本回滚

# 1. 恢复版本号
sed -i 's/TAG=2.28.2.0/TAG=2.28.1.0/' .env

# 2. 重新部署
docker compose up -d

# 3. 如需恢复数据库
./restore-pgconfig.sh /backup/pgconfig_before_upgrade.sql.gz

8.9 本章小结

本章全面介绍了 GeoServer Cloud 的运维监控和故障排除:

  1. 日志管理:配置日志级别、结构化日志和集中收集。

  2. 指标监控:配置 Prometheus 指标、关键监控项和 Grafana 面板。

  3. 健康检查:配置健康端点和 Kubernetes 探针。

  4. 告警配置:设置 Prometheus 告警规则和通知。

  5. 故障排除:常见问题的诊断和解决方法。

  6. 备份恢复:数据库备份、恢复和自动化。

  7. 版本升级:升级流程和回滚策略。

在下一章中,我们将学习 GeoServer Cloud 的开发扩展和定制。

8.10 思考题

  1. 在分布式系统中,如何实现日志的关联追踪(Distributed Tracing)?

  2. 应该监控哪些业务指标来评估 GeoServer Cloud 的服务质量?

  3. 如何设计一个有效的告警策略,避免告警疲劳?

  4. 在进行数据库升级时,如何确保数据的一致性和完整性?

  5. 如何实现 GeoServer Cloud 的灾难恢复方案?

posted @ 2025-11-29 13:41  我才是银古  阅读(3)  评论(0)    收藏  举报