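MonkeyCode Monitoring and Alerting: Keeping an AI Coding Service Running Reliably
Introduction
"A system without monitoring is a black box. For an AI coding assistant, a core tool that directly affects development efficiency, monitoring is not optional; it is a matter of survival."
Once MonkeyCode graduates from a "personal toy" to "team infrastructure", its stability directly determines the productivity of the whole engineering organization. A single 30-minute API outage can leave hundreds of developers waiting, and that hidden cost is easy to underestimate.
This article walks through how to build an enterprise-grade monitoring and alerting system for MonkeyCode so that the AI coding service stays stable around the clock.
1. Why does an AI coding service need dedicated monitoring?
1.1 Challenges unique to AI services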
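Traditional web service vs. AI inference service: how monitoring differs
| Dimension | 🌐 Traditional web service | 🤖 AI inference service (MonkeyCode) |
|---|---|---|
| Response time | Usually < 100ms | 500ms to 60s, depending on model size and request complexity |
| Resource consumption | Mostly CPU; memory is relatively stable | Mostly GPU memory, with large swings |
| Failure modes | Explicit (HTTP status codes, exception stack traces) | Ambiguous (degraded output quality is not the same as a service failure) |
| Capacity planning | Roughly linear estimates based on QPS | Non-linear (sequence length and batch size have a large effect) |
| Monitoring maturity | Very mature (Prometheus + Grafana ecosystem) | Still evolving (custom metrics are required) |
💡 Conclusion: traditional monitoring setups cannot be applied as-is; the design has to be tailored to inference workloads.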
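1.2 The four goals of MonkeyCode monitoring
| Goal | Description | Key metrics |
|---|---|---|
| Availability | Keep the service reachable at all times | Availability ≥ 99.9%, MTTR < 5 min |
| Performance | Keep response times within a reasonable range | P50 < 2s, P99 < 10s |
| Quality | Make sure generated code quality does not regress | Acceptance rate > 70%, modification rate < 30% |
| Cost control | Maximize resource efficiency | GPU utilization > 60%, cost-per-request trend |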
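
To make the availability goal concrete, it helps to translate the percentage into an error budget. The sketch below is only an illustration: the function names and the 30-day window are assumptions, not part of MonkeyCode. It shows that a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is only a handful of incidents even at a 5-minute MTTR.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of downtime that an availability target permits over the window."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability)


def incidents_within_budget(availability: float, mttr_minutes: float,
                            window_days: int = 30) -> int:
    """How many incidents of length `mttr_minutes` fit into the error budget."""
    budget = allowed_downtime_minutes(availability, window_days)
    return int(budget // mttr_minutes)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print("incidents at a 5-minute MTTR:", incidents_within_budget(0.999, 5))
```

Read this way, the MTTR target and the availability target constrain each other: slower recovery means fewer incidents can be tolerated before the budget is spent.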
2. Monitoring architecture overview
2.1 The MonkeyCode monitoring landscape
monkeycode_monitoring_architecture:
  layers:
    # ===== Layer 1: data collection =====
    data_collection:
      name: "Data collection layer"
      components:
        - name: "Metrics Collector"
          tech: "Prometheus Client + Custom Exporter"
          collect_interval: "15s"
          metrics_count: "~200 core metrics"
        - name: "Log Collector"
          tech: "Fluentd / Vector"
          log_sources:
            - "Application logs (structured JSON)"
            - "Access logs (Nginx / gateway)"
            - "GPU logs (nvidia-smi / DCGM)"
            - "Model inference logs (vLLM / TGI)"
        - name: "Trace Collector"
          tech: "OpenTelemetry + Jaeger"
          trace_sampling: "1% (tunable in production)"
        - name: "Event Collector"
          tech: "Custom event bus"
          events:
            - "Model load / unload"
            - "User authentication success / failure"
            - "Rate limiting triggered"
            - "Fallback (degradation) switch"
    # ===== Layer 2: storage =====
    storage:
      metrics: "Prometheus (TSDB) with 15 days of hot data, plus VictoriaMetrics for long-term storage"
      logs: "Elasticsearch (8.x) + S3 cold archive"
      traces: "Jaeger / ClickHouse"
      events: "Kafka → Flink real-time processing → ClickHouse"
    # ===== Layer 3: processing =====
    processing:
      real_time_alerting:
        # Rules are evaluated by Prometheus; Alertmanager routes the resulting alerts
        engine: "Prometheus alerting rules + Alertmanager + custom rule engine"
        evaluation_interval: "10s"
      analytics:
        batch: "Apache Spark (daily reports)"
        stream: "Apache Flink (real-time aggregation)"
      anomaly_detection:
        methods:
          - "Statistical thresholds (3-sigma)"
          - "Time-series forecasting (Prophet)"
          - "Machine-learning anomaly detection (Isolation Forest)"
    # ===== Layer 4: presentation and alerting =====
    presentation:
      dashboards:
        - name: "Global overview"
          url: "/d/monkeycode-overview"
          refresh: "30s"
        - name: "Performance details"
          url: "/d/monkeycode-performance"
        - name: "Resource monitoring"
          url: "/d/monkeycode-resources"
        - name: "Business quality"
          url: "/d/monkeycode-quality"
        - name: "Alert center"
          url: "/d/monkeycode-alerts"
      alert_channels:
        - "DingTalk bot (P0/P1)"
        - "WeCom (P2)"
        - "Email (daily digest)"
        - "SMS (P0, on-call)"
        - "Phone call (P0, emergency)"
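
Of the anomaly-detection methods listed in the processing layer, the 3-sigma threshold is the simplest to reason about. The following is a minimal sketch, assuming a plain list of samples and a sliding window of the previous 20 points; the function name and window size are illustrative choices, not MonkeyCode's actual detector.

```python
from statistics import mean, stdev


def three_sigma_anomalies(values, window=20, k=3.0):
    """Return indices whose value lies more than k standard deviations
    away from the mean of the preceding `window` samples."""
    if window < 2:
        raise ValueError("window must contain at least 2 samples")
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Steady latencies around 1s followed by a single 9s spike.
    latencies = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95] * 4 + [9.0]
    print(three_sigma_anomalies(latencies))
```

A fixed multiple of the standard deviation works for roughly stationary signals; metrics with strong daily cycles are the case where the forecasting-based methods in the list become worthwhile.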
2.2 Technology stack
╔═════════════════════════════════════════════════════════════════════╗
║                     MonkeyCode Monitoring Stack                     ║
╠═════════════════════════════════════════════════════════════════════╣
║                                                                     ║
║  Collect                                                            ║
║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    ║
║  │ Prometheus  │ │OpenTelemetry│ │ Fluentd /   │ │ DCGM        │    ║
║  │ Client      │ │ SDK         │ │ Vector      │ │ Exporter    │    ║
║  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘    ║
║         │               │               │               │           ║
║  Store  ▼               ▼               ▼               ▼           ║
║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    ║
║  │ Prometheus  │ │ Jaeger      │ │Elasticsearch│ │ Prometheus  │    ║
║  │ (metrics)   │ │ (traces)    │ │ (logs)      │ │(GPU metrics)│    ║
║  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘    ║
║         │               │               │               │           ║
║  Alert  ▼               ▼               ▼               ▼           ║
║  ┌───────────────────────────────────────────────────────────────┐  ║
║  │                Prometheus rules + Alertmanager                │  ║
║  │    Rules → Severity → Routing → Inhibition → Notification     │  ║
║  └───────────────────────────────┬───────────────────────────────┘  ║
║                                  │                                  ║
║  Display                         ▼                                  ║
║  ┌───────────────────────────────────────────────────────────────┐  ║
║  │                      Grafana Dashboards                       │  ║
║  │    Overview | Performance | GPU | Quality | Cost | Alerts     │  ║
║  └───────────────────────────────────────────────────────────────┘  ║
╚═════════════════════════════════════════════════════════════════════╝
3. Core monitoring metrics in detail
3.1 Infrastructure-layer metrics
3.1.1 GPU monitoring (the most critical part)
gpu_monitoring_metrics:
  # Uses the NVIDIA DCGM (Data Center GPU Manager) exporter
  exporter: "dcgm-exporter"
  scrape_interval: "5s"  # GPU metrics need more frequent sampling
  key_metrics:
    # === Memory usage ===
    gpu_memory_used_bytes:
      description: "GPU memory in use (bytes)"
      labels: ["gpu_id", "model_name", "pod"]
      alert_rules:
        - condition: "> 90% of total for 5m"
          severity: warning
          message: "GPU memory usage above 90%; OOM is possible"
        - condition: "> 95% of total for 2m"
          severity: critical
          message: "GPU memory almost exhausted; investigate immediately!"
    gpu_memory_utilization:
      description: "GPU memory bandwidth utilization (%)"
      range: "0-100"
      healthy_range: "40-80%"
    # === Compute utilization ===
    gpu_utilization:
      description: "GPU compute utilization (%)"
      range: "0-100"
      healthy_range: "50-85%"
      alert_rules:
        - condition: "< 20% for 15m"
          severity: warning
          message: "GPU compute utilization is too low; resources are being wasted"
        - condition: "< 10% for 30m"
          severity: info
          message: "Consider shrinking the GPU pool or increasing concurrency"
    # === Temperature and power ===
    gpu_temperature:
      description: "GPU core temperature (°C)"
      critical_threshold: 85
      shutdown_threshold: 95
    gpu_power_draw_watts:
      description: "Current power draw (watts)"
      used_for: "Cost accounting and capacity planning"
    # === Inference-specific metrics ===
    inference_request_queue_length:
      description: "Inference request queue length"
      source: "vLLM / TGI metrics"
      alert_rules:
        - condition: "> 50 for 1m"
          severity: warning
          message: "Inference queue is backing up; response latency will rise"
        - condition: "> 100 for 30s"
          severity: critical
          message: "Severe backlog! Scale out or apply rate limiting"
    inference_tokens_per_second:
      description: "Tokens generated per second"
      used_for: "Throughput evaluation and comparison against the performance baseline"
    inference_time_to_first_token_p50:
      description: "Time to first token, P50 (TTFT)"
      target: "< 500ms"
    inference_time_per_output_token_p99:
      description: "Time per output token, P99 (TPOT)"
      target: "< 100ms"
3.1.2 GPU monitoring dashboard configuration
{
  "dashboard": {
    "title": "MonkeyCode GPU Monitor",
    "panels": [
      {
        "title": "GPU utilization overview",
        "type": "stat",
        "targets": [
          {"expr": "avg(DCGM_FI_DEV_GPU_UTIL{namespace=\"monkeycode\"})", "legendFormat": "Average utilization"},
          {"expr": "max(DCGM_FI_DEV_GPU_UTIL{namespace=\"monkeycode\"})", "legendFormat": "Peak"}
        ],
        "thresholds": [{"color": "green", "value": 0}, {"color": "yellow", "value": 80}, {"color": "red", "value": 95}]
      },
      {
        "title": "GPU memory usage",
        "type": "gauge",
        "targets": [
          {"expr": "sum(DCGM_FI_DEV_FB_USED{namespace=\"monkeycode\"}) / (sum(DCGM_FI_DEV_FB_USED{namespace=\"monkeycode\"}) + sum(DCGM_FI_DEV_FB_FREE{namespace=\"monkeycode\"})) * 100", "legendFormat": "Memory used (%)"}
        ],
        "max": 100,
        "thresholds": [70, 85, 95]
      },
      {
        "title": "Utilization trend per GPU",
        "type": "timeseries",
        "targets": [
          {"expr": "DCGM_FI_DEV_GPU_UTIL{namespace=\"monkeycode\"}", "legendFormat": "GPU {{gpu}}"}
        ]
      },
      {
        "title": "Inference latency (TTFT)",
        "type": "timeseries",
        "targets": [
          {"expr": "histogram_quantile(0.50, sum by (le) (rate(monkeycode_ttft_seconds_bucket[5m])))", "legendFormat": "P50"},
          {"expr": "histogram_quantile(0.99, sum by (le) (rate(monkeycode_ttft_seconds_bucket[5m])))", "legendFormat": "P99"}
        ]
      },
      {
        "title": "Token generation rate",
        "type": "stat",
        "unit": "tok/s",
        "targets": [
          {"expr": "sum(rate(monkeycode_tokens_generated_total[5m]))", "legendFormat": "Tokens per second"}
        ]
      }
    ]
  }
}
3.2 Application-layer metrics
# monkeycode_app_metrics.py
# Custom application-layer metric definitions for MonkeyCode
import time
from functools import wraps
from prometheus_client import Counter, Gauge, Histogram
# ===== Exceptions referenced by the decorator =====
# Placeholders so the module is self-contained; swap in the service's real exception types.
class RateLimitError(Exception):
    """Raised when a request is rejected by the rate limiter."""
class ContentFilterError(Exception):
    """Raised when the content filter blocks a request."""
    def __init__(self, filter_type, reason):
        super().__init__(f"{filter_type}: {reason}")
        self.filter_type = filter_type
        self.reason = reason
# ===== Request-level metrics =====
# Total requests, broken down by type
REQUEST_COUNT = Counter(
    'monkeycode_requests_total',
    'Total API requests',
    ['method', 'endpoint', 'status', 'model_name']
)
# Request latency histogram
REQUEST_LATENCY = Histogram(
    'monkeycode_request_duration_seconds',
    'Request latency in seconds',
    ['method', 'endpoint', 'model_name'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 30.0, 60.0]
)
# Time to first token (TTFT)
TIME_TO_FIRST_TOKEN = Histogram(
    'monkeycode_ttft_seconds',
    'Time to first token',
    ['model_name', 'prompt_length_category'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0]
)
# Generated tokens (take rate() of this counter for tokens per second)
TOKENS_GENERATED = Counter(
    'monkeycode_tokens_generated_total',
    'Total tokens generated',
    ['model_name', 'streaming']  # streaming="true"/"false"
)
# Input / output token counts
INPUT_TOKENS = Counter(
    'monkeycode_input_tokens_total',
    'Total input tokens processed',
    ['model_name']
)
OUTPUT_TOKENS = Counter(
    'monkeycode_output_tokens_total',
    'Total output tokens generated',
    ['model_name']
)
# ===== Business quality metrics =====
# Acceptance rate (share of suggestions the user accepts)
ACCEPTANCE_RATE = Gauge(
    'monkeycode_acceptance_rate',
    'Code suggestion acceptance rate (rolling 1h)',
    ['model_name', 'language']
)
# Average number of modifications before the user is satisfied
AVG_MODIFICATION_COUNT = Gauge(
    'monkeycode_avg_modifications',
    'Average modifications before acceptance',
    ['model_name', 'suggestion_type']
)
# Share of generated code that compiles / runs / passes tests
CODE_SUCCESS_RATE = Gauge(
    'monkeycode_code_success_rate',
    'Generated code success rate (compiles/runs/tests pass)',
    ['model_name', 'language']
)
# ===== Resource consumption metrics =====
# Currently active sessions
ACTIVE_SESSIONS = Gauge(
    'monkeycode_active_sessions',
    'Currently active user sessions'
)
# Depth of the waiting queue
QUEUE_DEPTH = Gauge(
    'monkeycode_queue_depth',
    'Current request queue depth'
)
# Context window usage
CONTEXT_WINDOW_USAGE = Histogram(
    'monkeycode_context_window_usage_ratio',
    'Context window usage ratio',
    ['model_name'],
    buckets=[0.25, 0.5, 0.75, 0.8, 0.9, 0.95, 0.99, 1.0]
)
# ===== Model-related metrics =====
# Model load time
MODEL_LOAD_TIME = Histogram(
    'monkeycode_model_load_seconds',
    'Model loading time',
    ['model_name'],
    buckets=[1, 5, 10, 30, 60, 120, 300]
)
# Model switch count
MODEL_SWITCHES = Counter(
    'monkeycode_model_switches_total',
    'Total model switches',
    ['from_model', 'to_model', 'reason']
)
# Cache hit rate (KV cache prefix caching)
CACHE_HIT_RATE = Gauge(
    'monkeycode_cache_hit_rate',
    'KV Cache hit rate',
    ['model_name']
)
# ===== Security-related metrics =====
# Security audit events
SECURITY_EVENTS = Counter(
    'monkeycode_security_events_total',
    'Security-related events',
    ['event_type', 'severity']  # injection_attempt, prompt_leak, rate_limit, data_exfiltration
)
# Requests blocked by the content filter
CONTENT_FILTER_BLOCKS = Counter(
    'monkeycode_content_filter_blocks_total',
    'Content filter block count',
    ['filter_type', 'reason']
)
# ===== Decorator: record metrics automatically =====
def track_request(model_name="default"):
    """Decorator that records request metrics for an async handler."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            status = "success"
            try:
                # Await the handler so the timer and the except clauses cover its execution
                result = await func(*args, **kwargs)
                # For streaming results, also record the generated token count
                if getattr(result, 'is_streaming', False):
                    TOKENS_GENERATED.labels(
                        model_name=model_name,
                        streaming="true"
                    ).inc(result.token_count)
                return result
            except RateLimitError:
                status = "rate_limited"
                SECURITY_EVENTS.labels(
                    event_type="rate_limit",
                    severity="warning"
                ).inc()
                raise
            except ContentFilterError as e:
                status = "filtered"
                CONTENT_FILTER_BLOCKS.labels(
                    filter_type=e.filter_type,
                    reason=e.reason
                ).inc()
                raise
            except Exception:
                status = "error"
                raise
            finally:
                duration = time.perf_counter() - start_time
                # REQUEST_LATENCY is declared without a "status" label, so none is passed here
                REQUEST_LATENCY.labels(
                    method="POST",
                    endpoint="/api/generate",
                    model_name=model_name
                ).observe(duration)
                REQUEST_COUNT.labels(
                    method="POST",
                    endpoint="/api/generate",
                    status=status,
                    model_name=model_name
                ).inc()
        return wrapper
    return decorator
# ===== Usage example =====
# CodeRequest stands for the service's request model; it is not defined in this file.
@track_request(model_name="qwen-coder-7b")
async def generate_code(request: "CodeRequest"):
    """Code generation endpoint."""
    # ... business logic ...
    pass
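
The bucket boundaries chosen for the histograms above matter because Prometheus estimates percentiles from cumulative bucket counts by linear interpolation. The sketch below imitates that estimate in plain Python so the effect of the bucket layout is easy to see. It is a simplified illustration: `estimate_quantile` is a made-up helper, not the Prometheus implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket, as a Prometheus histogram exposes.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # The rank falls into +Inf: the best available answer is the
                # largest finite bound, so the estimate is capped there.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    ttft = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
    print("P50:", estimate_quantile(0.50, ttft))
    print("P95:", estimate_quantile(0.95, ttft))
    # With too few buckets, the tail collapses onto the last finite bound.
    coarse = [(1.0, 90), (math.inf, 100)]
    print("P99 with coarse buckets:", estimate_quantile(0.99, coarse))
```

The last case is the practical lesson: if the largest finite bucket is smaller than the latencies you alert on, the reported P99 can never exceed that bound, which is why the request-latency buckets extend to 60 seconds.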
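3.3 Business quality metrics (specific to AI)
business_quality_metrics:
  category: "AI output quality monitoring"
  importance: "⭐⭐⭐⭐⭐ The core capability that sets this apart from traditional monitoring"
  metrics:
    acceptance_rate:
      name: "Code acceptance rate"
      definition: "Accepted suggestions / total suggestions"
      collection_method: |
        Reported by the IDE plugin:
        - User clicks the "Accept" button → accept event
        - User edits the suggestion and then accepts it → modified_accept event
        - User rejects the suggestion → reject event
        - User ignores it (no action) → ignore event (decided by timeout)
      baseline: "45-55%"  # rough industry baseline
      target: "> 65%"     # target after custom training
      alert_condition: "< 40% for 4h → the model may have degraded"
    modification_distance:
      name: "Modification distance (edit distance)"
      definition: "Levenshtein distance / length of the original output"
      meaning: "The smaller the value, the closer the AI output is to what the user finally keeps"
      collection: "Diff comparison in the IDE plugin"
      baseline: "0.15-0.25"
      target: "< 0.12"
    code_compilation_rate:
      name: "Compilation rate"
      definition: "Share of suggestions with no syntax errors in the IDE"
      collection: "Real-time feedback from the IDE language server"
      target: "> 90%"
    test_pass_rate:
      name: "Test pass rate"
      definition: "Share of generated code that passes the existing unit tests"
      collection: "CI pipeline integration"
      target: "> 75%"
    security_violation_rate:
      name: "Security violation rate"
      definition: "Share of suggestions flagged by security scanning"
      collection: "SonarQube / SAST integration"
      target: "< 2%"
    style_conformance_score:
      name: "Style conformance score"
      definition: "How closely the code follows the team's coding conventions (0-100)"
      collection: "Lint tool score"
      target: "> 85"
    user_satisfaction_csat:
      name: "User satisfaction (CSAT)"
      definition: "Rating submitted by users (1-5 stars)"
      collection: "Weekly pop-up survey"
      target: "> 4.2"
    time_saved_estimate:
      name: "Estimated time saved"
      definition: "(accepted lines of code × average time to write one line by hand) - time spent interacting"
      collection: "Statistical analysis"
      reporting: "Weekly management report"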
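
The first two definitions above are easy to compute once the IDE plugin events are available. The sketch below is one possible reading: it assumes that `modified_accept` counts as an acceptance and that `ignore` counts toward the denominator, both of which are interpretation choices rather than something the definitions spell out. Function names are illustrative.

```python
from collections import Counter


def acceptance_rate(events):
    """Accepted suggestions / total suggestions, from a list of event names."""
    counts = Counter(events)
    # Assumption: an edited-then-accepted suggestion still counts as accepted.
    accepted = counts["accept"] + counts["modified_accept"]
    total = accepted + counts["reject"] + counts["ignore"]
    return accepted / total if total else 0.0


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def modification_distance(suggested, final):
    """Levenshtein distance normalized by the length of the original suggestion."""
    if not suggested:
        raise ValueError("suggested text must be non-empty")
    return levenshtein(suggested, final) / len(suggested)


if __name__ == "__main__":
    events = ["accept", "modified_accept", "reject", "ignore"]
    print("acceptance rate:", acceptance_rate(events))
    print("modification distance:",
          modification_distance("return a+b", "return a + b"))
```

Because the distance is normalized by the suggestion length, short completions are penalized more heavily for each edited character, so it is worth tracking the metric separately per suggestion type.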
4. Alerting rules
4.1 Alert severity levels
MonkeyCode alert severity levels
🔴 P0: Fatal (handle immediately)
├── Service completely unavailable
├── Risk of data leakage
├── Entire GPU cluster down
├── Notification: phone call + SMS + DingTalk + @all
├── Response time: < 5 minutes
└── Example: up{job="monkeycode-api"} == 0
🟠 P1: Severe (resolve within 30 minutes)
├── Single-node failure causing partial unavailability
├── Response latency exceeds the SLA
├── Frequent GPU OOM
├── Notification: DingTalk + phone call to on-call + email
├── Response time: < 15 minutes
└── Example: p99_latency > 30s for 5 minutes
🟡 P2: Warning (resolve within 2 hours)
├── Resource usage is high but the service is not yet affected
├── Code quality metrics dip slightly
├── Notification: DingTalk + WeCom
├── Response time: < 2 hours
└── Example: GPU utilization < 20% for 30 minutes
🟢 P3: Notice (handle during working hours)
├── Capacity trend warnings
├── Certificate about to expire
├── Notification: email + daily report
├── Response time: < 24 hours
└── Example: disk usage > 80% (and rising)
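
The level definitions above amount to a lookup from priority to notification channels and a response deadline. A small sketch of that mapping follows. The channel identifiers, the `AlertPolicy` type, and the choice to treat an unlabeled alert as P1 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    channels: tuple        # notification channels, in the order they are tried
    response_minutes: int  # maximum time allowed before someone responds


# Values mirror the level definitions in section 4.1.
POLICIES = {
    "P0": AlertPolicy(("phone", "sms", "dingtalk"), 5),
    "P1": AlertPolicy(("dingtalk", "phone", "email"), 15),
    "P2": AlertPolicy(("dingtalk", "wecom"), 120),
    "P3": AlertPolicy(("email", "daily_report"), 24 * 60),
}


def route_alert(labels):
    """Pick the policy for an alert from its `priority` label.

    Assumption: a missing or unknown priority is treated as P1, so that a
    mislabeled alert is over-escalated rather than silently dropped.
    """
    return POLICIES.get(labels.get("priority"), POLICIES["P1"])


if __name__ == "__main__":
    alert = {"alertname": "MonkeyCodeServiceDown", "priority": "P0"}
    policy = route_alert(alert)
    print(policy.channels, policy.response_minutes)
```

Keeping the table in one place makes it easy to check that every rule's `priority` label has a matching route.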
4.2 Core alerting rule configuration
# monkeycode_alert_rules.yaml
# Prometheus alerting rules (evaluated by Prometheus; Alertmanager handles routing and notification)
groups:
  - name: monkeycode_critical_p0
    interval: 10s
    rules:
      # === Service availability ===
      - alert: MonkeyCodeServiceDown
        expr: up{job="monkeycode-api"} == 0
        for: 1m
        labels:
          severity: critical
          priority: P0
        annotations:
          summary: "🔴 MonkeyCode service is completely unavailable!"
          description: "Instance {{ $labels.instance }} has been offline for more than 1 minute"
          runbook: "https://wiki.internal/runbooks/monkeycode-down"
      # Note: check that your exporter actually exposes DCGM_FI_DEV_GPU_STATUS;
      # it may need to be replaced by a health metric available in your setup.
      - alert: MonkeyCodeAllGPUDown
        expr: count(DCGM_FI_DEV_GPU_STATUS{job="dcgm-exporter"} == 0) == count(DCGM_FI_DEV_GPU_STATUS{job="dcgm-exporter"})
        for: 0m
        labels:
          severity: critical
          priority: P0
        annotations:
          summary: "🔴 All GPUs are unavailable!"
          description: "Every MonkeyCode GPU node is unhealthy; AI inference is completely down"
      # === Data security ===
      - alert: MonkeyCodeDataLeakDetected
        expr: increase(monkeycode_security_events_total{event_type="data_exfiltration"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
          priority: P0
        annotations:
          summary: "🔴 Suspected data exfiltration detected!"
          description: "{{ $value }} data exfiltration events detected in the last 5 minutes"
  - name: monkeycode_severe_p1
    interval: 30s
    rules:
      # === Performance SLA ===
      - alert: MonkeyCodeHighLatencyP99
        expr: histogram_quantile(0.99, rate(monkeycode_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: severe
          priority: P1
        annotations:
          summary: "🟠 API P99 latency above 30 seconds"
          description: "Current P99 is {{ $value }}s, which severely hurts the user experience"
      - alert: MonkeyCodeHighLatencyP50
        expr: histogram_quantile(0.50, rate(monkeycode_request_duration_seconds_bucket[5m])) > 10
        for: 10m
        labels:
          severity: severe
          priority: P1
        annotations:
          summary: "🟠 API P50 latency above 10 seconds"
          description: "Median latency has reached {{ $value }}s; at least half of all requests are affected"
      # === GPU resources ===
      - alert: MonkeyCodeGPUOOM
        expr: increase(monkeycode_gpu_oom_total[10m]) > 0
        for: 0m
        labels:
          severity: severe
          priority: P1
        annotations:
          summary: "🟠 GPU OOM occurred (out of GPU memory)"
          description: "{{ $value }} OOM events in the last 10 minutes"
      - alert: MonkeyCodeQueueBacklog
        expr: monkeycode_queue_depth > 100
        for: 2m
        labels:
          severity: severe
          priority: P1
        annotations:
          summary: "🟠 Request queue is severely backed up"
          description: "Current queue depth is {{ $value }}; new requests will time out"
      # === Error rate ===
      - alert: MonkeyCodeHighErrorRate
        expr: |
          (
            sum(rate(monkeycode_requests_total{status="error"}[5m]))
            /
            sum(rate(monkeycode_requests_total[5m]))
          ) > 0.1
        for: 5m
        labels:
          severity: severe
          priority: P1
        annotations:
          summary: "🟠 Error rate above 10%"
          description: "Current error rate is {{ $value | humanizePercentage }}"
  - name: monkeycode_warning_p2
    interval: 1m
    rules:
      # === Inefficient resource usage ===
      - alert: MonkeyCodeLowGPUUtilization
        expr: avg(DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"}) < 20
        for: 30m
        labels:
          severity: warning
          priority: P2
        annotations:
          summary: "🟡 GPU utilization is persistently low"
          description: "Average utilization is only {{ $value }}%; resources may be wasted"
      - alert: MonkeyCodeHighMemoryUsage
        expr: monkeycode_process_memory_bytes / monkeycode_process_memory_limit_bytes > 0.85
        for: 15m
        labels:
          severity: warning
          priority: P2
        annotations:
          summary: "🟡 Process memory usage above 85%"
          description: "Current usage is {{ $value | humanizePercentage }}"
      # === Quality degradation ===
      - alert: MonkeyCodeAcceptanceRateDrop
        expr: monkeycode_acceptance_rate < 0.40
        for: 4h
        labels:
          severity: warning
          priority: P2
        annotations:
          summary: "🟡 Code acceptance rate below 40%"
          description: "Current acceptance rate is {{ $value | humanizePercentage }}; the model may have degraded or usage patterns may have changed"
      - alert: MonkeyCodeCacheHitRateDrop
        expr: monkeycode_cache_hit_rate < 0.3
        for: 1h
        labels:
          severity: warning
          priority: P2
        annotations:
          summary: "🟡 KV cache hit rate is too low"
          description: "Hit rate is {{ $value | humanizePercentage }}; check the prefix caching configuration"
  - name: monkeycode_info_p3
    interval: 5m
    rules:
      # === Capacity warnings ===
      - alert: MonkeyCodeDiskSpaceWarning
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) > 0.8
        for: 1h
        labels:
          severity: info
          priority: P3
        annotations:
          summary: "🟢 Data disk usage above 80%"
          # $value is the used fraction of the disk, not the remaining bytes
          description: "Current usage is {{ $value | humanizePercentage }}"
      - alert: MonkeyCodeCertExpiringSoon
        expr: (monkeycode_tls_cert_expiry - time()) / 86400 < 14
        for: 0m
        labels:
          severity: info
          priority: P3
        annotations:
          summary: "🟢 TLS certificate expires within 14 days"
          # $value is the number of days left, not a timestamp
          description: "Days remaining: {{ $value | humanize }}"
      # === Cost trend ===
      - alert: MonkeyCodeCostTrendUp
        # Compares the average token rate of the last 7 days with the 3 weeks before that
        expr: |
          (
            sum(rate(monkeycode_input_tokens_total[7d]))
            /
            sum(rate(monkeycode_input_tokens_total[21d] offset 7d))
          ) > 1.5
        for: 0m
        labels:
          severity: info
          priority: P3
        annotations:
          summary: "🟢 Token usage in the last 7 days is more than 50% above the previous 3 weeks"
          description: "Growth factor {{ $value }}x; keep an eye on cost"
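4.3 Alert inhibition and silencing strategy
alert_suppression:
  # Automatic silencing during maintenance windows
  maintenance_windows:
    - name: "Weekly routine maintenance"
      schedule: "Every Wednesday 02:00-04:00"
      suppress_levels: ["P2", "P3"]
      allow: ["P0", "P1"]  # Fatal and severe alerts still fire during maintenance
    - name: "Model upgrade window"
      trigger: "Model deployment detected"
      duration: "30 minutes"
      suppress:
        - "MonkeyCodeServiceDown"   # Brief unavailability during an upgrade is acceptable
        - "MonkeyCodeHighLatency*"  # High latency during cold start is expected
  # Alert storm protection
  storm_protection:
    group_by: ["alertname", "severity"]
    group_wait: "30s"
    group_interval: "5m"
    repeat_interval: "4h"
    # An alert that is still firing is not re-sent within 4 hours
  # Correlation-based inhibition (avoids alert storms)
  correlation:
    - when: ["MonkeyCodeServiceDown"]
      suppress: ["*Latency*", "*ErrorRate*", "*Queue*"]
      reason: "When the service is down, latency, error-rate, and queue alerts are inevitable and need no separate notification"
    - when: ["MonkeyCodeAllGPUDown"]
      suppress: ["*GPU*", "*Inference*"]
      reason: "When every GPU is down, alerts on individual GPU metrics carry no information"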
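
The correlation rules can be read as a filter over the set of currently firing alerts. The sketch below applies the two rules with glob matching from the standard library. It is an illustration of the logic only; in a Prometheus setup the same effect is normally configured through Alertmanager's `inhibit_rules`, and the rule table and function name here are assumptions.

```python
from fnmatch import fnmatchcase

# Mirrors the `correlation` section: when `when` is firing, alerts whose
# names match any `suppress` pattern are not notified.
CORRELATION_RULES = [
    {"when": "MonkeyCodeServiceDown",
     "suppress": ["*Latency*", "*ErrorRate*", "*Queue*"]},
    {"when": "MonkeyCodeAllGPUDown",
     "suppress": ["*GPU*", "*Inference*"]},
]


def filter_alerts(firing):
    """Return the alert names that should still be notified, in input order."""
    firing_set = set(firing)
    kept = []
    for name in firing:
        suppressed = any(
            rule["when"] in firing_set
            # A root-cause alert must never suppress itself:
            # "MonkeyCodeAllGPUDown" would otherwise match "*GPU*".
            and name != rule["when"]
            and any(fnmatchcase(name, pattern) for pattern in rule["suppress"])
            for rule in CORRELATION_RULES
        )
        if not suppressed:
            kept.append(name)
    return kept


if __name__ == "__main__":
    firing = [
        "MonkeyCodeServiceDown",
        "MonkeyCodeHighLatencyP99",
        "MonkeyCodeHighErrorRate",
        "MonkeyCodeDiskSpaceWarning",
    ]
    print(filter_alerts(firing))
```

The self-suppression guard is the detail most easily missed when wildcard patterns are written by hand.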
5. Dashboard design
5.1 Global overview dashboard (one-page glance)
┌────────────────────────────────────────────────────────────────────────────┐
│ 🐒 MonkeyCode Operations Overview                                          │
│ Last updated: 2026-06-22 14:32:08                                          │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐   │
│  │ ✅ Status     │ │ 📊 Uptime     │ │ 👥 Online     │ │ ⚡ QPS        │   │
│  │ Running       │ │ 99.97%        │ │ 1,247         │ │ 156.3         │   │
│  │               │ │ ↑ vs last mo. │ │ ↑ 12%         │ │ ↑ 8%          │   │
│  └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘   │
│                                                                            │
│  ┌──────────────────────────────────────┐ ┌────────────────────────────┐   │
│  │ 📈 Requests today (line chart)       │ │ 🎯 Code quality            │   │
│  │ 200 │         /\/\/\                 │ │                            │   │
│  │     │        /      \                │ │ Accept    ████████░░ 72%   │   │
│  │ 100 │       /        \               │ │ Compile   ██████████ 91%   │   │
│  │     │      /          \              │ │ Tests     ███████░░░ 68%   │   │
│  │   0 ├──────────────────────────→     │ │ Security  █████████░ 89%   │   │
│  │     0h    6h    12h   18h   24h      │ │ Style     ████████░░ 84%   │   │
│  └──────────────────────────────────────┘ └────────────────────────────┘   │
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ 🖥️ GPU cluster status                                                │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐   │  │
│  │  │ GPU-0  │ │ GPU-1  │ │ GPU-2  │ │ GPU-3  │ │ GPU-4  │ │ GPU-5  │   │  │
│  │  │ 78% 🔵 │ │ 82% 🔵 │ │ 15% 🟡 │ │ 91% 🔴 │ │ 67% 🔵 │ │ 73% 🔵 │   │  │
│  │  │ 72°C   │ │ 75°C   │ │ 58°C   │ │ 83°C   │ │ 71°C   │ │ 69°C   │   │  │
│  │  │ 245W   │ │ 251W   │ │ 180W   │ │ 278W   │ │ 238W   │ │ 242W   │   │  │
│  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘   │  │
│  │ Average: 68% | Max temperature: 83°C | Total power: 1,434W           │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                            │
│  ┌──────────────────────────────┐ ┌─────────────────────────────────────┐  │
│  │ ⏱️ Latency (P99)             │ │ 💰 Estimated cost today             │  │
│  │ Target: 10s                  │ │ API calls:    $234.56               │  │
│  │ Current: 7.2s ✅             │ │ GPU rental:   $1,200.00             │  │
│  │ ┌────────────────┐           │ │ Storage:      $45.00                │  │
│  │ │  ╱╲    ╱╲      │           │ │ Network:      $12.00                │  │
│  │ │ ╱  ╲  ╱  ╲     │           │ │ ─────────────────────               │  │
│  │ │╱    ╲╱    ╲    │           │ │ Total:        $1,491.56             │  │
│  │ │────────────────│ 10s       │ │ Per request:  $0.0095               │  │
│  │ └────────────────┘           │ │ vs budget:    ✓ OK                  │  │
│  └──────────────────────────────┘ └─────────────────────────────────────┘  │
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ 🚨 Recent alerts (last 24 hours)                                     │  │
│  │ 14:20 🟡 GPU-2 utilization low (<20%), lasting 35 minutes            │  │
│  │ 12:05 🟠 P99 latency over target (12.3s), recovered                  │  │
│  │ 09:30 🟢 Certificate expires in 12 days                              │  │
│  │ 03:15 🔧 Planned maintenance finished, service back to normal        │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────┘
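6. Self-healing
6.1 Automated operations actions
auto_healing_actions:
  level_1_automatic:  # No human intervention needed
    - name: "Restart a single Pod"
      trigger: 'monkeycode_pod_restart_total{reason="oomkilled"} > 0'
      action: "kubectl delete pod --force, then wait for the Deployment (ReplicaSet) to recreate it"
      notify: "Notify only on failure"
    - name: "Rate-limit protection"
      trigger: "monkeycode_queue_depth > 80"
      action: "Dynamically set concurrency_limit = current * 0.8"
      rollback: "Gradually restore the original limit once the queue drops below 30"
    - name: "Model rollback"
      trigger: "monkeycode_acceptance_rate drop > 30% in 2h"
      action: "Automatically switch back to the previous model version"
      notify: "P2 notification + rollback root-cause analysis"
    - name: "Graceful degradation"
      trigger: "Available GPUs < 50% of total"
      action: |
        Switch to a lighter model (for example 7B instead of 32B)
        Allow longer timeouts
        Show a "load is currently high" notice
      notify: "P1 notification + degradation announcement"
  level_2_approval_needed:  # Requires approval
    - name: "Add GPU nodes"
      trigger: "GPU utilization > 85% for 2 hours and the queue is backing up"
      action: "Add N GPU nodes"
      approval: "Auto-approved when the cost is below $500/day"
    - name: "Switch upstream model provider"
      trigger: "A provider's error rate stays above 20%"
      action: "Route traffic to the backup provider"
      approval: "Requires on-call confirmation"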
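
The rate-limit protection action describes a small feedback loop: shrink the concurrency limit to 80% while the queue is deeper than 80, and restore it gradually once the queue falls below 30. A minimal sketch of that loop is shown below. The class name, the additive recovery step of 10% of the base limit, and the floor of one concurrent request are assumptions, since the source does not specify how gradual the recovery is.

```python
class AdaptiveLimiter:
    """Shrinks a concurrency limit under queue pressure and restores it gradually."""

    def __init__(self, base_limit, high_watermark=80, low_watermark=30,
                 shrink_percent=80, min_limit=1):
        self.base_limit = base_limit
        self.limit = base_limit
        self.high_watermark = high_watermark
        self.low_watermark = low_watermark
        self.shrink_percent = shrink_percent
        self.min_limit = min_limit
        # Assumed recovery step: 10% of the base limit per update.
        self.recovery_step = max(1, base_limit // 10)

    def update(self, queue_depth):
        """Adjust the limit for the observed queue depth and return it."""
        if queue_depth > self.high_watermark:
            # Integer arithmetic keeps the result exact (limit * 0.8, rounded down).
            shrunk = self.limit * self.shrink_percent // 100
            self.limit = max(self.min_limit, shrunk)
        elif queue_depth < self.low_watermark:
            self.limit = min(self.base_limit, self.limit + self.recovery_step)
        # Between the two watermarks the limit is left unchanged (hysteresis).
        return self.limit


if __name__ == "__main__":
    limiter = AdaptiveLimiter(base_limit=100)
    for depth in (90, 95, 50, 10, 10, 10, 10):
        print(depth, "->", limiter.update(depth))
```

The gap between the two watermarks is what prevents the limit from oscillating when the queue hovers around a single threshold.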
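6.2 Failure drills (chaos engineering)
#!/bin/bash
# MonkeyCode failure drill plan
# Recommended cadence: once a month
set -euo pipefail
# Restore the normal state on exit, even if a scenario fails midway
cleanup() {
  echo "[Cleanup] Restoring normal state..."
  kubectl uncordon gpu-node-3 || true
  kubectl delete -f chaos/ --ignore-not-found || true
  curl -X POST http://monkeycode-internal/admin/set-model \
    -H "Content-Type: application/json" \
    -d '{"model": "production-model"}' || true
}
trap cleanup EXIT
echo "=========================================="
echo " MonkeyCode Chaos Engineering drill"
echo "=========================================="
# Scenario 1: simulate a GPU node failure
echo "[Scenario 1] Simulating failure of node gpu-node-3..."
kubectl cordon gpu-node-3
kubectl drain gpu-node-3 --ignore-daemonsets --delete-emptydir-data
sleep 60
echo "🔍 Verify: did the workloads migrate automatically? Did the remaining GPUs take over the traffic?"
# Scenario 2: simulate network latency
echo "[Scenario 2] Simulating 500ms latency at the API gateway..."
kubectl apply -f chaos/network-latency.yaml
sleep 120
echo "🔍 Verify: are the timeouts reasonable? Is the user experience acceptable?"
# Scenario 3: simulate a burst of concurrent requests
echo "[Scenario 3] Simulating a 3x concurrency burst..."
kubectl apply -f chaos/load-spike.yaml
sleep 180
echo "🔍 Verify: did rate limiting kick in? Did the queue stay under control?"
# Scenario 4: simulate degraded model output
echo "[Scenario 4] Simulating garbled or degraded model output..."
# Switch to the test model through a feature flag
curl -X POST http://monkeycode-internal/admin/set-model \
  -H "Content-Type: application/json" \
  -d '{"model": "test-degraded-model"}'
sleep 300
echo "🔍 Verify: did quality monitoring catch it? Was the automatic rollback triggered?"
echo ""
echo "✅ All drill scenarios finished!"
echo "📋 See the drill report: https://grafana.internal/d/chaos-report"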
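7. Roadmap for building the monitoring system
Phase 1 (weeks 1-2): basic monitoring
├── Deploy Prometheus + Grafana
├── Collect basic metrics (CPU / memory / GPU / request volume)
├── Roll out P0/P1 alerting rules
└── Goal: know whether the service is alive
Phase 2 (weeks 3-4): in-depth monitoring
├── Onboard business quality metrics
├── Distributed tracing
├── Log aggregation and analysis
├── Complete the dashboards
└── Goal: know whether the service is healthy
Phase 3 (weeks 5-6): intelligent monitoring
├── Roll out anomaly-detection algorithms
├── Automated root-cause analysis
├── Self-healing mechanisms
├── Capacity forecasting
└── Goal: know when the service is about to have a problem
Phase 4 (weeks 7-8): closing the operations loop
├── Automated SLA reports
├── Cost optimization recommendations
├── Regular failure drills
├── Monitoring as documentation (auto-generated runbooks)
└── Goal: monitoring that drives continuous improvement
8. Summary
Monitoring is the infrastructure that takes MonkeyCode from "usable" to "pleasant to use" and finally to "safe to rely on".
Key points:
- Four-layer architecture: collection → storage → processing → presentation, each layer building on the previous one
- GPUs come first: memory, utilization, temperature, and power draw all need to be watched
- Business quality metrics: acceptance rate, modification rate, and compilation rate are the dimensions specific to an AI service
- Tiered alerting: four levels from P0 to P3, each with its own response policy
- Self-healing: from automatic restarts to graceful degradation, to reduce MTTR
- Incremental rollout: do not try to build everything at once; proceed phase by phase
In one sentence: a good monitoring system tells you about a problem before it reaches your developers.
Next article: "MonkeyCode Troubleshooting Handbook: Diagnosing and Fixing Common Problems"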