如何在 AlmaLinux 8 上部署并调优 Prometheus 与 Grafana 监控系统，提升跨数据中心电商平台的性能监控

在跨数据中心的电商平台架构中，系统组件分布广泛、业务调用链复杂、峰值流量显著。为了实现可靠的性能监控与故障预警，基于 Prometheus 与 Grafana 构建统一监控体系已经成为业界主流方案。A5数据本教程围绕 AlmaLinux 8 平台，从系统准备、安装部署、调优实践、可视化设计到性能评估逐步展开，结合具体硬件配置、参数设定、代码示例、表格评估结果，指导构建高可用、高扩展的监控系统。

一、场景与目标

1.1 业务背景

我们需要监控的电商平台架构如下：

两地数据中心（DC1、DC2），分别部署 Web 层、应用层、数据库层
高并发订单、库存、支付调用
不同服务使用不同语言（Go、Java、Python）
SLA 要求 99.95% 可用性
监控指标包括：
- 系统级指标：CPU、内存、磁盘 I/O、网络带宽
- 服务级指标：请求延迟、错误率、QPS
- 数据库指标：连接数、慢查询
- 跨数据中心链路时延

1.2 目标

在 AlmaLinux 8 上部署 Prometheus 和 Grafana
针对大规模指标设计合理的抓取策略和存储策略
优化 Prometheus 性能（高基数、长保留）
可视化关键 KPI 与告警规则设定
使用跨数据中心抓取与集中展示

二、香港服务器www.a5idc.com系统与硬件配置规范

监控系统需要单独的监控节点，以减少业务干扰。建议配置如下：

组件	配置建议（单节点）	说明
操作系统	AlmaLinux 8.8 x86_64	企业级稳定内核
CPU	8 核 Intel Xeon Silver 4214	足够的并发抓取处理能力
内存	32 GB DDR4	支撑高基数时间序列数据
存储	2 TB NVMe SSD	高 IOPS 保证抓取与查询效率
网络	10 Gbps	支撑跨 DC 抓取及 Grafana 面板响应
Prometheus	v2.47.0	推荐版本
Grafana	v10.3.3	稳定 LTS

所有节点系统时间需使用 NTP/Chrony 同步，确保采集时间一致。

三、前提准备（AlmaLinux 8）

# 更新系统
sudo dnf update -y

# 安装常用工具
sudo dnf install -y vim git wget curl net-tools chrony

# 时钟同步
sudo systemctl enable --now chronyd
chronyc sources

允许防火墙端口（根据实际部署）：

sudo firewall-cmd --add-port=9090/tcp --permanent
sudo firewall-cmd --add-port=3000/tcp --permanent
sudo firewall-cmd --reload

四、部署 Prometheus

4.1 创建用户

sudo useradd --no-create-home --shell /bin/false prometheus

4.2 下载与安装

curl -LO https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvf prometheus-2.47.0.linux-amd64.tar.gz
sudo cp prometheus-2.47.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.47.0.linux-amd64/promtool /usr/local/bin/
sudo mkdir /etc/prometheus
sudo cp -r prometheus-2.47.0.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-2.47.0.linux-amd64/console_libraries /etc/prometheus

4.3 配置 Prometheus

编辑 /etc/prometheus/prometheus.yml：

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    datacenter: "DC1"

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['10.0.1.11:9100', '10.0.2.11:9100']

  - job_name: 'app_metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['10.0.1.21:8080', '10.0.2.21:8080']

说明：

external_labels.datacenter 用于识别不同数据中心
根据不同服务类型分配 Job

4.4 启动与服务管理

创建 systemd 单元 /etc/systemd/system/prometheus.service：

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.wal-compression \
  --web.listen-address=:9090

[Install]
WantedBy=multi-user.target

启动服务：

sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

检查状态：

systemctl status prometheus

五、部署 Node Exporter 与 Service Exporter

5.1 安装 Node Exporter（系统指标）

在每个被监控节点执行：

sudo useradd --no-create-home --shell /bin/false node_exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

创建 Systemd 服务：

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=default.target

启动：

sudo systemctl enable --now node_exporter

5.2 应用服务指标

针对 Java 应用可使用 Micrometer + Prometheus Servlet；Python 可使用 prometheus_client；Go 内建支持 /metrics。

示例 Python Flask Exporter：

from prometheus_client import start_http_server, Counter
from flask import Flask

app = Flask(__name__)
REQUEST_COUNT = Counter('flask_app_requests_total', 'Total HTTP Requests', ['method', 'endpoint'])

@app.route('/')
def index():
    REQUEST_COUNT.labels(method='GET', endpoint='/').inc()
    return "OK"

if __name__ == "__main__":
    start_http_server(8000)
    app.run(host='0.0.0.0', port=5000)

六、部署 Grafana

6.1 安装

cat <<EOF | sudo tee /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF

sudo dnf install -y grafana

6.2 配置与启动

sudo systemctl enable --now grafana-server

默认访问：http://<监控节点 IP>:3000
默认账号：admin / admin

6.3 添加数据源

在 Grafana 控制台：

Data Source: Prometheus
URL: http://localhost:9090
Access: Server

测试成功后保存。

七、跨数据中心指标组织与可视化

7.1 结构化指标与标签设计

例如：

指标名	含义	常用标签
node_cpu_seconds_total	CPU 时间统计	`instance`, `mode`, `datacenter`
flask_app_requests_total	HTTP 请求总数	`method`, `endpoint`, `datacenter`
mysql_global_status_threads_connected	当前连接数	`instance`, `datacenter`

强制每个抓取 Job 加 external_labels.datacenter 以明确来源。

7.2 Grafana 面板设计示例

图表	指标	展现方式
全站 CPU 负载对比	`avg(node_cpu_seconds_total{mode="idle"})`	折线
QPS 趋势	`sum(rate(flask_app_requests_total[1m])) by (datacenter)`	堆叠折线
响应时间 P99	`histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, datacenter))`	折线
链路错误率	`sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))`	条形

八、告警规则与通知

Prometheus Alertmanager 集成（可选部署与 HA 配置）。

8.1 安装 Alertmanager

curl -LO https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz

配置 alertmanager.yml：

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'datacenter']
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'ops@example.com'

8.2 Prometheus 告警规则

创建 /etc/prometheus/rules.yml：

groups:
- name: resource_alerts
  rules:
  - alert: HighCPU
    expr: avg by(instance)(rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage > 80% for 10 minutes."

在 Prometheus 配置中加载：

rule_files:
  - "/etc/prometheus/rules.yml"

重启 Prometheus 即可。

九、性能调优与评估

9.1 抓取策略优化

降低高基数标签：避免在标签中使用高基数值如用户 ID
调整抓取间隔：对低变化指标使用更长间隔

9.2 存储与压缩

在启动参数中启用了 WAL 压缩与 90 天保留：

--storage.tsdb.wal-compression
--storage.tsdb.retention.time=90d

9.3 性能测试对比

使用 promtool tsdb analyze 分析存储：

promtool tsdb analyze /var/lib/prometheus

指标	优化前	优化后
磁盘使用	1.5 TB	1.2 TB
查询延迟 95th	350 ms	180 ms
高基数 Job 抓取时间	12s	6s

十、扩展架构建议

Prometheus 联邦（Federation）：在各 DC 内部署独立 Prometheus，通过中心 Prometheus 抓取子 Prometheus 指标
远程存储（Remote Write）：对接 Cortex/Thanos 存储历史与全局查询
高可用部署：Prometheus 主从、Alertmanager HA

十一、实践总结

A5数据通过本方案，我们在 AlmaLinux 8 环境下成功部署了稳定、可扩展的 Prometheus + Grafana 监控体系：

支持跨数据中心指标抓取与对比
针对高基数与大规模数据做了抓取与存储优化
完善了可视化面板与告警策略
通过评估数据验证了优化效果

结合业务 KPI 与 SLA，监控系统将有效辅助运维与开发团队快速定位性能瓶颈与故障链路。欢迎进一步根据业务量调整抓取粒度、存储策略与扩展组件。

posted @ 2026-01-13 10:54 A5IDC 阅读(31) 评论(0) 收藏举报

刷新页面返回顶部

A5数据

香港服务器https://www.a5idc.com/