GKLBB


Jargon in plain terms --- what is a DDoS-like avalanche


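General symptoms of this class of failure

This class of failure can be summarized as:

A DDoS-like service avalanche triggered by network jitter.

It is not a single node suddenly going down. It starts with a minor network anomaly, which is then amplified step by step by client retries, connection pile-up, and server-side resource consumption, until most of the business system becomes unavailable.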


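I. User-side symptoms

| Stage | Typical symptoms |
|---|---|
| Early | HIS opens slowly, logins are slow, occasional stalls |
| Developing | Spinners on actions, slow queries, slow saves, occasional "connection failed" prompts |
| Spreading | Some clients fail to log in; a second attempt sometimes succeeds |
| Peak | Many clients cannot log in at the same time, or get no response after login |
| Severe | "Database connection failed" and "server connection failed" errors; business interrupted |
| Recovery | A service restart brings brief relief, with scattered stalls remaining |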

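II. Network-side symptoms

| Metric | Symptom |
|---|---|
| ping | Occasional timeouts, rising packet loss |
| TCP connections | SYN_NO_REPLY, timeout, and RST appear |
| TCP retransmissions | Rise sharply, which points to packet loss or badly delayed responses |
| NIC traffic | Near the bottleneck at peak; may drop suddenly later |
| Gateway/firewall | Latency, session count, or interface congestion may rise |
| Switch port | drop, discard, CRC, or pause frame counters may increase |

Note in particular:

A sudden drop in traffic does not necessarily mean recovery. It can also mean the server has stopped responding and client requests can no longer get in.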


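III. Server-side symptoms

| Metric | Symptom |
|---|---|
| Business port 5522, Established | Connection count rises quickly |
| 5522 CLOSE_WAIT | Increases markedly: the peer has closed but the service has not released its side of the connection |
| 5522 SYN_RECEIVED | Increases: a backlog is forming at connection setup |
| Database port 1432 | Connection count rises afterwards, pulled up by the business layer |
| CPU | Not necessarily high; the failure can occur at low to medium load |
| Memory | Not necessarily exhausted |
| Disk | Queue length may spike briefly, but disk is usually not the first cause |
| Service status | May still show as running while no longer answering business requests |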

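IV. Database-side symptoms

| Symptom | Meaning |
|---|---|
| 1432 connection count rises | The HIS business service is holding a large number of database connections |
| SQL connection waits | The request backlog has propagated to the database |
| Slow queries | The database is under load it did not originate |
| Clients report database connection failure | Not necessarily a database-first failure; it may be the business-layer avalanche propagating |

Key judgment:

If 5522 becomes abnormal first and 1432 rises afterwards, the cause is most likely a business-layer avalanche propagating downward, not a database that failed first.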
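
The ordering rule above can be sketched in a few lines. This is a minimal illustration, not part of any tool mentioned here: the function names, the sample format (timestamp, connection count), and the thresholds are all assumptions chosen for the example.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, connection count)

def first_breach(samples: Sequence[Sample], threshold: float) -> Optional[float]:
    """Return the timestamp of the first sample above threshold, or None."""
    for ts, value in samples:
        if value > threshold:
            return ts
    return None

def classify_origin(biz_5522: Sequence[Sample], db_1432: Sequence[Sample],
                    biz_limit: float, db_limit: float) -> str:
    """Compare which port crossed its (assumed) threshold first."""
    biz_ts = first_breach(biz_5522, biz_limit)
    db_ts = first_breach(db_1432, db_limit)
    if biz_ts is None and db_ts is None:
        return "normal"
    if db_ts is None or (biz_ts is not None and biz_ts < db_ts):
        return "business-layer cascade"
    if biz_ts is None or db_ts < biz_ts:
        return "database-first"
    return "simultaneous"
```

With 5522 crossing its limit at t=10 and 1432 crossing at t=20, the sketch reports a business-layer cascade. Real thresholds would have to come from a baseline of normal working hours.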


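V. How the failure evolves

Minor network jitter
↓
Business port 5522 responds slowly
↓
Clients retry automatically / users log in repeatedly
↓
5522 connection count grows quickly
↓
TCP retransmissions rise, CLOSE_WAIT piles up
↓
HIS service throughput drops
↓
Database connections on 1432 are pulled up in step
↓
Clients report connection failure / database failure
↓
The service enters a DDoS-like avalanche state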
 

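VI. The most typical characteristics

1. Slow first, then down

At the start the system is not completely unusable. Instead it is:

slow, stalling, failing occasionally, and working again after a refresh.


2. Local first, then spreading

A few clients report problems first, then more machines are affected, and finally most of the system is unavailable.


3. The business port clogs first

The HIS business entry point, 5522, usually shows the anomaly first; database port 1432 is pulled up afterwards.


4. Retries amplify the failure

The more users log in again, refresh, and retry, the higher the connection count climbs and the easier it is to bring the system down.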
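
A rough back-of-the-envelope model shows how quickly retries multiply load. It is only a sketch under stated assumptions: every failed attempt is retried immediately, failures are independent, and the client count, request rate, and retry limit are made-up numbers rather than measurements from this incident.

```python
def offered_attempts(clients: int, base_rate: float,
                     failure_ratio: float, max_retries: int) -> float:
    """Connection attempts per second seen by the server.

    clients       -- number of active clients (assumed)
    base_rate     -- requests per second per client when healthy (assumed)
    failure_ratio -- fraction of attempts that fail, between 0 and 1
    max_retries   -- retries each client makes before giving up
    """
    # One request costs 1 + p + p^2 + ... + p^max_retries attempts on average.
    attempts_per_request = sum(failure_ratio ** k for k in range(max_retries + 1))
    return clients * base_rate * attempts_per_request

healthy = offered_attempts(300, 0.5, 0.0, 3)   # no failures, no extra load
degraded = offered_attempts(300, 0.5, 0.5, 3)  # half of all attempts fail
print(healthy, degraded)
```

In this model a 50% failure ratio with three retries nearly doubles the attempt rate, and the extra attempts themselves raise the failure ratio, which is the feedback loop described above. Manual re-logins by users come on top of this.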


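5. CPU and memory are not necessarily saturated

This class of failure does not have to show up as 100% CPU or exhausted memory. It shows up as:

high TCP retransmissions, high connection counts, high CLOSE_WAIT, and failed connection attempts on the port.


6. A sudden traffic drop can be a false recovery

When the server can no longer process requests at all, clients fail to connect and traffic actually falls.

This is not recovery. It means:

requests cannot get in, and the service is close to unavailable.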
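
One way to tell a false recovery from a real one is to look at traffic and port probes together. The sketch below assumes two inputs that are not defined anywhere in this article: a short window of NIC throughput samples and a window of TCP probe results for port 5522. The ratios are illustrative defaults.

```python
from typing import Sequence

def looks_like_false_recovery(traffic_mbps: Sequence[float],
                              probe_ok: Sequence[bool],
                              drop_ratio: float = 0.5,
                              min_success: float = 0.8) -> bool:
    """Flag a traffic drop that coincides with failing port probes.

    traffic_mbps -- recent throughput samples, oldest first
    probe_ok     -- recent TCP probe results for the business port
    """
    if len(traffic_mbps) < 2 or not probe_ok:
        return False
    peak = max(traffic_mbps[:-1])
    if peak <= 0:
        return False
    dropped = traffic_mbps[-1] < peak * drop_ratio
    success_rate = sum(probe_ok) / len(probe_ok)
    # Falling traffic is only good news if the port is answering again.
    return dropped and success_rate < min_success
```

If traffic falls from 95 Mbps to 20 Mbps while three of the last four probes failed, the function returns True; the same drop with all probes succeeding is treated as a genuine recovery.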


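One-sentence summary

The general pattern is: network jitter and slow responses on the business port come first; client retries then pile up connections; the server fails to release connections; database connections are pulled up by propagation; and the end result is a service avalanche with a DDoS-like effect, developing from "occasionally slow" to "widespread login failures and business interruption".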

 

 

 

 

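A full review of monitoring the HIS avalanche problem with existing tools


Conclusion first

text
No off-the-shelf tool monitors this class of problem out of the box

However:
A combination of existing tools can cover roughly 80% of the monitoring needs
The remaining 20% (a correlated score across the four layers) needs a small amount of custom scripting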
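I. By monitoring layer: what existing tools can cover

Layer 1: link-layer monitoring

Smokeping (top recommendation)

text
This tool is built for exactly this job
Website: https://oss.oetiker.ch/smokeping/

What it monitors:
- Pings continuously and plots packet loss and latency trends
- Shows jitter
- Shows the moment packet loss spikes

For this incident:
- Monitor ping quality to 192.168.8.249
- Monitor ping quality to the gateway 192.168.8.254
- Compare the two curves to tell a link problem from a server problem

Deployment:
docker run -d \
  --name smokeping \
  -p 8080:80 \
  -e PUID=1000 \
  -e PGID=1000 \
  lscr.io/linuxserver/smokeping:latest

Config example (adding the HIS targets):
*** Targets ***
probe = FPing

+ HIS
menu = HIS server monitoring
title = HIS network quality

++ HISServer
menu = HIS main server
title = 192.168.8.249 network quality
host = 192.168.8.249

++ Gateway
menu = Gateway
title = 192.168.8.254 gateway quality
host = 192.168.8.254

Pros:
- Free and open source
- Clear graphs; the moment jitter starts is visible at a glance
- Keeps history, which helps post-incident review

Cons:
- Only measures ping here; it does not show TCP port state
- Built-in alerting is limited to pattern-based email alerts (the *** Alerts *** section), so richer alerting needs another tool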
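LibreNMS / Zabbix (network device monitoring)

text
Used to monitor interface traffic and packet loss on switches and firewalls

LibreNMS:
- Auto-discovers network devices
- Collects switch interface drop/discard counters over SNMP
- Bandwidth utilization trend graphs
- Built-in alert rules

Zabbix:
- A more general monitoring platform
- Supports SNMP monitoring of switches
- Supports custom items

For this incident:
- Monitor the access switch port that the HIS server is connected to
- Watch the port's in/out traffic and drop counters
- Firewall CPU, session count, NAT table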
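Layer 2: port-layer monitoring (5522)

Zabbix (most commonly used)

text
Built-in TCP port checks (simple checks)

Config example:
Item: net.tcp.service[tcp,192.168.8.249,5522]
- Returns 1: the port accepts connections
- Returns 0: the port does not accept connections

Item: net.tcp.service.perf[tcp,192.168.8.249,5522]
- Returns the time taken to connect, in seconds (0 means the service is down)

Alert rules:
- 3 consecutive results of 0: fire an alert
- Connect time > 3 s: fire an alert

Cons:
- Can only probe whether the port is reachable
- Cannot see connection states (Established/CLOSE_WAIT, etc.)
- Cannot see the number of connections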
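
The two alert conditions above (three consecutive failed checks, or a connect time over 3 seconds) can be expressed as a small function. This is only an illustration of the rule's logic with assumed names and input shapes; in practice the rule would be written as a Zabbix trigger rather than in Python.

```python
from typing import Sequence

def should_alert(statuses: Sequence[int], connect_seconds: float,
                 max_failures: int = 3, slow_after: float = 3.0) -> bool:
    """Evaluate the port alert rule.

    statuses        -- check results, oldest first (1 = up, 0 = down)
    connect_seconds -- latest measured connect time in seconds
    """
    recent = list(statuses[-max_failures:])
    # Alert only when the last max_failures checks all failed.
    down = len(recent) == max_failures and all(s == 0 for s in recent)
    slow = connect_seconds > slow_after
    return down or slow
```

Requiring consecutive failures keeps a single lost probe during light jitter from paging anyone, while the latency condition catches the "slow first" phase before the port actually refuses connections.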

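Prometheus + Blackbox Exporter (recommended combination)

text
Blackbox Exporter is dedicated to probe-style monitoring

Supported probes:
- TCP connect probes
- HTTP probes
- ICMP probes
- DNS probes

Config example (monitoring port 5522):
# blackbox.yml
modules:
  tcp_connect_5522:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: ip4

# prometheus.yml
scrape_configs:
  - job_name: 'his_port_probe'
    metrics_path: /probe
    params:
      module: [tcp_connect_5522]
    static_configs:
      - targets:
          - 192.168.8.249:5522
          - 192.168.8.249:1432
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed target as the instance label; without this the
      # instance label would be the exporter address set below
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # Blackbox Exporter address

Metrics collected by Prometheus:
- probe_success: whether the probe succeeded (0/1)
- probe_duration_seconds: how long the probe took
- probe_failed_due_to_regex: 1 if the probe failed because a configured response regex did not match (only relevant when the module defines such a check)

Grafana alert rule:
probe_success{job="his_port_probe",instance="192.168.8.249:5522"} == 0
sustained for 60 seconds → fire an alert

Pros:
- Free and open source, mature community
- Grafana produces very good graphs
- Flexible alerting

Cons:
- Also only probes reachability
- Cannot see connection-state details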

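Netdata (best real-time resolution)

text
Installed on the HIS server itself

Collected automatically:
- Network connection state (TCP state statistics)
- Connection counts for all ports on the host
- TCP retransmission statistics
- NIC traffic

For this incident:
- Shows how the connection count on 5522 changes
- Shows the number of connections in CLOSE_WAIT, TIME_WAIT, and other states
- Updates in real time (1-second granularity)

Install command:
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Access: http://192.168.8.249:19999

Data you can see:
net.tcpconns (TCP connections by state)
  - established
  - syn_sent
  - syn_recv
  - close_wait
  - time_wait
  and so on

Cons:
- Shows TCP state totals for all ports on the machine
- Cannot show the states for port 5522 alone
- Does not support per-port CLOSE_WAIT counts by default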

Layer 3: database-layer monitoring (1432)

SQL Server built-in tools

text
SQL Server Management Studio (SSMS)
Built-in Activity Monitor:
- Current user connections
- Active requests
- Waiting tasks
- Data file I/O

Queries (can be run on a schedule and logged):

-- Current number of user connections
SELECT COUNT(*) as connections
FROM sys.dm_exec_sessions
WHERE is_user_process = 1

-- Blocking
SELECT
    blocking_session_id,
    session_id,
    wait_type,
    wait_time,
    SUBSTRING(st.text, (r.statement_start_offset/2)+1, 100) as sql_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE blocking_session_id != 0

-- Long-running queries (over 30 seconds)
SELECT
    session_id,
    DATEDIFF(SECOND, start_time, GETDATE()) as elapsed_sec,
    status,
    command
FROM sys.dm_exec_requests
WHERE DATEDIFF(SECOND, start_time, GETDATE()) > 30

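Prometheus + SQL Server Exporter

text
Open-source project:
https://github.com/burningalchemist/sql_exporter

What it does:
- Runs custom queries against SQL Server
- Exposes the results for Prometheus to scrape (an exporter is pulled from, it does not push)
- Grafana displays them

Config example:
# sql_exporter.yml
# Note: this jobs/connections/queries layout is the one used by the
# justwatchcom/sql_exporter project; burningalchemist/sql_exporter uses a
# different, collector-based layout. Check which exporter you deploy.
jobs:
  - name: sql_server_his
    interval: '15s'
    connections:
      - 'sqlserver://monitor:password@192.168.8.249:1432'
    queries:
      - name: user_connections
        help: "SQL Server user connections"
        values: [connections]
        query: |
          SELECT COUNT(*) as connections
          FROM sys.dm_exec_sessions
          WHERE is_user_process = 1

      - name: blocked_processes
        help: "Number of blocked processes"
        values: [blocked]
        query: |
          SELECT COUNT(*) as blocked
          FROM sys.dm_exec_requests
          WHERE blocking_session_id != 0

      - name: long_running_queries
        help: "Number of queries running longer than 30 seconds"
        values: [count]
        query: |
          SELECT COUNT(*) as count
          FROM sys.dm_exec_requests
          WHERE DATEDIFF(SECOND, start_time, GETDATE()) > 30

Grafana alert:
sql_server_his_blocked_processes > 5 → fire an alert
(the exact metric name depends on the exporter's naming scheme; confirm it on the exporter's /metrics page before writing the rule)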

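Zabbix SQL Server template

text
Zabbix ships an official SQL Server monitoring template
Template name: Template DB Microsoft SQL Server (the name varies between Zabbix versions)

Items include:
- User connections
- Batch requests/sec
- SQL compilations/sec
- Buffer cache hit ratio
- Page faults/sec
- Lock waits/sec

Importing the template is enough to start; no code needs to be written, although the database connection for the template still has to be configured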

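Layer 4: client-layer monitoring

Weak coverage from existing tools

text
Existing tools can hardly monitor this layer directly:
- Login failure count: needs support from HIS application logs
- Client retry behavior: needs application-level instrumentation

Partial options:

Option 1: analyze HIS application logs
If HIS records login logs, analyze them with ELK or a simple script

Option 2: Windows event log
If HIS uses Windows authentication, failed logins can be pulled from the Security event log
wevtutil qe Security /q:"*[System[(EventID=4625)]]"

Option 3: network-side monitoring (passive)
Capture packets with Wireshark/tcpdump and analyze the login request rate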

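II. Recommended combinations (low cost, deployable)

Option A: minimal (deployable within 1 week)

text
Tool selection:
┌──────────────────────────────────────────────────────┐
│  Smokeping             → link jitter monitoring      │
│  Zabbix                → TCP port probes + alerting  │
│  Netdata               → on-host TCP state           │
│  SSMS Activity Monitor → manual database checks      │
└──────────────────────────────────────────────────────┘

Coverage: link layer ✅ port probing ✅ server state ✅ database ❌ (manual)

Cons:
- Cannot see the exact CLOSE_WAIT count on 5522
- No correlated four-layer score
- The database layer has to be checked by hand

Option B: recommended (deployable in 2-3 weeks)

text
Tool selection:
┌──────────────────────────────────────────────────────────┐
│  Smokeping           → link jitter visualization         │
│  Prometheus          → central metrics collection        │
│    + Blackbox Exp    → TCP port probes (5522/1432)       │
│    + Node Exporter   → host system metrics               │
│    + SQL Exporter    → SQL Server connections/blocking   │
│  Grafana             → unified dashboards + alert rules  │
│  Custom script       → 5522/1432 connection states       │
└──────────────────────────────────────────────────────────┘

Coverage: link layer ✅ port probing ✅ connection state ✅ database ✅

The key supplementary script for this option: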
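Bash
#!/bin/bash
# Runs on the HIS server (assumes a Linux host with iproute2 `ss`).
# Each run writes one snapshot; schedule it every 10 seconds with a systemd
# timer or a wrapper loop (cron alone only goes down to one minute).
# Output is in the format read by the Node Exporter textfile collector.

OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/his_conn.prom"
TMP_FILE="${OUTPUT_FILE}.$$"

collect_port_stats() {
    local port=$1
    local label=$2
    local filter="( dport = :$port or sport = :$port )"
    local established close_wait syn_recv time_wait unique_ips

    # Count connections per state. -H suppresses the header line, which
    # would otherwise be counted as one extra connection.
    established=$(ss -Htn state established "$filter" 2>/dev/null | grep -c .)
    close_wait=$(ss -Htn state close-wait "$filter" 2>/dev/null | grep -c .)
    syn_recv=$(ss -Htn state syn-recv "$filter" 2>/dev/null | grep -c .)
    time_wait=$(ss -Htn state time-wait "$filter" 2>/dev/null | grep -c .)

    # Count distinct client IPs. With a single-state filter ss omits the State
    # column, so the peer address is field 4. IPv4 only: cutting on ':' does
    # not handle IPv6 addresses.
    unique_ips=$(ss -Htn state established "( sport = :$port )" 2>/dev/null | awk '{print $4}' | cut -d: -f1 | sort -u | wc -l)

    echo "his_port_established{port=\"$port\",service=\"$label\"} $established"
    echo "his_port_close_wait{port=\"$port\",service=\"$label\"} $close_wait"
    echo "his_port_syn_recv{port=\"$port\",service=\"$label\"} $syn_recv"
    echo "his_port_time_wait{port=\"$port\",service=\"$label\"} $time_wait"
    echo "his_port_unique_source_ips{port=\"$port\",service=\"$label\"} $unique_ips"
}

# Collect the cumulative TCP retransmission counter (RetransSegs is field 13
# of the Tcp: value line in /proc/net/snmp)
collect_tcp_retransmit() {
    retrans=$(grep Tcp: /proc/net/snmp | tail -1 | awk '{print $13}')
    echo "his_tcp_retransmit_total $retrans"
}

# Write to a temp file and rename it, so the collector never reads a
# half-written file
{
    echo "# HELP his_port_established Established connections on the HIS port"
    echo "# TYPE his_port_established gauge"
    collect_port_stats 5522 "his_business"
    collect_port_stats 1432 "his_database"
    echo ""
    echo "# HELP his_tcp_retransmit_total Cumulative TCP retransmitted segments"
    echo "# TYPE his_tcp_retransmit_total counter"
    collect_tcp_retransmit
} > "$TMP_FILE" && mv "$TMP_FILE" "$OUTPUT_FILE"
text
Together with the Node Exporter textfile collector, this script
exposes the 5522/1432 connection-state details to Prometheus,
which can then be graphed and alerted on in Grafana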
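
Because `his_tcp_retransmit_total` is a cumulative counter, what matters for alerting is its rate of change, not its absolute value. In Prometheus that is what `rate()` is for; the sketch below shows the same arithmetic by hand on two snapshots of the script's output. The helper names and sample values are assumptions for illustration.

```python
def parse_prom(text: str) -> dict:
    """Parse simple 'name{labels} value' lines, skipping comments and blanks."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

def per_second_rate(prev: float, curr: float, interval_s: float) -> float:
    """Rate of a cumulative counter between two samples.

    A decrease is treated as a counter reset (for example after a reboot),
    in which case the current value is taken as the increase.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s

earlier = parse_prom('his_tcp_retransmit_total 1000\n')
later = parse_prom('# TYPE his_tcp_retransmit_total counter\n'
                   'his_port_close_wait{port="5522",service="his_business"} 37\n'
                   'his_tcp_retransmit_total 1200\n')
rate = per_second_rate(earlier["his_tcp_retransmit_total"],
                       later["his_tcp_retransmit_total"], 10)
print(rate)  # retransmitted segments per second over the 10-second interval
```

A sustained rise in this rate alongside growing CLOSE_WAIT on 5522 matches the evolution pattern described earlier in this article.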

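Option C: full version (the custom-built system from the previous answer)

text
On top of Option B, add:
- A correlated four-layer risk score
- Automatic baseline learning
- Avalanche risk early warning
- Pushed remediation suggestions

Suitable for: top-tier (Grade 3A) hospitals with very high HIS stability requirements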
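
The four-layer score is not specified in this article, so the following is only one possible shape for it: normalize one metric per layer against a baseline and a critical value, then take a weighted sum. The weights, layer names, and example numbers are hypothetical and would need tuning against real incident data.

```python
# Hypothetical weights; the port layer is weighted highest because the
# business port is described as the first place the anomaly shows up.
LAYER_WEIGHTS = {"link": 0.2, "port": 0.4, "database": 0.25, "client": 0.15}

def layer_score(value: float, baseline: float, critical: float) -> float:
    """Map a metric onto 0..1: 0 at or below baseline, 1 at or above critical."""
    if critical <= baseline:
        raise ValueError("critical must be greater than baseline")
    return min(1.0, max(0.0, (value - baseline) / (critical - baseline)))

def risk_score(layer_scores: dict) -> float:
    """Weighted 0..100 score; layers without data count as 0."""
    total = sum(weight * layer_scores.get(name, 0.0)
                for name, weight in LAYER_WEIGHTS.items())
    return round(100 * total, 1)

scores = {
    "link": layer_score(2.0, baseline=0.0, critical=5.0),       # packet loss %
    "port": layer_score(300, baseline=100, critical=500),       # CLOSE_WAIT on 5522
    "database": layer_score(150, baseline=100, critical=400),   # connections on 1432
}
print(risk_score(scores))
```

The baseline values here are fixed by hand; the "automatic baseline learning" item above would replace them with values derived from history.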

III. Tool comparison table

text
┌────────────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ Tool           │ Link   │ Port   │ Conn   │ DB     │ Linked │ Deploy │
│                │ layer  │ probe  │ state  │        │ score  │ effort │
├────────────────┼────────┼────────┼────────┼────────┼────────┼────────┤
│ Smokeping      │   ✅   │   ❌   │   ❌   │   ❌   │   ❌   │ easy   │
│ Zabbix         │   ✅   │   ✅   │   ❌   │   ✅   │   ❌   │ medium │
│ Prometheus     │   ✅   │   ✅   │   ✅*  │   ✅   │   ❌   │ medium │
│ + Grafana      │        │        │        │        │        │        │
│ Netdata        │   ✅   │   ❌   │   ✅   │   ❌   │   ❌   │ easy   │
│ LibreNMS       │   ✅   │   ✅   │   ❌   │   ❌   │   ❌   │ medium │
│ PRTG (paid)    │   ✅   │   ✅   │   ❌   │   ✅   │   ❌   │ easy   │
│ Custom script  │   -    │   -    │   ✅   │   -    │   -    │ easy   │
│ Custom system  │   ✅   │   ✅   │   ✅   │   ✅   │   ✅   │ hard   │
└────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘

✅* = needs a custom script to add per-port connection state

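IV. Final recommendation

text
If there is only one ops engineer and time is limited:
→ Use Smokeping + Zabbix + the short script
→ Done within 1 week, covering the most essential monitoring needs

If there are 2-3 people with some technical skill:
→ Use Prometheus + Grafana + Blackbox + SQL Exporter + the custom script
→ Done in 2-3 weeks, roughly 80% coverage, flexible alerting

If stability requirements are very high and development resources exist:
→ Add a custom four-layer scoring engine on top of the above
→ Feed the data from the existing tools into it for correlated analysis
→ No need to start from scratch: existing tools collect, custom code only analyzes and scores

Core point: let existing tools do the collection and build only the "correlated analysis and scoring" layer yourself. That is the least-effort way to get this deployed.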

 

gpt

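Off-the-shelf software can do this, and watching ping with a single tool is not recommended. The most effective setup is a combination of port probing, Windows performance data, switch/firewall SNMP, and correlated alerting. The options below are ordered by ease of deployment and depth.

Yes, off-the-shelf software can monitor this class of failure. Note, though, that the problem is not plain "server monitoring". It is link jitter + connection pile-up on 5522 + propagation to 1432 + TCP retransmissions + client retries, so the tool should be able to monitor Windows servers, TCP ports, SNMP network devices, and alert rules together.

Top recommendation: Zabbix

If you want something free, professional, and usable for the long term, the first choice is:

Zabbix + Windows Agent + SNMP + TCP port probing

Zabbix supports agentless TCP service checks such as net.tcp.service and net.tcp.service.perf, which can directly probe whether 5522 and 1432 on 192.168.8.249 accept connections and whether they are getting slower.
Zabbix also supports Windows performance counters, for example collecting CPU, memory, disk, network, and TCP metrics through perf_counter[].
Switches, firewalls, and gateways can be added to Zabbix over SNMP; the official documentation covers collecting network device metrics via SNMP OIDs and snmpwalk.

For this incident, Zabbix can monitor:

| Target | What is monitored |
|---|---|
| 192.168.8.249 | CPU, memory, disk, NIC traffic, TCP retransmissions |
| 5522 | Whether the HIS business port accepts connections; response time |
| 1432 | Whether the database port accepts connections; response time |
| 192.168.8.254 gateway | ping, SNMP interface traffic, CPU, session count |
| Switch port | Bandwidth on the port facing .249, drop/discard, CRC |
| Connection state | Established, CLOSE_WAIT, SYN_RECEIVED |
| Alerts | Connection pile-up, TCP retransmissions, port failure, sudden traffic drop |

It fits this kind of environment: hospital / LAN / Windows servers / HIS workload, where switches and firewalls all need to be watched.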


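Least-effort option: PRTG Network Monitor

If you want a graphical tool that deploys fast and is friendly to Windows environments, use:

Paessler PRTG

PRTG is an all-in-one network monitoring product; its documentation describes SNMP monitoring of network devices and a large set of sensors and dashboards.
It has a Port Sensor that monitors whether a given TCP/IP port accepts connections, which suits .249:5522 and .249:1432.
It also has an SNMP Traffic Sensor for interface traffic on switches and firewalls.
PRTG can monitor Windows hosts through Windows performance counters, WMI, SNMP, and PowerShell.

PRTG's advantages:

| Advantage | Notes |
|---|---|
| Quick to learn | Graphical configuration, suited to on-site operations |
| Easy TCP port probing | Add port sensors for 5522 and 1432 directly |
| Easy network device monitoring | Switch/firewall traffic directly over SNMP |
| Clear alerting | Web page, email, SMS / WeCom integration is easier |
| Suited to Windows | Good WMI and performance counter support |

The drawback: a large number of sensors requires a paid license.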


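Advanced open-source option: Prometheus + Grafana

If you have technical staff to maintain it and want polished wallboards and complex alerting, use:

Prometheus + Grafana + Blackbox Exporter + windows_exporter

Blackbox Exporter supports black-box probing over HTTP, DNS, TCP, ICMP, and gRPC, so it can probe .249:5522, .249:1432, and ping-style reachability.
windows_exporter collects Windows host metrics; it exposes the metrics of whichever collectors are enabled for Prometheus to scrape.
Grafana can use Prometheus as a data source for display and alerting; Prometheus is an open-source monitoring and alerting system with a time-series database.

It suits a dashboard like this:

HIS avalanche risk wallboard
├─ 5522 port success rate
├─ 1432 port success rate
├─ TCP retransmissions/sec
├─ 5522 Established
├─ 5522 CLOSE_WAIT
├─ 1432 Established
├─ NIC Mbps
├─ Gateway packet loss
├─ Switch port drops
└─ Avalanche risk score

This option is the most flexible, but deployment and maintenance are harder than with Zabbix or PRTG.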


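Commercial all-in-one option: ManageEngine OpManager

If you want to buy an enterprise-oriented, graphical, integrated platform, consider:

ManageEngine OpManager

OpManager's documentation describes SNMP-based monitoring of network device availability and performance, plus fault management.
It can also discover device interfaces and monitor their availability and performance.
For Windows servers, OpManager can monitor performance metrics over WMI.

It suits teams that:

  • manage switches, firewalls, and servers together;
  • need a graphical interface;
  • want fast deployment;
  • want vendor support.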

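Best choice for this scenario

Based on the current requirements, the suggested selection is:

| Scenario | Recommended software |
|---|---|
| Free, professional, extensible long term | Zabbix |
| Fast rollout, graphical, low effort | PRTG |
| Advanced wallboards and complex rules | Prometheus + Grafana |
| Commercial enterprise operations platform | ManageEngine OpManager |
| Temporary monitoring while troubleshooting | PowerShell script + simple PRTG/Zabbix probes |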

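The combination I recommend most

Zabbix + Grafana

The reasons:

Zabbix handles collection and alerting
Grafana handles the wallboard display
The Windows Agent covers metrics on the .249 server
SNMP covers switch/firewall metrics
TCP probes cover availability of 5522/1432
A custom PowerShell item covers CLOSE_WAIT and connection counts

The end result:

| Failure stage | Can it be caught early? |
|---|---|
| Network starts to jitter | Yes, via ping loss, TCP retransmissions, gateway latency |
| 5522 starts to clog | Yes, via 5522 response time and connection count |
| CLOSE_WAIT piles up | Yes, via the custom script |
| 1432 is affected by propagation | Yes, via 1432 connection count and SQL connection count |
| Service avalanche | Yes, combined alerts fire ahead of it |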

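The most important items to monitor for this incident

Whichever software is used, these 10 items must be monitored:

| No. | Metric |
|---|---|
| 1 | 192.168.8.249 ping packet loss |
| 2 | 192.168.8.254 gateway ping packet loss |
| 3 | 192.168.8.249:5522 TCP connect success rate |
| 4 | 192.168.8.249:1432 TCP connect success rate |
| 5 | NIC Mbps on .249 |
| 6 | TCP Retransmitted/sec |
| 7 | 5522 Established count |
| 8 | 5522 CLOSE_WAIT count |
| 9 | 1432 Established count |
| 10 | Switch/firewall port drop, discard, session count |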

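Final recommendation

If you need something live right away, I suggest:

Start with PRTG to quickly deploy port, ping, SNMP, and Windows monitoring; then move to Zabbix for long-term, standardized monitoring.

If you pick only one long-term option:

Choose Zabbix.

It fits this HIS scenario best because it covers servers, ports, switches, firewalls, databases, custom script metrics, and combined alerts in one place.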

 

 

 

 

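HIS DDoS-like avalanche monitoring system - full software development plan

I. Project overview

1.1 Project positioning

text
Project name: HIS-Guardian (HIS avalanche early-warning monitoring platform)
Development time: 16 weeks
Tech stack: Python + Go + Vue3 + InfluxDB + Redis + MySQL
Deployment: standalone on the internal network, no dependency on the internet
Core value (design target): raise a warning 15-30 minutes before an HIS service avalanche

1.2 Architecture overview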
text
┌─────────────────────────────────────────────────────────────┐
│                    HIS-Guardian architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │(Go)      │  │(Go)      │  │(Go)      │  │(Python)  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │          │
│  ┌────▼──────────────▼──────────────▼──────────────▼─────┐  │
│  │                  Data bus (Redis Stream)                │  │
│  └────────────────────────┬───────────────────────────────┘  │
│                           │                                   │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │              Analyzer (Python)                          │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │  │
│  │  │ Scoring  │ │ Trends   │ │ Correl.  │ │ Alerting │  │  │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │  │
│  └────────────────────────┬───────────────────────────────┘  │
│                           │                                   │
│  ┌──────────┐  ┌──────────▼──────┐  ┌──────────────────┐    │
│  │InfluxDB  │◄─│ Storage router  │─►│     MySQL        │    │
│  │(TSDB)    │  └─────────────────┘  │(config/alerts)   │    │
│  └──────────┘                       └──────────────────┘    │
│                           │                                   │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │           Alert dispatch                                │  │
│  │ SMS │ DingTalk/WeCom │ Email │ Siren/light │ Wallboard │  │
│  └────────────────────────────────────────────────────────┘  │
│                           │                                   │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │              Vue3 frontend                              │  │
│  │ Live wallboard │ Trends │ Alert mgmt │ Configuration   │  │
│  └────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

II. Directory structure

text
his-guardian/
├── collector/                    # Data collection layer (Go)
│   ├── cmd/
│   │   └── main.go
│   ├── internal/
│   │   ├── link/                 # Link-layer collection
│   │   │   ├── ping_collector.go
│   │   │   ├── snmp_collector.go
│   │   │   └── netcard_collector.go
│   │   ├── port/                 # Port-layer collection
│   │   │   ├── tcp_probe.go
│   │   │   ├── conn_state.go
│   │   │   └── port_analyzer.go
│   │   ├── database/             # Database-layer collection
│   │   │   ├── mssql_collector.go
│   │   │   └── conn_monitor.go
│   │   ├── client/               # Client-layer collection
│   │   │   └── login_collector.go
│   │   └── publisher/            # Data publishing
│   │       └── redis_publisher.go
│   ├── config/
│   │   └── config.yaml
│   └── go.mod
├── analyzer/                     # Analysis engine (Python)
│   ├── main.py
│   ├── core/
│   │   ├── risk_scorer.py        # Risk scoring engine
│   │   ├── trend_analyzer.py     # Trend analysis
│   │   ├── correlation.py        # Correlation analysis
│   │   ├── alert_decision.py     # Alert decision
│   │   └── baseline_manager.py   # Baseline management
│   ├── models/
│   │   ├── metric.py
│   │   ├── alert.py
│   │   └── risk_score.py
│   ├── storage/
│   │   ├── influx_writer.py
│   │   ├── mysql_writer.py
│   │   └── redis_reader.py
│   ├── notifier/
│   │   ├── sms_sender.py
│   │   ├── dingtalk_sender.py
│   │   ├── email_sender.py
│   │   └── webhook_sender.py
│   └── requirements.txt
├── api/                          # API service (Python FastAPI)
│   ├── main.py
│   ├── routers/
│   │   ├── metrics.py
│   │   ├── alerts.py
│   │   ├── config.py
│   │   └── dashboard.py
│   ├── schemas/
│   │   └── response.py
│   └── requirements.txt
├── frontend/                     # Frontend (Vue3)
│   ├── src/
│   │   ├── views/
│   │   │   ├── Dashboard.vue     # Live wallboard
│   │   │   ├── Metrics.vue       # Metric details
│   │   │   ├── Alerts.vue        # Alert management
│   │   │   └── Config.vue        # System configuration
│   │   ├── components/
│   │   │   ├── RiskGauge.vue     # Risk gauge
│   │   │   ├── MetricChart.vue   # Metric chart
│   │   │   ├── AlertList.vue     # Alert list
│   │   │   └── TopologyMap.vue   # Topology map
│   │   ├── api/
│   │   │   └── index.js
│   │   └── store/
│   │       └── index.js
│   └── package.json
├── database/
│   ├── migrations/
│   │   ├── 001_create_tables.sql
│   │   └── 002_init_config.sql
│   └── influx_init.sh
├── deploy/
│   ├── docker-compose.yml
│   ├── nginx.conf
│   └── systemd/
│       ├── his-collector.service
│       ├── his-analyzer.service
│       └── his-api.service
└── docs/
    ├── deploy.md
    └── api.md

III. Core code

3.1 Collection layer - link-layer collector (Go)

Go
// collector/internal/link/ping_collector.go
package link

import (
	"context"
	"fmt"
	"math"
	"net"
	"sync"
	"time"

	"golang.org/x/net/icmp"
	"golang.org/x/net/ipv4"
)

// PingResult 单次Ping结果
type PingResult struct {
	Target    string
	Timestamp time.Time
	Success   bool
	RTT       float64 // 毫秒
	PacketLoss float64 // 百分比
	Seq       int
}

// PingStats 统计结果(每个采集周期输出一次)
type PingStats struct {
	Target       string
	Timestamp    time.Time
	Sent         int
	Received     int
	PacketLoss   float64
	MinRTT       float64
	MaxRTT       float64
	AvgRTT       float64
	StddevRTT    float64
	Jitter       float64 // 抖动:相邻RTT差值均值
}

// PingCollector 链路探测采集器
type PingCollector struct {
	targets    []string
	interval   time.Duration // 探测间隔
	count      int           // 每次统计的包数
	timeout    time.Duration
	resultChan chan PingStats
	mu         sync.Mutex
}

// NewPingCollector 创建采集器
func NewPingCollector(targets []string, interval time.Duration, count int) *PingCollector {
	return &PingCollector{
		targets:    targets,
		interval:   interval,
		count:      count,
		timeout:    time.Second * 3,
		resultChan: make(chan PingStats, 100),
	}
}

// Start 启动采集
func (p *PingCollector) Start(ctx context.Context) <-chan PingStats {
	for _, target := range p.targets {
		go p.collectTarget(ctx, target)
	}
	return p.resultChan
}

// collectTarget 对单个目标持续探测
func (p *PingCollector) collectTarget(ctx context.Context, target string) {
	ticker := time.NewTicker(p.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			stats := p.doPing(target)
			select {
			case p.resultChan <- stats:
			default:
				// 队列满了丢弃,不阻塞采集
			}
		}
	}
}

// doPing 执行一轮ping统计
func (p *PingCollector) doPing(target string) PingStats {
	stats := PingStats{
		Target:    target,
		Timestamp: time.Now(),
	}

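	conn, err := icmp.ListenPacket("ip4:icmp", "0.0.0.0")
	if err != nil {
		// Raw ICMP sockets need root (or CAP_NET_RAW); fall back to a TCP probe
		return p.tcpFallback(target)
	}
	defer conn.Close()

	dst, err := net.ResolveIPAddr("ip4", target)
	if err != nil {
		// An unresolvable target must not be reported as healthy (0% loss)
		stats.PacketLoss = 100
		return stats
	}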

	rtts := make([]float64, 0, p.count)

	for i := 0; i < p.count; i++ {
		rtt, success := p.sendPing(conn, dst, i)
		stats.Sent++
		if success {
			stats.Received++
			rtts = append(rtts, rtt)
		}
		time.Sleep(time.Millisecond * 200)
	}

	// 计算统计值
	stats.PacketLoss = float64(stats.Sent-stats.Received) / float64(stats.Sent) * 100

	if len(rtts) > 0 {
		stats.MinRTT = rtts[0]
		stats.MaxRTT = rtts[0]
		sum := 0.0
		for _, r := range rtts {
			sum += r
			if r < stats.MinRTT {
				stats.MinRTT = r
			}
			if r > stats.MaxRTT {
				stats.MaxRTT = r
			}
		}
		stats.AvgRTT = sum / float64(len(rtts))

		// 计算标准差
		variance := 0.0
		for _, r := range rtts {
			diff := r - stats.AvgRTT
			variance += diff * diff
		}
		stats.StddevRTT = math.Sqrt(variance / float64(len(rtts)))

		// 计算抖动(相邻RTT差值的均值)
		if len(rtts) > 1 {
			jitterSum := 0.0
			for i := 1; i < len(rtts); i++ {
				jitterSum += math.Abs(rtts[i] - rtts[i-1])
			}
			stats.Jitter = jitterSum / float64(len(rtts)-1)
		}
	}

	return stats
}

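// sendPing sends one ICMP echo request and waits for the matching reply
func (p *PingCollector) sendPing(conn *icmp.PacketConn, dst *net.IPAddr, seq int) (float64, bool) {
	msg := icmp.Message{
		Type: ipv4.ICMPTypeEcho,
		Code: 0,
		Body: &icmp.Echo{
			ID:   1,
			Seq:  seq & 0xffff,
			Data: []byte("his-guardian"),
		},
	}

	data, err := msg.Marshal(nil)
	if err != nil {
		return 0, false
	}

	start := time.Now()
	// A raw "ip4:icmp" socket expects a *net.IPAddr destination
	// (*net.UDPAddr is only for unprivileged "udp4" ICMP sockets)
	if _, err = conn.WriteTo(data, dst); err != nil {
		return 0, false
	}
	if err = conn.SetReadDeadline(start.Add(p.timeout)); err != nil {
		return 0, false
	}

	reply := make([]byte, 1500)
	for {
		n, peer, err := conn.ReadFrom(reply)
		if err != nil {
			return 0, false // timeout or socket error
		}
		// A raw socket sees every ICMP packet on the host, so skip
		// anything that is not the echo reply for this target and sequence
		if peer.String() != dst.String() {
			continue
		}
		parsed, err := icmp.ParseMessage(1, reply[:n]) // 1 = ICMPv4 protocol number
		if err != nil || parsed.Type != ipv4.ICMPTypeEchoReply {
			continue
		}
		if echo, ok := parsed.Body.(*icmp.Echo); !ok || echo.Seq != seq&0xffff {
			continue
		}
		return float64(time.Since(start).Microseconds()) / 1000.0, true
	}
}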

// tcpFallback probes with TCP connects when ICMP is unavailable.
// The whole round counts as one probe: it succeeds as soon as any port
// accepts, so ports that are simply not in use on the target are not
// reported as packet loss.
func (p *PingCollector) tcpFallback(target string) PingStats {
	stats := PingStats{
		Target:    target,
		Timestamp: time.Now(),
		Sent:      1,
	}

	// Try common ports in order
	ports := []int{80, 443, 22, 5522}
	for _, port := range ports {
		addr := fmt.Sprintf("%s:%d", target, port)
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, p.timeout)
		if err != nil {
			continue
		}
		conn.Close()
		rtt := float64(time.Since(start).Microseconds()) / 1000.0
		stats.Received = 1
		stats.AvgRTT = rtt
		stats.MinRTT = rtt
		stats.MaxRTT = rtt
		break
	}

	stats.PacketLoss = float64(stats.Sent-stats.Received) / float64(stats.Sent) * 100
	return stats
}

3.2 Collection layer - TCP port state collector (Go)

Go
// collector/internal/port/conn_state.go
package port

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// ConnState TCP连接状态
type ConnState string

const (
	StateEstablished ConnState = "ESTABLISHED"
	StateSynSent     ConnState = "SYN_SENT"
	StateSynRecv     ConnState = "SYN_RECV"
	StateFinWait1    ConnState = "FIN_WAIT1"
	StateFinWait2    ConnState = "FIN_WAIT2"
	StateTimeWait    ConnState = "TIME_WAIT"
	StateClose       ConnState = "CLOSE"
	StateCloseWait   ConnState = "CLOSE_WAIT"
	StateLastAck     ConnState = "LAST_ACK"
	StateListen      ConnState = "LISTEN"
	StateClosing     ConnState = "CLOSING"
)

// PortStats 端口连接状态统计
type PortStats struct {
	Port      int
	Timestamp time.Time

	// 连接状态计数
	Established int
	SynSent     int
	SynRecv     int
	FinWait1    int
	FinWait2    int
	TimeWait    int
	CloseWait   int
	LastAck     int
	Listen      int
	Total       int

	// 按来源IP统计
	SourceIPCount map[string]int

	// 最大单IP连接数
	MaxSingleIPConn int

	// TCP探测结果
	ProbeSuccess    bool
	ProbeLatency    float64 // ms
}

// ConnStateCollector 连接状态采集器
type ConnStateCollector struct {
	monitorPorts []int
}

// NewConnStateCollector 创建采集器
func NewConnStateCollector(ports []int) *ConnStateCollector {
	return &ConnStateCollector{monitorPorts: ports}
}

// Collect 采集所有监控端口的连接状态
func (c *ConnStateCollector) Collect() ([]PortStats, error) {
	switch runtime.GOOS {
	case "linux":
		return c.collectLinux()
	case "windows":
		return c.collectWindows()
	default:
		return nil, fmt.Errorf("unsupported OS: %s", runtime.GOOS)
	}
}

// collectLinux 从 /proc/net/tcp 解析连接状态
func (c *ConnStateCollector) collectLinux() ([]PortStats, error) {
	// 初始化各端口统计
	statsMap := make(map[int]*PortStats)
	for _, port := range c.monitorPorts {
		statsMap[port] = &PortStats{
			Port:          port,
			Timestamp:     time.Now(),
			SourceIPCount: make(map[string]int),
		}
	}

	// Linux TCP状态码映射
	stateMap := map[string]ConnState{
		"01": StateEstablished,
		"02": StateSynSent,
		"03": StateSynRecv,
		"04": StateFinWait1,
		"05": StateFinWait2,
		"06": StateTimeWait,
		"07": StateClose,
		"08": StateCloseWait,
		"09": StateLastAck,
		"0A": StateListen,
		"0B": StateClosing,
	}

	files := []string{"/proc/net/tcp", "/proc/net/tcp6"}
	for _, filePath := range files {
		if err := c.parseNetTCP(filePath, statsMap, stateMap); err != nil {
			// tcp6不存在时忽略
			continue
		}
	}

	// 计算派生指标
	result := make([]PortStats, 0, len(c.monitorPorts))
	for _, port := range c.monitorPorts {
		s := statsMap[port]
		s.Total = s.Established + s.SynSent + s.SynRecv +
			s.FinWait1 + s.FinWait2 + s.TimeWait +
			s.CloseWait + s.LastAck

		// 找最大单IP连接数
		for _, cnt := range s.SourceIPCount {
			if cnt > s.MaxSingleIPConn {
				s.MaxSingleIPConn = cnt
			}
		}

		// 做一次TCP探测
		s.ProbeSuccess, s.ProbeLatency = c.tcpProbe("127.0.0.1", port)

		result = append(result, *s)
	}

	return result, nil
}

// parseNetTCP 解析 /proc/net/tcp 文件
func (c *ConnStateCollector) parseNetTCP(
	filePath string,
	statsMap map[int]*PortStats,
	stateMap map[string]ConnState,
) error {
	file, err := os.Open(filePath)
	if err != nil {
		return err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Scan() // 跳过标题行

	for scanner.Scan() {
		line := scanner.Text()
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}

		// 本地地址: IP:PORT (十六进制)
		localAddr := fields[1]
		remoteAddr := fields[2]
		stateHex := strings.ToUpper(fields[3])

		localPort := c.hexToPort(localAddr)
		if localPort == 0 {
			continue
		}

		state, ok := stateMap[stateHex]
		if !ok {
			continue
		}

		portStat, monitored := statsMap[localPort]
		if !monitored {
			continue
		}

		// 按状态累加
		switch state {
		case StateEstablished:
			portStat.Established++
		case StateSynSent:
			portStat.SynSent++
		case StateSynRecv:
			portStat.SynRecv++
		case StateFinWait1:
			portStat.FinWait1++
		case StateFinWait2:
			portStat.FinWait2++
		case StateTimeWait:
			portStat.TimeWait++
		case StateCloseWait:
			portStat.CloseWait++
		case StateLastAck:
			portStat.LastAck++
		case StateListen:
			portStat.Listen++
		}

		// 统计来源IP
		if state == StateEstablished {
			remoteIP := c.hexToIP(remoteAddr)
			if remoteIP != "" {
				portStat.SourceIPCount[remoteIP]++
			}
		}
	}

	return scanner.Err()
}

// hexToPort 十六进制地址端口解析
// 格式:0F01A8C0:15A2 -> IP:PORT
func (c *ConnStateCollector) hexToPort(hexAddr string) int {
	parts := strings.Split(hexAddr, ":")
	if len(parts) != 2 {
		return 0
	}
	port, err := strconv.ParseInt(parts[1], 16, 32)
	if err != nil {
		return 0
	}
	return int(port)
}

// hexToIP 十六进制地址转IP
func (c *ConnStateCollector) hexToIP(hexAddr string) string {
	parts := strings.Split(hexAddr, ":")
	if len(parts) != 2 {
		return ""
	}
	hexIP := parts[0]
	if len(hexIP) != 8 {
		return ""
	}
	// 小端序,逐字节解析
	b := make([]byte, 4)
	for i := 0; i < 4; i++ {
		val, err := strconv.ParseUint(hexIP[i*2:(i+1)*2], 16, 8)
		if err != nil {
			return ""
		}
		b[3-i] = byte(val)
	}
	return net.IP(b).String()
}

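// collectWindows is the Windows branch. It is a placeholder: the netstat
// parsing is omitted here, so all connection-state counts stay at zero and
// only the TCP probe result is filled in.
func (c *ConnStateCollector) collectWindows() ([]PortStats, error) {
	// A real implementation would run `netstat -ano` through os/exec
	// and parse its output
	statsMap := make(map[int]*PortStats)
	for _, port := range c.monitorPorts {
		statsMap[port] = &PortStats{
			Port:          port,
			Timestamp:     time.Now(),
			SourceIPCount: make(map[string]int),
		}
	}

	// netstat parsing omitted; the logic mirrors the Linux parser.
	// Important: HIS servers usually run Windows Server, so this branch
	// must be implemented before the collector is useful there.

	result := make([]PortStats, 0, len(c.monitorPorts))
	for _, port := range c.monitorPorts {
		s := statsMap[port]
		s.ProbeSuccess, s.ProbeLatency = c.tcpProbe("127.0.0.1", port)
		result = append(result, *s)
	}
	return result, nil
}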

// tcpProbe TCP连通性探测
func (c *ConnStateCollector) tcpProbe(host string, port int) (bool, float64) {
	addr := fmt.Sprintf("%s:%d", host, port)
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, time.Second*3)
	if err != nil {
		return false, 0
	}
	conn.Close()
	latency := float64(time.Since(start).Microseconds()) / 1000.0
	return true, latency
}
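
The address format that `hexToPort` and `hexToIP` decode is easy to get wrong, so here is a small worked example of the same decoding. It assumes IPv4 entries from `/proc/net/tcp` on a little-endian host (the usual case on x86), and the helper name is made up for this illustration.

```python
import socket
import struct

def decode_proc_addr(hex_addr: str) -> tuple:
    """Decode an IPv4 'ADDRESS:PORT' field from /proc/net/tcp."""
    ip_hex, port_hex = hex_addr.split(":")
    if len(ip_hex) != 8:
        raise ValueError("only IPv4 entries are handled in this sketch")
    # The kernel prints the 32-bit address as a host-order integer, so on a
    # little-endian machine the bytes appear reversed in the hex string.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    # The port is plain hexadecimal.
    return ip, int(port_hex, 16)

print(decode_proc_addr("0F01A8C0:15A2"))
print(decode_proc_addr("F908A8C0:1592"))
```

The first call decodes the example from the Go comment to 192.168.1.15, port 5538; the second shows what the HIS business endpoint 192.168.8.249:5522 looks like in the file. Entries from `/proc/net/tcp6` use 32 hex digits and are rejected here, just as the Go `hexToIP` returns an empty string for them.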

3.3 Collection layer - database connection monitoring (Go)

Go
// collector/internal/database/mssql_collector.go
package database

import (
	"context"
	"database/sql"
	"fmt"
	"strings" // used by collectWaitStats (strings.HasPrefix)
	"time"

	_ "github.com/denisenkom/go-mssqldb"
)

// MSSQLStats SQL Server监控指标
type MSSQLStats struct {
	Timestamp time.Time

	// 连接统计
	UserConnections    int
	ActiveConnections  int
	BlockedProcesses   int

	// 等待统计
	LockWaits        int64
	LogWriteWaits    int64
	PageIOLatchWaits int64
	TotalWaits       int64

	// 性能统计
	BatchRequestsPerSec  float64
	SQLCompilationsPerSec float64
	PageFaultsPerSec     float64

	// 慢查询
	LongRunningQueries []LongRunningQuery

	// 连接池状态
	Port1432Stats PortConnectionInfo
}

// LongRunningQuery 慢查询信息
type LongRunningQuery struct {
	SessionID   int
	ElapsedTime int    // 秒
	Status      string
	Command     string
	DatabaseName string
	LoginName   string
	HostName    string
	WaitType    string
	WaitTime    int
}

// PortConnectionInfo 端口连接信息
type PortConnectionInfo struct {
	Established int
	WaitCount   int
}

// MSSQLCollector SQL Server采集器
type MSSQLCollector struct {
	connStr         string
	db              *sql.DB
	longQueryThresh int // 慢查询阈值(秒)
}

// NewMSSQLCollector 创建采集器
func NewMSSQLCollector(host string, port int, user, password, database string) *MSSQLCollector {
	connStr := fmt.Sprintf(
		"server=%s;port=%d;user id=%s;password=%s;database=%s;connection timeout=5",
		host, port, user, password, database,
	)
	return &MSSQLCollector{
		connStr:         connStr,
		longQueryThresh: 30, // 超过30秒的查询记录
	}
}

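// Connect opens the database connection pool and verifies it with a ping
func (m *MSSQLCollector) Connect() error {
	db, err := sql.Open("sqlserver", m.connStr)
	if err != nil {
		return err
	}
	// Keep the monitoring footprint small so the collector does not add
	// to the connection pressure it is measuring
	db.SetMaxOpenConns(3)
	db.SetMaxIdleConns(1)
	db.SetConnMaxLifetime(time.Minute * 5)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*5)
	defer cancel()

	if err := db.PingContext(ctx); err != nil {
		db.Close() // do not leak the pool when the server is unreachable
		return err
	}
	m.db = db
	return nil
}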

// Collect 采集监控数据
func (m *MSSQLCollector) Collect(ctx context.Context) (*MSSQLStats, error) {
	stats := &MSSQLStats{Timestamp: time.Now()}

	// 并发采集各项指标
	errChan := make(chan error, 4)

	go func() {
		errChan <- m.collectConnections(ctx, stats)
	}()
	go func() {
		errChan <- m.collectWaitStats(ctx, stats)
	}()
	go func() {
		errChan <- m.collectPerfCounters(ctx, stats)
	}()
	go func() {
		errChan <- m.collectLongQueries(ctx, stats)
	}()

	// 等待所有采集完成
	var lastErr error
	for i := 0; i < 4; i++ {
		if err := <-errChan; err != nil {
			lastErr = err
		}
	}

	return stats, lastErr
}

// collectConnections 采集连接数信息
func (m *MSSQLCollector) collectConnections(ctx context.Context, stats *MSSQLStats) error {
	query := `
	SELECT 
		(SELECT COUNT(*) FROM sys.dm_exec_sessions WHERE is_user_process = 1) AS user_connections,
		(SELECT COUNT(*) FROM sys.dm_exec_requests WHERE status = 'running') AS active_connections,
		(SELECT COUNT(*) FROM sys.dm_exec_requests WHERE blocking_session_id != 0) AS blocked_processes
	`
	row := m.db.QueryRowContext(ctx, query)
	return row.Scan(
		&stats.UserConnections,
		&stats.ActiveConnections,
		&stats.BlockedProcesses,
	)
}

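// collectWaitStats collects wait statistics. The counts in sys.dm_os_wait_stats are
// cumulative since the instance started (or since the statistics were last cleared),
// so callers should compare successive samples rather than read them as current load.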
func (m *MSSQLCollector) collectWaitStats(ctx context.Context, stats *MSSQLStats) error {
	query := `
	SELECT 
		wait_type,
		waiting_tasks_count,
		wait_time_ms
	FROM sys.dm_os_wait_stats
	WHERE wait_type IN (
		'LCK_M_S', 'LCK_M_X', 'LCK_M_U',
		'WRITELOG',
		'PAGEIOLATCH_SH', 'PAGEIOLATCH_EX'
	)
	AND waiting_tasks_count > 0
	`
	rows, err := m.db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var waitType string
		var waitCount, waitTime int64
		if err := rows.Scan(&waitType, &waitCount, &waitTime); err != nil {
			continue
		}
		stats.TotalWaits += waitCount

		switch {
		case strings.HasPrefix(waitType, "LCK"):
			stats.LockWaits += waitCount
		case waitType == "WRITELOG":
			stats.LogWriteWaits += waitCount
		case strings.HasPrefix(waitType, "PAGEIOLATCH"):
			stats.PageIOLatchWaits += waitCount
		}
	}
	return rows.Err()
}

// collectLongQueries 采集长时间运行的查询
func (m *MSSQLCollector) collectLongQueries(ctx context.Context, stats *MSSQLStats) error {
	query := `
	SELECT 
		s.session_id,
		DATEDIFF(SECOND, r.start_time, GETDATE()) AS elapsed_seconds,
		r.status,
		r.command,
		DB_NAME(r.database_id) AS database_name,
		s.login_name,
		s.host_name,
		ISNULL(r.wait_type, '') AS wait_type,
		ISNULL(r.wait_time, 0) AS wait_time_ms
	FROM sys.dm_exec_requests r
	INNER JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
	WHERE s.is_user_process = 1
	AND DATEDIFF(SECOND, r.start_time, GETDATE()) > @threshold
	ORDER BY elapsed_seconds DESC
	`
	rows, err := m.db.QueryContext(ctx, query,
		sql.Named("threshold", m.longQueryThresh))
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var q LongRunningQuery
		if err := rows.Scan(
			&q.SessionID, &q.ElapsedTime, &q.Status,
			&q.Command, &q.DatabaseName, &q.LoginName,
			&q.HostName, &q.WaitType, &q.WaitTime,
		); err != nil {
			continue
		}
		stats.LongRunningQueries = append(stats.LongRunningQueries, q)
	}
	return rows.Err()
}

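// collectPerfCounters reads raw performance counters. The "/sec" counters in
// sys.dm_os_performance_counters are cumulative totals since instance start, so the
// values stored here must be differenced between two samples to get a per-second rate.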
func (m *MSSQLCollector) collectPerfCounters(ctx context.Context, stats *MSSQLStats) error {
	query := `
	SELECT counter_name, cntr_value
	FROM sys.dm_os_performance_counters
	WHERE counter_name IN (
		'Batch Requests/sec',
		'SQL Compilations/sec',
		'Page Faults/sec'
	)
	`
	rows, err := m.db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var name string
		var value float64
		if err := rows.Scan(&name, &value); err != nil {
			continue
		}
		switch strings.TrimSpace(name) {
		case "Batch Requests/sec":
			stats.BatchRequestsPerSec = value
		case "SQL Compilations/sec":
			stats.SQLCompilationsPerSec = value
		case "Page Faults/sec":
			stats.PageFaultsPerSec = value
		}
	}
	return rows.Err()
}
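
The "/sec" counters returned by sys.dm_os_performance_counters (for example Batch Requests/sec) are cumulative totals since the instance started, so a single reading is not a rate. The following is a minimal sketch of turning two consecutive samples into a per-second rate, written in Python to match the analysis engine. `CounterSample` and `per_second_rate` are illustrative names, not part of the collector above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSample:
    """One reading of a cumulative counter (hypothetical helper type)."""
    value: float      # raw cntr_value
    timestamp: float  # seconds since the epoch


def per_second_rate(prev: CounterSample, curr: CounterSample) -> Optional[float]:
    """Return the average rate between two samples, or None if it cannot be computed."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        return None  # samples out of order or taken at the same instant
    delta = curr.value - prev.value
    if delta < 0:
        return None  # counter went backwards, e.g. the instance restarted
    return delta / elapsed


# Invented readings: 3000 more batch requests observed 10 seconds later.
prev = CounterSample(value=1_000_000, timestamp=100.0)
curr = CounterSample(value=1_003_000, timestamp=110.0)
print(per_second_rate(prev, curr))
```

Returning None instead of 0 for a counter reset keeps a restart from looking like a sudden drop in load.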

3.4 Analysis Engine - Risk Scoring Core (Python)

Python
# analyzer/core/risk_scorer.py

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from enum import Enum
import time
import logging

logger = logging.getLogger(__name__)


class RiskLevel(Enum):
    """Risk level"""
    NORMAL = 0          # normal (score below 30)
    JITTER = 1          # network jitter (30 to below 60)
    ACCUMULATION = 2    # connection accumulation (60 to below 80)
    AVALANCHE = 3       # avalanche risk (80 to 100)


@dataclass
class MetricScore:
    """单项指标评分"""
    metric_name: str
    current_value: float
    baseline_value: float
    threshold_warn: float
    threshold_critical: float
    score: float            # 0.0-1.0, as returned by _score_single_metric
    weight: float           # 权重
    weighted_score: float   # 加权得分
    description: str


@dataclass
class RiskScoreResult:
    """风险评分结果"""
    timestamp: float
    total_score: float          # 总分 0-100
    risk_level: RiskLevel
    metric_scores: List[MetricScore]
    triggered_rules: List[str]  # 触发的规则描述
    recommendation: str         # 处置建议
    trend: str                  # 趋势:rising/stable/falling
    
    # 各层分数
    link_layer_score: float
    port_layer_score: float
    db_layer_score: float
    client_layer_score: float


@dataclass
class Baseline:
    """基线数据"""
    metric_name: str
    avg: float
    std: float
    p95: float
    p99: float
    sample_count: int
    last_updated: float


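class RiskScorer:
    """
    HIS avalanche risk scoring engine.

    Layer weights:
    - link layer: 30%
    - port layer: 35%
    - database layer: 25%
    - client layer: 10%

    Total score (0-100):
    - below 30: normal
    - 30 to below 60: network jitter warning (level 1)
    - 60 to below 80: connection accumulation warning (level 2)
    - 80 to 100: avalanche risk (levels 3/4)
    """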

    # 指标权重配置
    WEIGHTS = {
        # 链路层 (总权重30%)
        'ping_loss_rate':        0.10,
        'tcp_retransmit_rate':   0.12,
        'gateway_packet_loss':   0.08,

        # 端口层 (总权重35%)
        'port5522_established':  0.12,
        'port5522_close_wait':   0.10,
        'port5522_syn_recv':     0.08,
        'port5522_probe_fail':   0.05,

        # 数据库层 (总权重25%)
        'port1432_established':  0.10,
        'db_blocked_processes':  0.08,
        'db_lock_waits':         0.07,

        # 客户端层 (总权重10%)
        'login_fail_rate':       0.05,
        'client_retry_rate':     0.05,
    }

    def __init__(self, baseline_manager):
        self.baseline_manager = baseline_manager
        self.score_history: List[float] = []
        self.history_max = 60  # 保留最近60次评分

    def score(self, metrics: Dict[str, float]) -> RiskScoreResult:
        """
        对当前指标数据进行风险评分
        
        Args:
            metrics: 各指标当前值的字典
            
        Returns:
            RiskScoreResult: 完整评分结果
        """
        metric_scores = []
        triggered_rules = []
        
        # 各层分数累计
        layer_scores = {
            'link': 0.0,
            'port': 0.0,
            'db': 0.0,
            'client': 0.0,
        }
        layer_weights_used = {
            'link': 0.0,
            'port': 0.0,
            'db': 0.0,
            'client': 0.0,
        }

        # 逐指标评分
        for metric_name, weight in self.WEIGHTS.items():
            if metric_name not in metrics:
                continue

            current = metrics[metric_name]
            baseline = self.baseline_manager.get_baseline(metric_name)
            
            score, warn_thresh, crit_thresh = self._score_single_metric(
                metric_name, current, baseline
            )
            
            weighted = score * weight * 100  # 归一化到权重空间

            ms = MetricScore(
                metric_name=metric_name,
                current_value=current,
                baseline_value=baseline.avg if baseline else 0,
                threshold_warn=warn_thresh,
                threshold_critical=crit_thresh,
                score=score,
                weight=weight,
                weighted_score=weighted,
                description=self._get_metric_description(
                    metric_name, current, score
                )
            )
            metric_scores.append(ms)

            # 分配到对应层
            layer = self._get_metric_layer(metric_name)
            layer_scores[layer] += weighted
            layer_weights_used[layer] += weight

            # 触发规则检查
            rule = self._check_rule(metric_name, current, score, metrics)
            if rule:
                triggered_rules.append(rule)

        # 计算总分(加权求和)
        total_score = sum(ms.weighted_score for ms in metric_scores)
        total_score = min(100.0, max(0.0, total_score))

        # 规范化各层分数到0-100
        def normalize_layer(layer_score, weights_used):
            if weights_used == 0:
                return 0.0
            return min(100.0, layer_score / weights_used)

        # 判断风险等级
        risk_level = self._determine_risk_level(total_score, triggered_rules)

        # 判断趋势
        trend = self._calculate_trend(total_score)

        # 生成处置建议
        recommendation = self._generate_recommendation(
            risk_level, triggered_rules, metrics
        )

        # 记录历史
        self.score_history.append(total_score)
        if len(self.score_history) > self.history_max:
            self.score_history.pop(0)

        return RiskScoreResult(
            timestamp=time.time(),
            total_score=total_score,
            risk_level=risk_level,
            metric_scores=metric_scores,
            triggered_rules=triggered_rules,
            recommendation=recommendation,
            trend=trend,
            link_layer_score=normalize_layer(
                layer_scores['link'], layer_weights_used['link']
            ),
            port_layer_score=normalize_layer(
                layer_scores['port'], layer_weights_used['port']
            ),
            db_layer_score=normalize_layer(
                layer_scores['db'], layer_weights_used['db']
            ),
            client_layer_score=normalize_layer(
                layer_scores['client'], layer_weights_used['client']
            ),
        )

    def _score_single_metric(
        self,
        metric_name: str,
        current: float,
        baseline: Optional[Baseline]
    ) -> Tuple[float, float, float]:
        """
        对单个指标打分 (0.0 ~ 1.0)
        
        Returns:
            (score, warn_threshold, critical_threshold)
        """
        # 各指标的评分规则
        rules = {
            'ping_loss_rate': {
                'warn': 1.0,    # 1%丢包开始预警
                'critical': 5.0,  # 5%丢包严重
                'direction': 'high',  # 越高越危险
            },
            'tcp_retransmit_rate': {
                'warn_multiplier': 3.0,   # 超基线3倍预警
                'critical_multiplier': 8.0,
                'direction': 'high',
            },
            'gateway_packet_loss': {
                'warn': 0.5,
                'critical': 2.0,
                'direction': 'high',
            },
            'port5522_established': {
                'warn_multiplier': 2.0,   # 超基线2倍预警
                'critical_multiplier': 3.0,
                'direction': 'high',
            },
            'port5522_close_wait': {
                'warn': 20,     # 绝对值阈值
                'critical': 50,
                'direction': 'high',
            },
            'port5522_syn_recv': {
                'warn': 10,
                'critical': 30,
                'direction': 'high',
            },
            'port5522_probe_fail': {
                'warn': 0.1,    # 10%探测失败
                'critical': 0.5,
                'direction': 'high',
            },
            'port1432_established': {
                'warn_multiplier': 2.0,
                'critical_multiplier': 3.0,
                'direction': 'high',
            },
            'db_blocked_processes': {
                'warn': 3,
                'critical': 10,
                'direction': 'high',
            },
            'db_lock_waits': {
                'warn_multiplier': 3.0,
                'critical_multiplier': 10.0,
                'direction': 'high',
            },
            'login_fail_rate': {
                'warn': 0.05,   # 5%登录失败率
                'critical': 0.2,
                'direction': 'high',
            },
            'client_retry_rate': {
                'warn': 0.1,
                'critical': 0.3,
                'direction': 'high',
            },
        }

        rule = rules.get(metric_name)
        if not rule:
            return 0.0, 0.0, 0.0

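        # Compute thresholds
        if 'warn_multiplier' in rule:
            if not baseline or baseline.avg <= 0:
                # Baseline-relative metric without a usable baseline: it cannot be
                # scored yet. Falling back to rule.get('warn', 0) would set both
                # thresholds to 0 and score every positive value as 1.0.
                return 0.0, 0.0, 0.0
            warn_thresh = baseline.avg * rule['warn_multiplier']
            critical_thresh = baseline.avg * rule['critical_multiplier']
        else:
            warn_thresh = rule['warn']
            critical_thresh = rule['critical']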

        # 计算分数
        if current <= warn_thresh:
            score = 0.0
        elif current >= critical_thresh:
            score = 1.0
        else:
            # 线性插值
            score = (current - warn_thresh) / (critical_thresh - warn_thresh)

        return score, warn_thresh, critical_thresh

    def _check_rule(
        self,
        metric_name: str,
        current: float,
        score: float,
        all_metrics: Dict[str, float]
    ) -> Optional[str]:
        """
        检查是否触发联动规则(多指标组合判断)
        这是四层联动监测的核心
        """
        if score < 0.3:
            return None

        # 规则1:链路抖动信号
        if metric_name == 'ping_loss_rate' and score >= 0.3:
            tcp_score = self._get_metric_score_from_value(
                'tcp_retransmit_rate',
                all_metrics.get('tcp_retransmit_rate', 0)
            )
            if tcp_score >= 0.3:
                return "⚠️ 链路抖动信号:ping丢包 + TCP重传同时升高,链路可能存在拥塞"

        # 规则2:5522业务入口堆积
        if metric_name == 'port5522_established' and score >= 0.5:
            close_wait = all_metrics.get('port5522_close_wait', 0)
            if close_wait >= 20:
                return f"🔴 5522业务入口堆积:Established升高且CLOSE_WAIT={close_wait},HIS连接释放异常"

        # 规则3:雪崩传导信号
        if metric_name == 'port1432_established' and score >= 0.4:
            port5522_score = self._get_metric_score_from_value(
                'port5522_established',
                all_metrics.get('port5522_established', 0)
            )
            if port5522_score >= 0.4:
                return "🆘 雪崩传导信号:5522和1432连接数同时升高,故障正在向数据库传导"

        # 规则4:客户端重试放大
        if metric_name == 'login_fail_rate' and score >= 0.3:
            retry_rate = all_metrics.get('client_retry_rate', 0)
            if retry_rate >= 0.1:
                return "⚠️ 重试放大风险:登录失败率升高且客户端重试增加,可能形成重试风暴"

        # 规则5:5522探测失败 + 连接堆积
        if metric_name == 'port5522_probe_fail' and current >= 0.1:
            established = all_metrics.get('port5522_established', 0)
            baseline = self.baseline_manager.get_baseline('port5522_established')
            if baseline and established > baseline.avg * 1.5:
                return f"🆘 5522服务异常:TCP探测失败率{current:.1%},同时连接数堆积,服务可能无法接受新连接"

        return None

    def _get_metric_score_from_value(
        self, metric_name: str, value: float
    ) -> float:
        baseline = self.baseline_manager.get_baseline(metric_name)
        score, _, _ = self._score_single_metric(metric_name, value, baseline)
        return score

    def _determine_risk_level(
        self,
        total_score: float,
        triggered_rules: List[str]
    ) -> RiskLevel:
        """
        综合总分和规则触发情况判断风险等级
        规则触发可以升级风险等级
        """
        # 基于分数的基础等级
        if total_score < 30:
            base_level = RiskLevel.NORMAL
        elif total_score < 60:
            base_level = RiskLevel.JITTER
        elif total_score < 80:
            base_level = RiskLevel.ACCUMULATION
        else:
            base_level = RiskLevel.AVALANCHE

        # 雪崩传导规则触发时直接升到AVALANCHE
        for rule in triggered_rules:
            if "雪崩传导" in rule or "服务异常" in rule:
                if base_level.value < RiskLevel.AVALANCHE.value:
                    base_level = RiskLevel.AVALANCHE

        # 重试放大规则 + 连接堆积时升到ACCUMULATION
        for rule in triggered_rules:
            if "重试放大" in rule or "业务入口堆积" in rule:
                if base_level.value < RiskLevel.ACCUMULATION.value:
                    base_level = RiskLevel.ACCUMULATION

        return base_level

    def _calculate_trend(self, current_score: float) -> str:
        """
        Classify the score trend as rising / stable / falling.

        score() calls this before appending current_score to score_history,
        so the current score is included here explicitly.
        """
        scores = self.score_history + [current_score]
        if len(scores) < 10:
            return "stable"

        # Mean of the latest 5 scores versus the 5 scores before them
        recent = scores[-5:]
        older = scores[-10:-5]
        diff = sum(recent) / len(recent) - sum(older) / len(older)

        if diff > 5:
            return "rising"
        elif diff < -5:
            return "falling"
        return "stable"

    def _generate_recommendation(
        self,
        risk_level: RiskLevel,
        triggered_rules: List[str],
        metrics: Dict[str, float]
    ) -> str:
        """根据风险等级和触发规则生成处置建议"""
        
        recommendations = {
            RiskLevel.NORMAL: "✅ 系统运行正常,继续监测",
            
            RiskLevel.JITTER: (
                "⚠️ 检测到网络抖动迹象,建议:\n"
                "1. 检查192.168.8.249和192.168.8.254之间的链路\n"
                "2. 查看交换机端口drop/discard计数\n"
                "3. 检查防火墙CPU和会话数\n"
                "4. 保存当前网络快照备查"
            ),
            
            RiskLevel.ACCUMULATION: (
                "🔴 检测到连接堆积,建议立即:\n"
                "1. 通知用户暂停频繁重新登录\n"
                "2. 检查5522端口CLOSE_WAIT连接并手动清理\n"
                "3. 查看是否有异常来源IP大量建连\n"
                "4. 检查HIS服务线程池状态\n"
                "5. 检查防火墙TCP会话老化策略"
            ),
            
            RiskLevel.AVALANCHE: (
                "🆘 雪崩风险极高,建议紧急处置:\n"
                "1. 立即停止所有客户端重试行为\n"
                "2. 控制客户端分批登录\n"
                "3. 重启HIS业务服务(5522)\n"
                "4. 检查1432数据库连接池并清理长连接\n"
                "5. 检查网关/交换机/防火墙是否存在流量瓶颈\n"
                "⚡ 注意:流量骤降不代表恢复,可能是服务已崩溃!"
            ),
        }

        base_rec = recommendations.get(risk_level, "")

        # 补充特定规则的专项建议
        extras = []
        for rule in triggered_rules:
            if "链路抖动" in rule:
                extras.append("→ 优先排查物理链路和交换机")
            if "雪崩传导" in rule:
                extras.append("→ 优先处理数据库连接池,防止1432连接耗尽")
            if "重试放大" in rule:
                extras.append("→ 在HIS入口添加登录频率限制")

        if extras:
            base_rec += "\n\n专项建议:\n" + "\n".join(extras)

        return base_rec

    def _get_metric_layer(self, metric_name: str) -> str:
        """获取指标所属层"""
        layer_map = {
            'link': ['ping_loss_rate', 'tcp_retransmit_rate', 'gateway_packet_loss'],
            'port': ['port5522_established', 'port5522_close_wait',
                     'port5522_syn_recv', 'port5522_probe_fail'],
            'db': ['port1432_established', 'db_blocked_processes', 'db_lock_waits'],
            'client': ['login_fail_rate', 'client_retry_rate'],
        }
        for layer, metrics in layer_map.items():
            if metric_name in metrics:
                return layer
        return 'link'

    def _get_metric_description(
        self, metric_name: str, value: float, score: float
    ) -> str:
        """生成指标描述文本"""
        level = "正常" if score < 0.3 else ("预警" if score < 0.7 else "严重")
        descriptions = {
            'ping_loss_rate': f"Ping丢包率 {value:.1f}% [{level}]",
            'tcp_retransmit_rate': f"TCP重传率 {value:.4f} [{level}]",
            'gateway_packet_loss': f"网关丢包率 {value:.2f}% [{level}]",
            'port5522_established': f"5522 Established连接数 {int(value)} [{level}]",
            'port5522_close_wait': f"5522 CLOSE_WAIT连接数 {int(value)} [{level}]",
            'port5522_syn_recv': f"5522 SYN_RECV连接数 {int(value)} [{level}]",
            'port5522_probe_fail': f"5522 TCP探测失败率 {value:.1%} [{level}]",
            'port1432_established': f"1432 Established连接数 {int(value)} [{level}]",
            'db_blocked_processes': f"数据库阻塞进程数 {int(value)} [{level}]",
            'db_lock_waits': f"数据库锁等待次数 {int(value)} [{level}]",
            'login_fail_rate': f"登录失败率 {value:.1%} [{level}]",
            'client_retry_rate': f"客户端重试率 {value:.1%} [{level}]",
        }
        return descriptions.get(metric_name, f"{metric_name}: {value:.2f} [{level}]")
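
The scoring arithmetic above can be followed by hand on a few metrics. This is a minimal sketch that reuses the thresholds and weights from the scorer but invents the current values; `interpolate_score` is an illustrative stand-in for the interpolation step in `_score_single_metric`, not the author's function.

```python
from typing import Dict, Tuple


def interpolate_score(current: float, warn: float, critical: float) -> float:
    """Map a metric value to 0.0-1.0: 0 at or below warn, 1 at or above critical."""
    if current <= warn:
        return 0.0
    if current >= critical:
        return 1.0
    return (current - warn) / (critical - warn)


# metric -> (current value, warn threshold, critical threshold, weight)
# Thresholds and weights are copied from the scorer; current values are invented.
sample: Dict[str, Tuple[float, float, float, float]] = {
    'ping_loss_rate':      (3.0, 1.0, 5.0, 0.10),
    'port5522_close_wait': (50.0, 20.0, 50.0, 0.10),
    'port5522_syn_recv':   (5.0, 10.0, 30.0, 0.08),
}

total = 0.0
for name, (current, warn, critical, weight) in sample.items():
    score = interpolate_score(current, warn, critical)
    contribution = score * weight * 100  # same normalization as the scorer
    total += contribution
    print(f"{name}: score={score:.2f}, contribution={contribution:.1f}")

print(f"total={total:.1f}")
```

In this sample CLOSE_WAIT is already at its critical threshold, yet the total stays in the normal band because each metric is capped by its weight. That is one reason the scorer also lets combination rules raise the risk level independently of the total.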

3.5 Analysis Engine - Trend Analyzer (Python)

Python
# analyzer/core/trend_analyzer.py

from collections import deque
from typing import Dict, List, Optional, Tuple
import statistics
import time


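class TrendAnalyzer:
    """
    Trend analyzer.

    Core functions:
    1. Detect metrics that are rising rapidly (a precursor of an avalanche)
    2. Detect "traffic dropped sharply but the fault has not cleared"
       (a crashed service mistaken for a recovery)
    3. Determine which of 5522 and 1432 became abnormal first
    4. Detect amplification caused by client retries
    """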

    def __init__(self, window_size: int = 30):
        """
        Args:
            window_size: 滑动窗口大小(时间点数量)
        """
        self.window_size = window_size
        # 各指标的历史数据窗口
        self.history: Dict[str, deque] = {}
        # 各指标首次异常时间(用于因果分析)
        self.first_anomaly_time: Dict[str, Optional[float]] = {}

    def update(self, metric_name: str, value: float, timestamp: Optional[float] = None):
        """更新指标历史数据"""
        if timestamp is None:
            timestamp = time.time()

        if metric_name not in self.history:
            self.history[metric_name] = deque(maxlen=self.window_size)
            self.first_anomaly_time[metric_name] = None

        self.history[metric_name].append((timestamp, value))

    def detect_rapid_rise(
        self,
        metric_name: str,
        threshold_multiplier: float = 1.5,
        window: int = 5
    ) -> Tuple[bool, float]:
        """
        检测指标是否在快速上升(相比前N个点)
        
        Returns:
            (is_rising, rise_rate): 是否快速上升,上升倍率
        """
        if metric_name not in self.history:
            return False, 0.0

        data = list(self.history[metric_name])
        if len(data) < window * 2:
            return False, 0.0

        # Compare the mean of the latest `window` points with the mean of the `window` points before them
        recent = [v for _, v in data[-window:]]
        older = [v for _, v in data[-window*2:-window]]

        avg_recent = statistics.mean(recent) if recent else 0
        avg_older = statistics.mean(older) if older else 0

        if avg_older <= 0:
            return False, 0.0

        rise_rate = avg_recent / avg_older
        return rise_rate >= threshold_multiplier, rise_rate

    def detect_traffic_collapse(
        self,
        flow_metric: str,
        error_metric: str,
        collapse_threshold: float = 0.3  # 流量降至原来的30%以下
    ) -> bool:
        """
        检测"流量骤降但错误未减少"场景
        这是服务完全崩溃的信号,不要误判为恢复
        
        Returns:
            True: 检测到服务崩溃(流量降,错误仍高)
        """
        if flow_metric not in self.history or error_metric not in self.history:
            return False

        flow_data = list(self.history[flow_metric])
        error_data = list(self.history[error_metric])

        if len(flow_data) < 10 or len(error_data) < 5:
            return False

        # 流量是否骤降(当前 < 历史峰值的30%)
        recent_flow = statistics.mean([v for _, v in flow_data[-3:]])
        peak_flow = max(v for _, v in flow_data[:-3]) if len(flow_data) > 3 else 0

        if peak_flow <= 0:
            return False

        flow_ratio = recent_flow / peak_flow
        is_flow_collapsed = flow_ratio < collapse_threshold

        # 错误率是否仍然高
        recent_errors = statistics.mean([v for _, v in error_data[-3:]])
        is_error_high = recent_errors > 0.1  # 错误率>10%

        return is_flow_collapsed and is_error_high

    def analyze_causality(
        self,
        metric_a: str,
        metric_b: str,
        anomaly_threshold_a: float,
        anomaly_threshold_b: float
    ) -> Optional[str]:
        """
        因果关系分析:判断A和B谁先出现异常
        用于区分 "5522先于1432异常"(业务传导)vs "1432先于5522"(数据库首发)
        
        Returns:
            'A_first': A先异常
            'B_first': B先异常
            'simultaneous': 同时异常
            None: 尚未检测到异常
        """
        first_a = self._find_first_anomaly_time(metric_a, anomaly_threshold_a)
        first_b = self._find_first_anomaly_time(metric_b, anomaly_threshold_b)

        if first_a is None and first_b is None:
            return None
        if first_a is None:
            return 'B_first'
        if first_b is None:
            return 'A_first'

        time_diff = abs(first_a - first_b)
        if time_diff < 60:  # 60秒内视为同时
            return 'simultaneous'
        elif first_a < first_b:
            return 'A_first'
        else:
            return 'B_first'

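    def _find_first_anomaly_time(
        self,
        metric_name: str,
        threshold: float
    ) -> Optional[float]:
        """
        Return the earliest timestamp at which the metric reached or exceeded the threshold.

        Only the current sliding window is searched, so an anomaly older than
        window_size points has already been evicted and will not be found.
        """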
        if metric_name not in self.history:
            return None

        for timestamp, value in self.history[metric_name]:
            if value >= threshold:
                return timestamp
        return None

    def get_rate_of_change(self, metric_name: str, points: int = 5) -> float:
        """
        计算指标的变化速率(每分钟)
        
        Returns:
            正值表示上升,负值表示下降
        """
        if metric_name not in self.history:
            return 0.0

        data = list(self.history[metric_name])
        if len(data) < 2:
            return 0.0

        recent = data[-min(points, len(data)):]
        if len(recent) < 2:
            return 0.0

        time_span = recent[-1][0] - recent[0][0]
        if time_span <= 0:
            return 0.0

        value_change = recent[-1][1] - recent[0][1]
        # 每分钟变化量
        return value_change / (time_span / 60.0)

    def is_sustained_high(
        self,
        metric_name: str,
        threshold: float,
        sustained_points: int = 3
    ) -> bool:
        """
        判断指标是否持续超过阈值(连续N个点)
        避免偶发抖动误报
        """
        if metric_name not in self.history:
            return False

        data = list(self.history[metric_name])
        if len(data) < sustained_points:
            return False

        recent = data[-sustained_points:]
        return all(v >= threshold for _, v in recent)

    def generate_trend_report(self, metrics: List[str]) -> Dict[str, dict]:
        """生成趋势报告"""
        report = {}
        for metric in metrics:
            is_rising, rate = self.detect_rapid_rise(metric)
            roc = self.get_rate_of_change(metric)

            report[metric] = {
                'is_rapidly_rising': is_rising,
                'rise_rate': round(rate, 2),
                'rate_of_change_per_min': round(roc, 4),
                'data_points': len(self.history.get(metric, [])),
            }
        return report
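
The "traffic dropped but errors stayed high" check is easiest to see on a small synthetic series. This is a standalone sketch of the same idea as `detect_traffic_collapse`, under the assumption that both series are sampled at the same cadence. The function name, the `error_threshold` parameter, and all numbers are invented for illustration.

```python
from statistics import mean
from typing import List


def looks_like_false_recovery(
    flow: List[float],
    error_rate: List[float],
    collapse_threshold: float = 0.3,
    error_threshold: float = 0.1,
    recent_points: int = 3,
) -> bool:
    """True when recent traffic fell below a fraction of its earlier peak
    while the recent error rate is still above error_threshold."""
    if len(flow) <= recent_points or len(error_rate) < recent_points:
        return False
    peak = max(flow[:-recent_points])  # peak before the recent points
    if peak <= 0:
        return False
    flow_collapsed = mean(flow[-recent_points:]) / peak < collapse_threshold
    errors_high = mean(error_rate[-recent_points:]) > error_threshold
    return flow_collapsed and errors_high


# Synthetic series (invented numbers): traffic peaks at 900, then collapses.
flow = [400, 450, 600, 800, 900, 850, 700, 120, 90, 60]
still_failing = [0.0, 0.0, 0.02, 0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.8]
recovered = [0.0, 0.0, 0.02, 0.1, 0.3, 0.5, 0.6, 0.05, 0.01, 0.0]

print(looks_like_false_recovery(flow, still_failing))  # traffic down, errors still high
print(looks_like_false_recovery(flow, recovered))      # traffic down, errors gone
```

The traffic curve is identical in both calls. Only the error series separates a crashed service from a quiet one, which is why traffic volume should not be used as a recovery signal on its own.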

3.6 Analysis Engine - Alert Decision Engine (Python)

Python
# analyzer/core/alert_decision.py

import time
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from enum import Enum

from .risk_scorer import RiskScoreResult, RiskLevel

logger = logging.getLogger(__name__)


class AlertStatus(Enum):
    FIRING = "firing"       # 正在告警
    RESOLVED = "resolved"   # 已恢复
    SUPPRESSED = "suppressed"  # 被抑制(避免重复告警)


@dataclass
class Alert:
    """告警实体"""
    alert_id: str
    level: RiskLevel
    title: str
    description: str
    triggered_rules: List[str]
    recommendation: str
    score: float
    timestamp: float
    status: AlertStatus = AlertStatus.FIRING
    notified: bool = False
    resolved_at: Optional[float] = None
    
    # 各层分数快照
    link_score: float = 0.0
    port_score: float = 0.0
    db_score: float = 0.0
    client_score: float = 0.0


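class AlertDecisionEngine:
    """
    Alert decision engine.

    Core logic:
    1. Debounce: short-lived fluctuations must not cause frequent alerts
    2. Deduplicate: an alert of the same level is not resent within its suppression period
    3. Escalate: once a higher risk level passes its debounce count, alert again
       without waiting for the suppression period
    4. Recover: send a recovery notice after the risk has dropped
    5. Special case: a sharp traffic drop must not be mistaken for recovery
    """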

    # 各级别告警的抑制时间(秒)
    SUPPRESSION_DURATION = {
        RiskLevel.NORMAL: 0,
        RiskLevel.JITTER: 300,       # 5分钟内不重复一级告警
        RiskLevel.ACCUMULATION: 180, # 3分钟内不重复二级告警
        RiskLevel.AVALANCHE: 60,     # 1分钟内不重复三四级告警
    }

    # 告警确认所需连续触发次数(防抖)
    CONFIRM_COUNT = {
        RiskLevel.NORMAL: 0,
        RiskLevel.JITTER: 3,        # 连续3次才告警
        RiskLevel.ACCUMULATION: 2,
        RiskLevel.AVALANCHE: 1,     # 雪崩级别立即告警
    }

    # 恢复确认所需连续正常次数
    RECOVERY_CONFIRM_COUNT = 5

    def __init__(self, notifier_manager):
        self.notifier_manager = notifier_manager
        
        # 当前活跃告警
        self.active_alerts: Dict[str, Alert] = {}
        
        # 连续触发计数(防抖用)
        self.trigger_counts: Dict[RiskLevel, int] = {level: 0 for level in RiskLevel}
        
        # 连续正常计数(恢复确认用)
        self.normal_counts: int = 0
        
        # 当前处于的告警级别
        self.current_alert_level: RiskLevel = RiskLevel.NORMAL
        
        # 上次告警时间(各级别)
        self.last_alert_time: Dict[RiskLevel, float] = {}
        
        # 告警历史(用于存储)
        self.alert_history: List[Alert] = []

    def process(self, score_result: RiskScoreResult) -> Optional[Alert]:
        """
        处理评分结果,决定是否告警
        
        Returns:
            如果需要发送告警,返回Alert对象;否则返回None
        """
        current_level = score_result.risk_level
        
        # 更新触发计数
        for level in RiskLevel:
            if level == current_level:
                self.trigger_counts[level] += 1
            else:
                self.trigger_counts[level] = 0

        # 判断是否需要告警
        alert = self._decide_alert(score_result)
        
        # 判断是否需要发送恢复通知
        recovery = self._check_recovery(score_result)
        
        if recovery:
            self._send_recovery(recovery)
            
        if alert:
            self._send_alert(alert)
            return alert
            
        return None

    def _decide_alert(self, score_result: RiskScoreResult) -> Optional[Alert]:
        """判断是否应该发送告警"""
        current_level = score_result.risk_level
        
        if current_level == RiskLevel.NORMAL:
            return None

        # 检查防抖:是否达到确认次数
        required_count = self.CONFIRM_COUNT[current_level]
        if self.trigger_counts[current_level] < required_count:
            logger.debug(
                f"防抖检查:{current_level.name} 触发 "
                f"{self.trigger_counts[current_level]}/{required_count} 次"
            )
            return None

        # 检查是否比当前告警级别更高(升级场景立即告警)
        if current_level.value > self.current_alert_level.value:
            return self._create_alert(score_result, "升级告警")

        # 检查抑制:同级别是否在抑制期内
        suppression = self.SUPPRESSION_DURATION[current_level]
        last_time = self.last_alert_time.get(current_level, 0)
        if time.time() - last_time < suppression:
            return None

        # 正常情况:发送告警
        return self._create_alert(score_result, "新增告警")

    def _create_alert(self, score_result: RiskScoreResult, reason: str) -> Alert:
        """创建告警对象"""
        level_names = {
            RiskLevel.JITTER: "一级预警:网络抖动",
            RiskLevel.ACCUMULATION: "二级预警:连接堆积",
            RiskLevel.AVALANCHE: "三/四级预警:雪崩风险",
        }

        level_emojis = {
            RiskLevel.JITTER: "⚠️",
            RiskLevel.ACCUMULATION: "🔴",
            RiskLevel.AVALANCHE: "🆘",
        }

        level = score_result.risk_level
        emoji = level_emojis.get(level, "⚠️")
        title = level_names.get(level, "HIS系统告警")

        # 构建描述
        description_lines = [
            f"风险评分:{score_result.total_score:.1f}分",
            f"趋势:{score_result.trend}",
            "",
            "各层状态:",
            f"  链路层:{score_result.link_layer_score:.0f}分",
            f"  端口层:{score_result.port_layer_score:.0f}分",
            f"  数据库层:{score_result.db_layer_score:.0f}分",
            f"  客户端层:{score_result.client_layer_score:.0f}分",
        ]

        if score_result.triggered_rules:
            description_lines.append("")
            description_lines.append("触发规则:")
            for rule in score_result.triggered_rules:
                description_lines.append(f"  {rule}")

        alert = Alert(
            alert_id=f"alert_{int(time.time())}_{level.value}",
            level=level,
            title=f"{emoji} {title}",
            description="\n".join(description_lines),
            triggered_rules=score_result.triggered_rules,
            recommendation=score_result.recommendation,
            score=score_result.total_score,
            timestamp=score_result.timestamp,
            link_score=score_result.link_layer_score,
            port_score=score_result.port_layer_score,
            db_score=score_result.db_layer_score,
            client_score=score_result.client_layer_score,
        )

        self.active_alerts[alert.alert_id] = alert
        self.alert_history.append(alert)
        self.current_alert_level = level
        self.last_alert_time[level] = time.time()

        logger.info(f"创建告警 [{reason}]: {title}, 分数={score_result.total_score:.1f}")
        return alert

    def _check_recovery(self, score_result: RiskScoreResult) -> Optional[Alert]:
        """检查是否需要发送恢复通知"""
        if score_result.risk_level == RiskLevel.NORMAL:
            self.normal_counts += 1
        else:
            self.normal_counts = 0
            return None

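        # Special case: a "false recovery" caused by a sharp traffic drop must not
        # produce a recovery notice. That detection lives in TrendAnalyzer; this
        # method receives no such flag, so the caller has to withhold recovery
        # handling while a traffic collapse is suspected.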

        if self.normal_counts >= self.RECOVERY_CONFIRM_COUNT:
            if self.current_alert_level != RiskLevel.NORMAL:
                # 发送恢复通知
                old_level = self.current_alert_level
                self.current_alert_level = RiskLevel.NORMAL
                self.normal_counts = 0

                # 找到最近的活跃告警
                if self.active_alerts:
                    latest_alert_id = max(
                        self.active_alerts.keys(),
                        key=lambda k: self.active_alerts[k].timestamp
                    )
                    recovery_alert = self.active_alerts.pop(latest_alert_id)
                    recovery_alert.status = AlertStatus.RESOLVED
                    recovery_alert.resolved_at = time.time()

                    logger.info(f"系统恢复:从 {old_level.name} 恢复到正常")
                    return recovery_alert

        return None

    def _send_alert(self, alert: Alert):
        """通过通知管理器发送告警"""
        try:
            self.notifier_manager.send(alert)
            alert.notified = True
        except Exception as e:
            logger.error(f"发送告警失败:{e}")

    def _send_recovery(self, alert: Alert):
        """发送恢复通知"""
        try:
            self.notifier_manager.send_recovery(alert)
        except Exception as e:
            logger.error(f"发送恢复通知失败:{e}")
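
The recovery check above only reports recovery after several consecutive normal scores. A minimal standalone sketch of that counter follows. The class name `RecoveryDebouncer` and its fields are illustrative and not part of the original code; the default of 5 is taken from the wording of the recovery message, not from a confirmed value of `RECOVERY_CONFIRM_COUNT`.

```python
class RecoveryDebouncer:
    """Confirm recovery only after N consecutive normal readings (illustrative)."""

    def __init__(self, confirm_count: int = 5):
        self.confirm_count = confirm_count
        self.normal_streak = 0
        self.alerting = False

    def observe(self, is_normal: bool) -> bool:
        """Feed one reading; return True exactly when recovery is confirmed."""
        if not is_normal:
            # Any abnormal reading resets the streak and marks us as alerting.
            self.normal_streak = 0
            self.alerting = True
            return False
        self.normal_streak += 1
        if self.alerting and self.normal_streak >= self.confirm_count:
            self.alerting = False
            self.normal_streak = 0
            return True
        return False
```

A single abnormal reading in the middle of a quiet period restarts the count, which is what keeps a flapping link from producing alternating alert and recovery messages.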

3.7 Baseline manager (Python)

Python
# analyzer/core/baseline_manager.py

import json
import time
import statistics
import logging
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from pathlib import Path

logger = logging.getLogger(__name__)


@dataclass
class Baseline:
    """指标基线"""
    metric_name: str
    avg: float
    std: float
    p95: float
    p99: float
    sample_count: int
    last_updated: float
    time_window_hours: int  # 基线采样时间窗口


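class BaselineManager:
    """
    Baseline manager.

    Responsibilities:
    - Store the normal baseline of each metric
    - Learn baselines automatically from historical data collected while the system is healthy
    - Allow baselines to be configured manually
    - Provide the reference for "how many times above baseline should raise an alert"
    """

    # Default baselines (used on first deployment, later replaced by learned values)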
    DEFAULT_BASELINES = {
        'ping_loss_rate': Baseline(
            metric_name='ping_loss_rate',
            avg=0.0, std=0.1, p95=0.5, p99=1.0,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'tcp_retransmit_rate': Baseline(
            metric_name='tcp_retransmit_rate',
            avg=0.001, std=0.0005, p95=0.003, p99=0.005,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'port5522_established': Baseline(
            metric_name='port5522_established',
            avg=50.0, std=20.0, p95=100.0, p99=150.0,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'port5522_close_wait': Baseline(
            metric_name='port5522_close_wait',
            avg=2.0, std=2.0, p95=8.0, p99=12.0,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'port5522_syn_recv': Baseline(
            metric_name='port5522_syn_recv',
            avg=1.0, std=1.0, p95=3.0, p99=5.0,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'port1432_established': Baseline(
            metric_name='port1432_established',
            avg=20.0, std=10.0, p95=50.0, p99=80.0,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'db_blocked_processes': Baseline(
            metric_name='db_blocked_processes',
            avg=0.0, std=0.5, p95=1.0, p99=2.0,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
        'login_fail_rate': Baseline(
            metric_name='login_fail_rate',
            avg=0.01, std=0.01, p95=0.03, p99=0.05,
            sample_count=0, last_updated=0, time_window_hours=24
        ),
    }

    def __init__(self, storage_path: str = "/var/lib/his-guardian/baselines.json"):
        self.storage_path = Path(storage_path)
        self.baselines: Dict[str, Baseline] = {}
        self.raw_samples: Dict[str, List[float]] = {}  # 原始样本,用于自动学习
        self.max_samples = 10000  # 每个指标最多保留多少原始样本

        self._load()

    def get_baseline(self, metric_name: str) -> Optional[Baseline]:
        """获取指标基线"""
        return self.baselines.get(
            metric_name,
            self.DEFAULT_BASELINES.get(metric_name)
        )

    def add_sample(self, metric_name: str, value: float):
        """
        添加样本值(用于自动学习基线)
        只在系统正常时添加(由外部判断)
        """
        if metric_name not in self.raw_samples:
            self.raw_samples[metric_name] = []

        samples = self.raw_samples[metric_name]
        samples.append(value)

        # 超过上限时,丢弃最老的样本
        if len(samples) > self.max_samples:
            self.raw_samples[metric_name] = samples[-self.max_samples:]

    def recalculate_baseline(self, metric_name: str) -> Optional[Baseline]:
        """
        重新计算指标基线
        建议每小时执行一次
        """
        samples = self.raw_samples.get(metric_name, [])
        if len(samples) < 100:  # 样本不足100个时不更新
            return None

        sorted_samples = sorted(samples)
        n = len(sorted_samples)

        baseline = Baseline(
            metric_name=metric_name,
            avg=statistics.mean(samples),
            std=statistics.stdev(samples) if len(samples) > 1 else 0,
            p95=sorted_samples[int(n * 0.95)],
            p99=sorted_samples[int(n * 0.99)],
            sample_count=n,
            last_updated=time.time(),
            time_window_hours=24,
        )

        self.baselines[metric_name] = baseline
        logger.info(
            f"基线更新:{metric_name} avg={baseline.avg:.4f} "
            f"p95={baseline.p95:.4f} p99={baseline.p99:.4f} "
            f"samples={n}"
        )
        self._save()
        return baseline

    def recalculate_all(self):
        """重新计算所有指标的基线"""
        for metric_name in self.raw_samples:
            self.recalculate_baseline(metric_name)

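    def set_manual_baseline(self, metric_name: str, avg: float, p95: Optional[float] = None):
        """Set a baseline manually (useful right after deployment)."""
        if p95 is None:
            # Without an explicit p95, assume twice the average.
            p95 = avg * 2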

        baseline = Baseline(
            metric_name=metric_name,
            avg=avg,
            std=avg * 0.2,
            p95=p95,
            p99=p95 * 1.5,
            sample_count=0,
            last_updated=time.time(),
            time_window_hours=24,
        )
        self.baselines[metric_name] = baseline
        self._save()
        logger.info(f"手动设置基线:{metric_name} avg={avg}")

    def _load(self):
        """Load baselines from the JSON file; fall back to the defaults."""
        if not self.storage_path.exists():
            logger.info("Baseline file not found, using default baselines")
            self.baselines = dict(self.DEFAULT_BASELINES)
            return

        try:
            with open(self.storage_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            for name, bl_data in data.items():
                self.baselines[name] = Baseline(**bl_data)
            logger.info(f"Loaded baselines for {len(self.baselines)} metrics")
        except Exception as e:
            logger.error(f"Failed to load baselines: {e}; using default baselines")
            self.baselines = dict(self.DEFAULT_BASELINES)

    def _save(self):
        """Persist baselines to the JSON file."""
        try:
            self.storage_path.parent.mkdir(parents=True, exist_ok=True)
            data = {name: asdict(bl) for name, bl in self.baselines.items()}
            with open(self.storage_path, 'w', encoding='utf-8') as f:
                json.dump(data, f, indent=2)
        except Exception as e:
            logger.error(f"Failed to save baselines: {e}")
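
The manager stores baselines so that a rule can ask how far a reading sits above normal. One way that comparison might look is sketched below. The helper names `baseline_ratio` and `exceeds_baseline` and the `floor` guard are assumptions for illustration; the 2.0 multiplier mirrors the `port5522_established_multiplier_warn` value shown in the sample configuration.

```python
def baseline_ratio(value: float, baseline_avg: float, floor: float = 1.0) -> float:
    """Return how many times the baseline average the current value is.

    `floor` (an assumption) avoids dividing by a baseline at or near zero,
    which is common for metrics such as blocked processes.
    """
    return value / max(baseline_avg, floor)


def exceeds_baseline(value: float, baseline_avg: float, multiplier: float = 2.0) -> bool:
    """True when the value is at least `multiplier` times the baseline average."""
    return baseline_ratio(value, baseline_avg) >= multiplier


# Example: default port 5522 baseline avg is 50 connections.
print(exceeds_baseline(120, 50.0))  # 120 / 50 = 2.4, above the 2.0 multiplier
print(exceeds_baseline(80, 50.0))   # 80 / 50 = 1.6, below it
```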

3.8 Alert notification: DingTalk sender (Python)

Python
# analyzer/notifier/dingtalk_sender.py

import hashlib
import hmac
import base64
import time
import json
import urllib.parse
import urllib.request
import logging
from typing import Optional

from ..core.alert_decision import Alert, AlertStatus
from ..core.risk_scorer import RiskLevel

logger = logging.getLogger(__name__)


class DingTalkSender:
    """
    钉钉机器人告警发送器
    支持签名验证的安全模式
    """

    def __init__(self, webhook_url: str, secret: str):
        self.webhook_url = webhook_url
        self.secret = secret

    def send_alert(self, alert: Alert) -> bool:
        """发送告警通知"""
        message = self._build_alert_message(alert)
        return self._send(message)

    def send_recovery(self, alert: Alert) -> bool:
        """发送恢复通知"""
        message = self._build_recovery_message(alert)
        return self._send(message)

    def _build_alert_message(self, alert: Alert) -> dict:
        """构建告警消息体(Markdown格式)"""
        level_colors = {
            RiskLevel.JITTER: "#FF8C00",       # 橙色
            RiskLevel.ACCUMULATION: "#FF4500",  # 红橙色
            RiskLevel.AVALANCHE: "#FF0000",     # 红色
        }
        color = level_colors.get(alert.level, "#FF8C00")

        # 计算持续时间
        duration = int(time.time() - alert.timestamp)
        duration_str = f"{duration // 60}分{duration % 60}秒"

        markdown_text = f"""## {alert.title}

**风险评分:** <font color="{color}">{alert.score:.1f}分</font>

**发生时间:** {self._format_time(alert.timestamp)}

---

### 📊 各层状态

| 监测层 | 分数 | 状态 |
|--------|------|------|
| 🔗 链路层 | {alert.link_score:.0f}分 | {self._score_emoji(alert.link_score)} |
| 🚪 端口层(5522) | {alert.port_score:.0f}分 | {self._score_emoji(alert.port_score)} |
| 🗄️ 数据库层(1432) | {alert.db_score:.0f}分 | {self._score_emoji(alert.db_score)} |
| 👥 客户端层 | {alert.client_score:.0f}分 | {self._score_emoji(alert.client_score)} |

---

### ⚡ 触发规则

{self._format_rules(alert.triggered_rules)}

---

### 💡 处置建议

{alert.recommendation}

---

> 告警ID:{alert.alert_id} | 持续:{duration_str}
"""

        return {
            "msgtype": "markdown",
            "markdown": {
                "title": alert.title,
                "text": markdown_text,
            },
            "at": {
                "isAtAll": alert.level == RiskLevel.AVALANCHE,  # 雪崩级别@所有人
            }
        }

    def _build_recovery_message(self, alert: Alert) -> dict:
        """构建恢复消息体"""
        if alert.resolved_at and alert.timestamp:
            duration = int(alert.resolved_at - alert.timestamp)
            duration_str = f"{duration // 60}分{duration % 60}秒"
        else:
            duration_str = "未知"

        markdown_text = f"""## ✅ HIS系统恢复正常

**原告警:** {alert.title}

**恢复时间:** {self._format_time(alert.resolved_at or time.time())}

**持续时长:** {duration_str}

> 系统已连续5次评分正常,确认恢复。
> 告警ID:{alert.alert_id}
"""

        return {
            "msgtype": "markdown",
            "markdown": {
                "title": "✅ HIS系统恢复正常",
                "text": markdown_text,
            }
        }

    def _send(self, message: dict) -> bool:
        """发送消息到钉钉"""
        try:
            url = self._get_signed_url()
            data = json.dumps(message).encode('utf-8')

            req = urllib.request.Request(
                url,
                data=data,
                headers={'Content-Type': 'application/json'},
                method='POST'
            )

            with urllib.request.urlopen(req, timeout=10) as resp:
                result = json.loads(resp.read())
                if result.get('errcode') == 0:
                    logger.info("钉钉告警发送成功")
                    return True
                else:
                    logger.error(f"钉钉告警发送失败:{result}")
                    return False

        except Exception as e:
            logger.error(f"钉钉发送异常:{e}")
            return False

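    def _get_signed_url(self) -> str:
        """Build the webhook URL with the timestamp and HMAC-SHA256 signature appended."""
        if not self.secret:
            # No signing secret configured: send to the webhook URL unsigned.
            return self.webhook_url

        timestamp = str(round(time.time() * 1000))  # milliseconds
        string_to_sign = f"{timestamp}\n{self.secret}"
        hmac_code = hmac.new(
            self.secret.encode('utf-8'),
            string_to_sign.encode('utf-8'),
            digestmod=hashlib.sha256
        ).digest()
        sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
        # The webhook URL normally already carries ?access_token=..., so append with "&".
        separator = '&' if '?' in self.webhook_url else '?'
        return f"{self.webhook_url}{separator}timestamp={timestamp}&sign={sign}"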

    def _format_time(self, ts: float) -> str:
        import datetime
        return datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

    def _format_rules(self, rules: list) -> str:
        if not rules:
            return "无特定规则触发"
        return "\n".join(f"- {rule}" for rule in rules)

    def _score_emoji(self, score: float) -> str:
        if score < 30:
            return "🟢 正常"
        elif score < 60:
            return "🟡 预警"
        elif score < 80:
            return "🟠 堆积"
        else:
            return "🔴 危险"
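
The signing step in `_get_signed_url` can be checked offline before the robot goes live. The sketch below isolates it as a pure function; the name `dingtalk_sign` and the sample secret are made up for illustration, and the scheme is simply the one used in the sender above (HMAC-SHA256 over `timestamp\nsecret`, then Base64, then URL-encoding).

```python
import base64
import hashlib
import hmac
import urllib.parse


def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Return the URL-encoded signature for a given secret and millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # quote_plus also escapes "+", "/" and "=" from the Base64 alphabet.
    return urllib.parse.quote_plus(base64.b64encode(digest))


sign = dingtalk_sign("SEC-example-secret", 1700000000000)
raw = base64.b64decode(urllib.parse.unquote_plus(sign))
print(len(raw))  # SHA-256 digests are 32 bytes long
```

Taking the timestamp as a parameter makes the function deterministic, so it can be unit-tested without patching `time.time()`.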

3.9 API service (FastAPI)

Python
# api/main.py

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.middleware.cors import CORSMiddleware
import asyncio
import json
import time
from typing import List

app = FastAPI(
    title="HIS-Guardian API",
    description="HIS雪崩早期预警监测系统",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# WebSocket connection manager
class ConnectionManager:
    def __init__(self):
        self.active_connections: List[WebSocket] = []

    async def connect(self, websocket: WebSocket):
        await websocket.accept()
        self.active_connections.append(websocket)

    def disconnect(self, websocket: WebSocket):
        # Tolerate double removal (for example, broadcast() already dropped it).
        if websocket in self.active_connections:
            self.active_connections.remove(websocket)

    async def broadcast(self, message: dict):
        data = json.dumps(message, ensure_ascii=False)
        for connection in self.active_connections[:]:
            try:
                await connection.send_text(data)
            except Exception:
                self.disconnect(connection)

manager = ConnectionManager()


@app.websocket("/ws/realtime")
async def websocket_realtime(websocket: WebSocket):
    """Push real-time data over WebSocket."""
    await manager.connect(websocket)
    try:
        while True:
            # Read the latest scoring result from Redis and push it
            data = await get_latest_score_from_redis()
            await websocket.send_json(data)
            await asyncio.sleep(5)  # push every 5 seconds
    except WebSocketDisconnect:
        pass
    finally:
        # Also runs on other send errors, so dead sockets are always dropped.
        manager.disconnect(websocket)


@app.get("/api/v1/dashboard/summary")
async def get_dashboard_summary():
    """仪表盘摘要数据"""
    return {
        "timestamp": time.time(),
        "risk_score": 45.2,
        "risk_level": "JITTER",
        "risk_level_name": "一级预警:网络抖动",
        "trend": "rising",
        "layer_scores": {
            "link": 55.0,
            "port": 40.0,
            "db": 30.0,
            "client": 20.0,
        },
        "active_alerts": 1,
        "metrics_summary": {
            "ping_loss_rate": 1.2,
            "port5522_established": 120,
            "port5522_close_wait": 8,
            "port1432_established": 45,
            "tcp_retransmit_rate": 0.003,
        }
    }


@app.get("/api/v1/metrics/history")
async def get_metrics_history(
    metric: str,
    hours: int = 1
):
    """获取指标历史数据(用于绘图)"""
    # 从InfluxDB查询历史数据
    data = await query_influxdb(metric, hours)
    return {
        "metric": metric,
        "hours": hours,
        "data": data
    }


@app.get("/api/v1/alerts")
async def get_alerts(
    status: str = "all",
    limit: int = 50
):
    """获取告警列表"""
    # 从MySQL查询告警
    alerts = await query_alerts_from_db(status, limit)
    return {
        "total": len(alerts),
        "alerts": alerts
    }


@app.get("/api/v1/alerts/{alert_id}")
async def get_alert_detail(alert_id: str):
    """获取告警详情"""
    alert = await query_alert_detail(alert_id)
    return alert


@app.get("/api/v1/baseline")
async def get_baseline_info():
    """获取所有指标基线信息"""
    return {
        "baselines": [
            {
                "metric_name": "port5522_established",
                "avg": 50.0,
                "p95": 100.0,
                "p99": 150.0,
                "warn_threshold": 100.0,
                "critical_threshold": 150.0,
                "last_updated": time.time() - 3600,
            }
            # ... 其他指标
        ]
    }


@app.post("/api/v1/baseline/{metric_name}")
async def update_baseline(metric_name: str, avg: float, p95: float = None):
    """手动更新指标基线"""
    # 调用BaselineManager更新
    return {"success": True, "message": f"基线已更新:{metric_name}"}


@app.get("/api/v1/config")
async def get_config():
    """获取系统配置"""
    return {
        "monitor_targets": {
            "his_server": "192.168.8.249",
            "gateway": "192.168.8.254",
            "his_port": 5522,
            "db_port": 1432,
        },
        "alert_config": {
            "dingtalk_enabled": True,
            "sms_enabled": False,
            "email_enabled": True,
        },
        "thresholds": {
            "ping_loss_warn": 1.0,
            "ping_loss_critical": 5.0,
            "port5522_established_multiplier_warn": 2.0,
            "port5522_close_wait_warn": 20,
        }
    }


# 辅助函数(实际实现连接InfluxDB和MySQL)
async def get_latest_score_from_redis():
    """从Redis获取最新评分"""
    # 实际实现
    return {}

async def query_influxdb(metric: str, hours: int):
    """查询InfluxDB历史数据"""
    # 实际实现
    return []

async def query_alerts_from_db(status: str, limit: int):
    """查询MySQL告警数据"""
    # 实际实现
    return []

async def query_alert_detail(alert_id: str):
    """查询告警详情"""
    # 实际实现
    return {}
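
The `broadcast` method above sends to every client and drops the ones that fail. The sketch below reproduces that pattern with the standard library only, so it can be run without FastAPI. `FakeConnection` is a hypothetical stand-in for a WebSocket, and the module-level `broadcast` function is an illustration rather than the API's actual code.

```python
import asyncio
import json


class FakeConnection:
    """Stand-in for a WebSocket; alive=False simulates a client that went away."""

    def __init__(self, alive: bool = True):
        self.alive = alive
        self.received = []

    async def send_text(self, data: str):
        if not self.alive:
            raise ConnectionError("client went away")
        self.received.append(data)


async def broadcast(connections: list, message: dict) -> int:
    """Send to every connection, remove the ones that fail, return the delivered count."""
    data = json.dumps(message, ensure_ascii=False)
    delivered = 0
    for conn in connections[:]:  # iterate over a copy so removal is safe
        try:
            await conn.send_text(data)
            delivered += 1
        except Exception:
            connections.remove(conn)
    return delivered


conns = [FakeConnection(), FakeConnection(alive=False), FakeConnection()]
print(asyncio.run(broadcast(conns, {"risk_level": "JITTER"})), len(conns))
```

Iterating over a copy of the list is the important detail: removing from the list being iterated would skip the element that follows each dead connection.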

3.10 Core frontend component: real-time dashboard (Vue 3)

vue
<!-- frontend/src/views/Dashboard.vue -->
<template>
  <div class="dashboard">
    <!-- 顶部状态栏 -->
    <div class="status-bar" :class="riskLevelClass">
      <div class="status-left">
        <span class="system-name">🏥 HIS-Guardian 雪崩早期预警系统</span>
        <span class="current-time">{{ currentTime }}</span>
      </div>
      <div class="status-right">
        <span class="risk-badge">{{ riskLevelName }}</span>
        <span class="score-display">风险评分:{{ score.toFixed(1) }}</span>
        <span class="trend-icon">{{ trendIcon }}</span>
      </div>
    </div>

    <!-- 主要内容区 -->
    <div class="main-content">
      
      <!-- 左侧:风险仪表盘 + 各层状态 -->
      <div class="left-panel">
        <!-- 总体风险仪表盘 -->
        <div class="panel risk-gauge-panel">
          <h3>总体风险评分</h3>
          <RiskGauge :score="score" :level="riskLevel" />
          <div class="risk-description">{{ riskDescription }}</div>
        </div>

        <!-- 四层状态卡片 -->
        <div class="panel layer-status-panel">
          <h3>四层监测状态</h3>
          <div class="layer-cards">
            <LayerCard
              v-for="layer in layers"
              :key="layer.name"
              :layer="layer"
            />
          </div>
        </div>
      </div>

      <!-- 中间:拓扑图 + 关键指标 -->
      <div class="center-panel">
        <!-- 网络拓扑图 -->
        <div class="panel topology-panel">
          <h3>实时链路状态</h3>
          <TopologyMap
            :metrics="currentMetrics"
            :risk-level="riskLevel"
          />
        </div>

        <!-- 五个核心指标实时值 -->
        <div class="panel core-metrics-panel">
          <h3>五个核心指标</h3>
          <div class="core-metrics-grid">
            <CoreMetricCard
              v-for="metric in coreMetrics"
              :key="metric.name"
              :metric="metric"
            />
          </div>
        </div>
      </div>

      <!-- 右侧:告警列表 + 处置建议 -->
      <div class="right-panel">
        <!-- 当前告警 -->
        <div class="panel alert-panel">
          <h3>活跃告警 <span class="badge">{{ activeAlerts.length }}</span></h3>
          <AlertList :alerts="activeAlerts" />
        </div>

        <!-- 处置建议 -->
        <div class="panel recommendation-panel">
          <h3>💡 处置建议</h3>
          <div class="recommendation-text">{{ recommendation }}</div>
        </div>

        <!-- 触发规则 -->
        <div class="panel rules-panel" v-if="triggeredRules.length">
          <h3>⚡ 触发规则</h3>
          <ul class="rules-list">
            <li v-for="(rule, i) in triggeredRules" :key="i">{{ rule }}</li>
          </ul>
        </div>
      </div>
    </div>

    <!-- 底部:历史趋势图 -->
    <div class="bottom-panel">
      <div class="panel trend-panel">
        <h3>历史趋势(最近1小时)</h3>
        <div class="chart-tabs">
          <button
            v-for="tab in chartTabs"
            :key="tab.key"
            :class="{ active: activeTab === tab.key }"
            @click="activeTab = tab.key"
          >
            {{ tab.label }}
          </button>
        </div>
        <MetricChart
          :metric-key="activeTab"
          :data="chartData[activeTab]"
          :thresholds="chartThresholds[activeTab]"
        />
      </div>
    </div>
  </div>
</template>

<script setup>
import { ref, computed, onMounted, onUnmounted } from 'vue'
import RiskGauge from '@/components/RiskGauge.vue'
import LayerCard from '@/components/LayerCard.vue'
import TopologyMap from '@/components/TopologyMap.vue'
import CoreMetricCard from '@/components/CoreMetricCard.vue'
import AlertList from '@/components/AlertList.vue'
import MetricChart from '@/components/MetricChart.vue'

// 状态数据
const score = ref(0)
const riskLevel = ref('NORMAL')
const trend = ref('stable')
const recommendation = ref('')
const triggeredRules = ref([])
const activeAlerts = ref([])
const currentMetrics = ref({})
const chartData = ref({})

const activeTab = ref('port5522_established')

// WebSocket连接
let ws = null
const currentTime = ref('')

// 计算属性
const riskLevelClass = computed(() => ({
  'status-normal': riskLevel.value === 'NORMAL',
  'status-jitter': riskLevel.value === 'JITTER',
  'status-accumulation': riskLevel.value === 'ACCUMULATION',
  'status-avalanche': riskLevel.value === 'AVALANCHE',
}))

const riskLevelName = computed(() => {
  const names = {
    NORMAL: '✅ 系统正常',
    JITTER: '⚠️ 一级预警:网络抖动',
    ACCUMULATION: '🔴 二级预警:连接堆积',
    AVALANCHE: '🆘 三/四级预警:雪崩风险',
  }
  return names[riskLevel.value] || '未知'
})

const riskDescription = computed(() => {
  const descs = {
    NORMAL: '系统运行正常,各项指标在基线范围内',
    JITTER: '检测到网络抖动迹象,请关注链路状态',
    ACCUMULATION: '连接数堆积,HIS业务入口压力增大',
    AVALANCHE: '系统存在雪崩风险,请立即处置!',
  }
  return descs[riskLevel.value] || ''
})

const trendIcon = computed(() => {
  const icons = { rising: '📈', stable: '➡️', falling: '📉' }
  return icons[trend.value] || '➡️'
})

const layers = computed(() => [
  {
    name: 'link',
    label: '🔗 链路层',
    score: currentMetrics.value.link_score || 0,
    desc: 'Ping丢包 / TCP重传 / 网关',
  },
  {
    name: 'port',
    label: '🚪 端口层',
    score: currentMetrics.value.port_score || 0,
    desc: '5522连接状态',
  },
  {
    name: 'db',
    label: '🗄️ 数据库层',
    score: currentMetrics.value.db_score || 0,
    desc: '1432连接 / SQL等待',
  },
  {
    name: 'client',
    label: '👥 客户端层',
    score: currentMetrics.value.client_score || 0,
    desc: '登录失败 / 重试行为',
  },
])

const coreMetrics = computed(() => [
  {
    name: 'TCP重传',
    key: 'tcp_retransmit_rate',
    value: currentMetrics.value.tcp_retransmit_rate,
    unit: '',
    icon: '🔄',
    warn: 0.003,
    critical: 0.008,
  },
  {
    name: '5522连接数',
    key: 'port5522_established',
    value: currentMetrics.value.port5522_established,
    unit: '个',
    icon: '🚪',
    warn: 100,
    critical: 150,
  },
  {
    name: '5522 CLOSE_WAIT',
    key: 'port5522_close_wait',
    value: currentMetrics.value.port5522_close_wait,
    unit: '个',
    icon: '⏳',
    warn: 20,
    critical: 50,
  },
  {
    name: '1432连接数',
    key: 'port1432_established',
    value: currentMetrics.value.port1432_established,
    unit: '个',
    icon: '🗄️',
    warn: 40,
    critical: 60,
  },
  {
    name: '网关丢包率',
    key: 'gateway_packet_loss',
    value: currentMetrics.value.gateway_packet_loss,
    unit: '%',
    icon: '🌐',
    warn: 0.5,
    critical: 2.0,
  },
])

const chartTabs = [
  { key: 'port5522_established', label: '5522连接数' },
  { key: 'port5522_close_wait', label: 'CLOSE_WAIT' },
  { key: 'port1432_established', label: '1432连接数' },
  { key: 'tcp_retransmit_rate', label: 'TCP重传' },
  { key: 'ping_loss_rate', label: 'Ping丢包率' },
  { key: 'risk_score', label: '风险评分' },
]

const chartThresholds = {
  port5522_established: [
    { value: 100, color: '#FF8C00', label: '预警线' },
    { value: 150, color: '#FF0000', label: '严重线' },
  ],
  port5522_close_wait: [
    { value: 20, color: '#FF8C00', label: '预警线' },
    { value: 50, color: '#FF0000', label: '严重线' },
  ],
  // ...其他指标
}

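// WebSocket connection
function connectWebSocket() {
  // Derive the host from the page URL so the dashboard also works when opened
  // from another machine; port 8000 matches the API service in docker-compose.
  const protocol = window.location.protocol === 'https:' ? 'wss' : 'ws'
  ws = new WebSocket(`${protocol}://${window.location.hostname}:8000/ws/realtime`)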

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data)
    updateDashboard(data)
  }

  ws.onclose = () => {
    // 5秒后重连
    setTimeout(connectWebSocket, 5000)
  }

  ws.onerror = (error) => {
    console.error('WebSocket错误:', error)
  }
}

function updateDashboard(data) {
  score.value = data.total_score || 0
  riskLevel.value = data.risk_level || 'NORMAL'
  trend.value = data.trend || 'stable'
  recommendation.value = data.recommendation || ''
  triggeredRules.value = data.triggered_rules || []
  currentMetrics.value = data.metrics || {}
}

// 时间更新
let timeTimer = null
function updateTime() {
  currentTime.value = new Date().toLocaleString('zh-CN')
}

onMounted(() => {
  connectWebSocket()
  updateTime()
  timeTimer = setInterval(updateTime, 1000)
})

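onUnmounted(() => {
  if (ws) {
    // Detach the handler first, otherwise close() schedules a reconnect after unmount.
    ws.onclose = null
    ws.close()
  }
  clearInterval(timeTimer)
})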
</script>

<style scoped>
.dashboard {
  display: flex;
  flex-direction: column;
  height: 100vh;
  background: #0a0e1a;
  color: #e0e6f0;
  font-family: 'Microsoft YaHei', sans-serif;
}

.status-bar {
  display: flex;
  justify-content: space-between;
  align-items: center;
  padding: 12px 24px;
  border-bottom: 2px solid #1e3a5f;
  transition: background 0.5s;
}

.status-normal { background: #0d2137; border-color: #1a8a4a; }
.status-jitter { background: #2d1a00; border-color: #FF8C00; }
.status-accumulation { background: #2d0800; border-color: #FF4500; }
.status-avalanche { 
  background: #1a0000; 
  border-color: #FF0000;
  animation: pulse-red 1s infinite;
}

@keyframes pulse-red {
  0%, 100% { border-color: #FF0000; }
  50% { border-color: #ff6666; }
}

.main-content {
  display: grid;
  grid-template-columns: 300px 1fr 320px;
  gap: 12px;
  padding: 12px;
  flex: 1;
  overflow: hidden;
}

.panel {
  background: #0d2137;
  border: 1px solid #1e3a5f;
  border-radius: 8px;
  padding: 16px;
  margin-bottom: 12px;
}

.panel h3 {
  color: #7eb8e8;
  margin: 0 0 12px;
  font-size: 14px;
  border-bottom: 1px solid #1e3a5f;
  padding-bottom: 8px;
}

.risk-badge {
  background: rgba(255,255,255,0.1);
  padding: 4px 12px;
  border-radius: 16px;
  font-weight: bold;
  margin-right: 16px;
}

.score-display {
  font-size: 18px;
  font-weight: bold;
  color: #7eb8e8;
}

.bottom-panel {
  padding: 0 12px 12px;
  height: 200px;
}

.chart-tabs {
  display: flex;
  gap: 8px;
  margin-bottom: 8px;
}

.chart-tabs button {
  padding: 4px 12px;
  border: 1px solid #1e3a5f;
  background: transparent;
  color: #7eb8e8;
  border-radius: 4px;
  cursor: pointer;
  font-size: 12px;
}

.chart-tabs button.active {
  background: #1e3a5f;
  color: #fff;
}

.rules-list {
  list-style: none;
  padding: 0;
  margin: 0;
}

.rules-list li {
  padding: 6px 0;
  border-bottom: 1px solid #1e3a5f;
  font-size: 12px;
  color: #c0cfe0;
}
</style>
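
The dashboard's `updateDashboard` function reads `total_score`, `risk_level`, `trend`, `recommendation`, `triggered_rules` and `metrics` from each WebSocket message, and the layer cards read `link_score`, `port_score`, `db_score` and `client_score` from inside `metrics`. A server-side helper that produces a message in that shape might look like the following sketch. The function name `build_realtime_payload` and its parameters are assumptions; only the key names come from the component above.

```python
import json


def build_realtime_payload(total_score, risk_level, trend, layer_scores, metrics,
                           recommendation="", triggered_rules=None):
    """Assemble one WebSocket message in the shape the dashboard reads."""
    merged = dict(metrics)
    # The layer cards look up <layer>_score inside `metrics`.
    for layer in ("link", "port", "db", "client"):
        merged[f"{layer}_score"] = layer_scores.get(layer, 0)
    return {
        "total_score": total_score,
        "risk_level": risk_level,
        "trend": trend,
        "recommendation": recommendation,
        "triggered_rules": list(triggered_rules or []),
        "metrics": merged,
    }


payload = build_realtime_payload(
    45.2, "JITTER", "rising",
    layer_scores={"link": 55.0, "port": 40.0},
    metrics={"port5522_established": 120},
)
print(json.dumps(payload, ensure_ascii=False))
```

Missing layers default to 0, which matches the `|| 0` fallback the component applies on its side.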

4. Database design

4.1 MySQL table-creation SQL

SQL
-- database/migrations/001_create_tables.sql

-- 告警记录表
CREATE TABLE alerts (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    alert_id    VARCHAR(64) NOT NULL UNIQUE,
    level       TINYINT NOT NULL COMMENT '1=抖动 2=堆积 3=雪崩',
    title       VARCHAR(256) NOT NULL,
    description TEXT,
    recommendation TEXT,
    score       FLOAT NOT NULL,
    link_score  FLOAT DEFAULT 0,
    port_score  FLOAT DEFAULT 0,
    db_score    FLOAT DEFAULT 0,
    client_score FLOAT DEFAULT 0,
    status      VARCHAR(32) DEFAULT 'firing' COMMENT 'firing/resolved/suppressed',
    triggered_at BIGINT NOT NULL COMMENT 'Unix时间戳',
    resolved_at  BIGINT DEFAULT NULL,
    created_at  DATETIME DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_status (status),
    INDEX idx_triggered_at (triggered_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='告警记录';

-- 告警触发规则详情表
CREATE TABLE alert_rules (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    alert_id    VARCHAR(64) NOT NULL,
    rule_text   VARCHAR(512) NOT NULL,
    created_at  DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (alert_id) REFERENCES alerts(alert_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- 指标基线配置表
CREATE TABLE metric_baselines (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    metric_name VARCHAR(128) NOT NULL UNIQUE,
    avg_value   FLOAT NOT NULL DEFAULT 0,
    std_value   FLOAT NOT NULL DEFAULT 0,
    p95_value   FLOAT NOT NULL DEFAULT 0,
    p99_value   FLOAT NOT NULL DEFAULT 0,
    sample_count INT DEFAULT 0,
    updated_at  DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    is_manual   TINYINT DEFAULT 0 COMMENT '是否手动设置'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='指标基线';

-- 系统配置表
CREATE TABLE system_config (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    config_key  VARCHAR(128) NOT NULL UNIQUE,
    config_value TEXT NOT NULL,
    description VARCHAR(256),
    updated_at  DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='系统配置';

-- 初始化系统配置
INSERT INTO system_config (config_key, config_value, description) VALUES
('his_server_ip', '192.168.8.249', 'HIS服务器IP'),
('gateway_ip', '192.168.8.254', '网关IP'),
('his_port', '5522', 'HIS业务端口'),
('db_port', '1432', '数据库端口'),
('collect_interval_sec', '10', '采集间隔(秒)'),
('ping_count_per_round', '5', '每轮Ping包数'),
('alert_dingtalk_webhook', '', '钉钉机器人Webhook'),
('alert_dingtalk_secret', '', '钉钉机器人Secret'),
('alert_sms_phones', '', '短信告警手机号(逗号分隔)'),
('alert_email_to', '', '邮件告警收件人'),
('score_warn_threshold', '30', '一级预警分数阈值'),
('score_accumulation_threshold', '60', '二级预警分数阈值'),
('score_avalanche_threshold', '80', '三级预警分数阈值');
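
The three score thresholds seeded above (30, 60, 80) partition the 0-100 risk score into four bands. A minimal sketch of that mapping follows, assuming the level names used by the dashboard (`NORMAL`, `JITTER`, `ACCUMULATION`, `AVALANCHE`) and inclusive lower bounds; the function name `classify_score` is illustrative.

```python
def classify_score(score: float,
                   warn: float = 30.0,
                   accumulation: float = 60.0,
                   avalanche: float = 80.0) -> str:
    """Map a total risk score to a level name using the configured thresholds."""
    if score >= avalanche:
        return "AVALANCHE"
    if score >= accumulation:
        return "ACCUMULATION"
    if score >= warn:
        return "JITTER"
    return "NORMAL"


for value in (12.0, 45.2, 60.0, 93.5):
    print(value, classify_score(value))
```

Checking the highest threshold first keeps the function correct even if the thresholds are later tuned closer together.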

4.2 InfluxDB initialization

Bash
#!/bin/bash
# database/influx_init.sh

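# Create the raw-data bucket (30-day retention).
# Skip this step if the bucket already exists, for example when it was created
# through DOCKER_INFLUXDB_INIT_BUCKET; creating the same bucket twice fails.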
influx bucket create \
  --name his_metrics \
  --retention 720h \
  --org his-guardian

# 创建用于长期保留的汇总存储桶(保留1年)
influx bucket create \
  --name his_metrics_summary \
  --retention 8760h \
  --org his-guardian

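# Create the downsampling task: every hour, aggregate the raw 10s points into
# 1-minute means and write them to the long-term summary bucket.
# The Flux script is passed as a positional argument (or via --file).
influx task create --org his-guardian '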
option task = {name: "downsample_his_metrics", every: 1h}

data = from(bucket: "his_metrics")
  |> range(start: -2h)
  |> filter(fn: (r) => r._measurement == "his_monitor")

data
  |> aggregateWindow(every: 1m, fn: mean)
  |> to(bucket: "his_metrics_summary", org: "his-guardian")
'
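
What the task's `aggregateWindow(every: 1m, fn: mean)` does to the raw 10-second points can be illustrated in a few lines of Python. This is a sketch of the idea only, not a replacement for the Flux task; the function name `downsample_mean` is made up, and unlike Flux, which by default stamps each result with the window's stop time, the sketch labels each window by its start.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points, window_sec: int = 60):
    """Group (unix_ts, value) points into fixed windows and average each window.

    Returns a list of (window_start, mean_value) sorted by time.
    Windows without any point are simply absent from the result.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        window_start = int(ts // window_sec) * window_sec
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 10), (10, 20), (50, 30), (60, 100), (119, 200)]
print(downsample_mean(raw))
```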

5. Deployment configuration

5.1 Docker Compose

YAML
# deploy/docker-compose.yml

version: '3.8'

services:
  # 时序数据库
  influxdb:
    image: influxdb:2.7
    container_name: his-influxdb
    ports:
      - "8086:8086"
    volumes:
      - influx_data:/var/lib/influxdb2
    environment:
      DOCKER_INFLUXDB_INIT_MODE: setup
      DOCKER_INFLUXDB_INIT_USERNAME: admin
      DOCKER_INFLUXDB_INIT_PASSWORD: his-guardian-2024
      DOCKER_INFLUXDB_INIT_ORG: his-guardian
      DOCKER_INFLUXDB_INIT_BUCKET: his_metrics
    restart: unless-stopped

  # MySQL
  mysql:
    image: mysql:8.0
    container_name: his-mysql
    ports:
      - "3306:3306"
    volumes:
      - mysql_data:/var/lib/mysql
      - ./database/migrations:/docker-entrypoint-initdb.d
    environment:
      MYSQL_ROOT_PASSWORD: his-guardian-2024
      MYSQL_DATABASE: his_guardian
      MYSQL_USER: his_guardian
      MYSQL_PASSWORD: his-guardian-2024
    restart: unless-stopped

  # Redis(数据总线 + 缓存)
  redis:
    image: redis:7.2
    container_name: his-redis
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
    restart: unless-stopped

  # 分析引擎
  analyzer:
    build:
      context: ./analyzer
      dockerfile: Dockerfile
    container_name: his-analyzer
    depends_on:
      - influxdb
      - mysql
      - redis
    volumes:
      - /var/lib/his-guardian:/var/lib/his-guardian
      - ./analyzer/config.yaml:/app/config.yaml
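    environment:
      # With host networking, Compose service names do not resolve;
      # use the ports published on the host instead.
      REDIS_URL: redis://127.0.0.1:6379
      INFLUX_URL: http://127.0.0.1:8086
      MYSQL_URL: mysql+pymysql://his_guardian:his-guardian-2024@127.0.0.1:3306/his_guardian
    restart: unless-stopped
    network_mode: host  # needs direct access to devices on the internal network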

  # API服务
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    container_name: his-api
    ports:
      - "8000:8000"
    depends_on:
      - mysql
      - redis
      - influxdb
    environment:
      REDIS_URL: redis://redis:6379
      INFLUX_URL: http://influxdb:8086
      MYSQL_URL: mysql+pymysql://his_guardian:his-guardian-2024@mysql/his_guardian
    restart: unless-stopped

  # Frontend
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    container_name: his-frontend
    ports:
      - "80:80"
    depends_on:
      - api
    restart: unless-stopped

volumes:
  influx_data:
  mysql_data:
  redis_data:
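
One caveat about the compose file above: the short form of `depends_on` only controls start order. It does not wait until InfluxDB, MySQL, or Redis actually accept connections. The sketch below shows one way the analyzer or API could wait for its dependencies at startup. It assumes those services are written in Python (the `mysql+pymysql` URL suggests so), and the name `wait_for_port` is hypothetical, not part of the project.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A completed handshake only proves the port is listening,
            # not that the service behind it has finished initializing.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


# Host ports published by the compose file (illustrative mapping).
DEPENDENCIES = {"redis": 6379, "influxdb": 8086, "mysql": 3306}


def check_dependencies(host: str = "127.0.0.1") -> dict[str, bool]:
    """Probe every dependency once and report which ones are reachable."""
    return {name: wait_for_port(host, port, timeout_s=5.0)
            for name, port in DEPENDENCIES.items()}
```

Calling `check_dependencies()` before the main loop starts gives a clear startup error instead of a connection failure deep inside the first query. Compose health checks are an alternative that keeps this logic out of application code.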

5.2 Collector configuration file

YAML
# collector/config/config.yaml

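# Monitoring targets
targets:
  his_server:
    ip: "192.168.8.249"
    ports:
      - 5522  # HIS business port
      - 1432  # database port
  gateway:
    ip: "192.168.8.254"

# Collection intervals
intervals:
  ping_collect_sec: 10        # interval between ping rounds
  port_collect_sec: 10        # interval between port-state samples
  database_collect_sec: 15    # interval between database samples (longer, to avoid querying too often)
  netcard_collect_sec: 10     # interval between NIC statistics samples

# Ping settings
ping:
  count_per_round: 5          # packets sent per round
  timeout_ms: 3000            # timeout in milliseconds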

# Database connection (used to collect SQL Server metrics)
database:
  host: "192.168.8.249"
  port: 1432
  username: "monitor_user"   # read-only monitoring account
  password: "${DB_PASSWORD}" # read from an environment variable
  database: "master"
  long_query_threshold_sec: 30

# Redis settings (data bus)
redis:
  url: "redis://localhost:6379"
  stream_key: "his_metrics_stream"
  stream_maxlen: 10000        # maximum stream length

# SNMP settings (switch collection)
snmp:
  enabled: false              # enable according to the actual environment
  community: "public"
  switch_ip: "192.168.8.1"
  oid:
    if_in_octets: "1.3.6.1.2.1.2.2.1.10"
    if_out_octets: "1.3.6.1.2.1.2.2.1.16"
    if_in_discards: "1.3.6.1.2.1.2.2.1.13"
    if_out_discards: "1.3.6.1.2.1.2.2.1.19"
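
The `if_in_octets` and `if_out_octets` OIDs above are 32-bit counters that only ever increase and then wrap to zero, so the collector has to turn two successive samples into a rate. The following is a minimal sketch of that arithmetic in Python, for illustration only (the collector itself is planned in Go). The function names are hypothetical, and it assumes at most one wrap between samples.

```python
COUNTER32_MODULUS = 2 ** 32  # ifInOctets / ifOutOctets are Counter32 values


def counter_delta(previous: int, current: int,
                  modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter samples, allowing one wrap-around."""
    return (current - previous) % modulus


def octets_to_mbps(previous: int, current: int, interval_s: float) -> float:
    """Average throughput in Mbit/s between two octet-counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return counter_delta(previous, current) * 8 / interval_s / 1_000_000


# 12,500,000 octets in 10 s is 10 Mbit/s.
rate = octets_to_mbps(1_000_000, 13_500_000, 10)

# The counter wrapped between samples: 296 octets up to the wrap, then 704 more.
wrapped = counter_delta(4_294_967_000, 704)
```

At 1 Gbit/s a 32-bit octet counter wraps roughly every 34 seconds, so a 10-second sampling interval stays within the single-wrap assumption. A device reboot also resets the counters, and this sketch would misread that reset as a wrap, so a real collector would need a separate check for it.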

6. Development Plan

text
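Weeks 1-2: Base framework
├── Environment setup (Docker, InfluxDB, MySQL, Redis)
├── Collector framework (Go)
├── Data bus (Redis Stream)
└── Basic API framework (FastAPI)

Weeks 3-4: Collection layer
├── Ping collector (with TCP fallback)
├── TCP port-state collector (5522/1432)
├── SQL Server monitoring collector
└── NIC traffic collector

Weeks 5-6: Analysis engine
├── Baseline manager
├── Risk scoring engine (core)
├── Trend analyzer
└── Alert decision engine (debounce / deduplication / escalation)

Weeks 7-8: Alert notification
├── DingTalk / WeCom bots
├── Email
├── SMS (integrated with the SMS platform)
└── Notification manager (single entry point)

Weeks 9-11: Frontend
├── Real-time dashboard
├── Historical metric trend charts
├── Alert management page
└── Configuration management page

Weeks 12-13: Integration and testing
├── Unit tests
├── Integration tests
├── Load tests (verify that collection does not affect HIS performance)
└── Scenario tests (simulate each type of failure)

Weeks 14-15: On-site deployment
├── Environment deployment
├── Baseline learning (collect 1-2 weeks of normal data)
├── Threshold tuning
└── Alert notification tests

Week 16: Delivery documentation
├── Deployment manual
├── Operations manual
├── Alert handling manual
└── Operations training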

7. Key Design Decisions

text
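Decision 1: Why write the collector in Go?
→ Go compiles to a single binary, so deployment is simple
→ goroutines are a natural fit for collecting from many targets concurrently
→ The performance impact on the target server is minimal

Decision 2: Why use Redis Stream as the bus?
→ Decouples the collector from the analysis engine
→ Supports consumer groups, so more analyzer instances can be added
→ Automatically keeps the most recent N entries, so a brief analysis-engine outage is tolerated

Decision 3: Why split storage between InfluxDB and MySQL?
→ InfluxDB: stores time-series metrics, queries them efficiently, and suits trend charts
→ MySQL: stores alert records, configuration, and baselines, where relational queries are more convenient

Decision 4: Why use baselines instead of fixed thresholds?
→ HIS deployments differ in scale from hospital to hospital, so fixed thresholds do not fit
→ Baselines can be learned automatically, which reduces false positives and missed alerts

Decision 5: Why four warning levels instead of alerting directly on the avalanche?
→ The earlier the intervention, the lower the cost
→ A level-1 warning only calls for checking the link, not for restarting services
→ Operations staff get time to act before the problem spreads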
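
Decisions 4 and 5 can be made concrete with a small sketch: learn a mean and standard deviation from a normal period, then grade a new sample by how far it sits above that baseline. This is only an illustration under stated assumptions. The function names and the cut-off values in `LEVEL_CUTOFFS` are hypothetical, and the project's actual risk scoring engine may work differently.

```python
from statistics import mean, stdev


def learn_baseline(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard deviation) of samples taken during normal operation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to learn a baseline")
    return mean(samples), stdev(samples)


def deviation(value: float, baseline: tuple[float, float]) -> float:
    """How many standard deviations `value` lies above the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        return 0.0 if value <= mu else float("inf")
    return (value - mu) / sigma


# Hypothetical cut-offs: level 1 is the earliest warning, level 4 the most severe.
LEVEL_CUTOFFS = [(3.0, 1), (5.0, 2), (8.0, 3), (12.0, 4)]


def warning_level(value: float, baseline: tuple[float, float]) -> int:
    """Map a sample to 0 (normal) or to a warning level from 1 to 4."""
    z = deviation(value, baseline)
    level = 0
    for cutoff, lvl in LEVEL_CUTOFFS:
        if z >= cutoff:
            level = lvl
    return level


# Example: connection counts on port 5522 observed during a quiet period.
baseline_5522 = learn_baseline([100, 110, 90, 105, 95])
levels = [warning_level(v, baseline_5522) for v in (105, 130, 150, 170, 200)]
```

Because the cut-offs are expressed in standard deviations rather than absolute counts, the same code gives a small clinic and a large hospital different absolute thresholds without manual tuning.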

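Core design principle: better an early alert than a missed one; handle by level, respond as needed.

The value of this system lies not in recording a failure after it happens, but in warning 15-30 minutes before it happens, giving the operations team time to act before users notice.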

 

posted on 2026-06-27 16:47  GKLBB