
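# Alerting and Notifications
This document describes the alert rules, how they become active, and the Alertmanager configuration and notification channels.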
---
## Alert rules (alert_rules.yml)
The central server ships with `central-server/config/prometheus/alert_rules.yml`, which mainly contains:

| Rule | Condition | Description |
|------|-----------|-------------|
| ONVIFDeviceDown | `up{job="onvif-devices"} == 0` for 1m | ONVIF device is offline |
| NetworkDeviceDown | `probe_success{job="network-ping"} == 0` for 2m | Network device does not respond to ping |
| HighNetworkLatency | `probe_duration_seconds{job="network-ping"} > 1` for 5m | Ping latency is too high (probe takes longer than 1 s) |
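
For orientation, the three rules in the table could be written roughly as follows in a Prometheus rule file. This is a minimal sketch, not a copy of the shipped file: the group name, the `severity` values, and the annotation texts are assumptions; only the alert names, expressions, and durations come from the table.

```yaml
groups:
  - name: edge-devices                 # hypothetical group name
    rules:
      - alert: ONVIFDeviceDown
        expr: up{job="onvif-devices"} == 0
        for: 1m
        labels:
          severity: critical           # assumed severity
        annotations:
          summary: "ONVIF device {{ $labels.instance }} is offline"

      - alert: NetworkDeviceDown
        expr: probe_success{job="network-ping"} == 0
        for: 2m
        labels:
          severity: critical           # assumed severity
        annotations:
          summary: "{{ $labels.instance }} does not respond to ping"

      - alert: HighNetworkLatency
        expr: probe_duration_seconds{job="network-ping"} > 1
        for: 5m
        labels:
          severity: warning            # assumed severity
        annotations:
          summary: "Ping to {{ $labels.instance }} takes longer than 1 s"
```
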
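
**Why rules show as Inactive**: The rules depend on metrics pushed from the edge. Only after an edge node is deployed, Ping/ONVIF targets are configured, and data reaches VictoriaMetrics via remote_write do the rules have anything to evaluate; with no data they stay inactive.

**Activation steps**: Complete the edge deployment (see [README](README.md) and [DEPLOYMENT_GUIDE](DEPLOYMENT_GUIDE.md)) → in Grafana, select the VictoriaMetrics data source and confirm that series such as `probe_success{job="network-ping"}` exist → Prometheus then reads the data from VictoriaMetrics and evaluates the rules.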
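
To make the dependency on edge data concrete, the `probe_success{job="network-ping"}` series is the kind of metric a blackbox exporter ICMP probe produces. The fragment below is a sketch of a Prometheus-compatible scrape job, which vmagent can also load. The exporter address `blackbox:9115`, the `icmp` module name, and the target IP are assumptions and are not taken from this repository's edge configuration.

```yaml
scrape_configs:
  - job_name: network-ping             # must match the job label used in the rules
    metrics_path: /probe
    params:
      module: [icmp]                   # assumes an "icmp" module in the blackbox config
    static_configs:
      - targets:
          - 192.0.2.10                 # placeholder device address
    relabel_configs:
      # Pass the device address to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the device address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself (hypothetical address)
      - target_label: __address__
        replacement: blackbox:9115
```

Until a job like this is scraping and its samples arrive centrally, queries for `probe_success` return nothing and the rules remain inactive.
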
---
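## Alertmanager configuration (alertmanager.yml)
Path: `central-server/config/alertmanager/alertmanager.yml`
- **route**: grouping (`group_by`), initial wait (`group_wait`), repeat interval (`repeat_interval`), and the default receiver (`receiver`).
- **receivers**: the current example is a webhook at `http://127.0.0.1:5001/`.
  **Note**: inside a container, 127.0.0.1 refers to the container itself. If the webhook runs on the host, change the URL to `http://host.docker.internal:5001/` or the host's IP address.
- **inhibit_rules**: a critical alert suppresses warning alerts for the same instance, which reduces alert storms.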
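
The three sections above fit together roughly as in the sketch below. The receiver name, the `group_by` labels, and all durations are illustrative assumptions rather than the values in the shipped file; the webhook URL uses the host-reachable form discussed in the note. The `source_matchers`/`target_matchers` syntax requires Alertmanager 0.22 or later.

```yaml
route:
  receiver: default-webhook            # hypothetical receiver name
  group_by: [alertname, instance]      # assumed grouping labels
  group_wait: 30s                      # wait before sending the first notification for a group
  group_interval: 5m                   # wait before notifying about new alerts in a group
  repeat_interval: 4h                  # resend interval while an alert keeps firing

receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://host.docker.internal:5001/
        send_resolved: true

inhibit_rules:
  # A firing critical alert mutes warning alerts that share the same instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [instance]
```
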
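
**Common receiver types**: `email_configs`, `wechat_configs`, `webhook_configs`. Replace the example receiver with email, WeCom (WeChat Work), DingTalk, or a self-hosted webhook as needed. Upstream Alertmanager has no built-in `dingtalk_configs`; DingTalk is normally reached through `webhook_configs` pointing at a relay service.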
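
As an illustration of swapping the receiver type, an email receiver might look like the sketch below. Every address, the SMTP host, and the credentials are placeholders, and the receiver name is hypothetical; `route.receiver` would need to reference the same name.

```yaml
receivers:
  - name: ops-email                    # hypothetical; reference it from route.receiver
    email_configs:
      - to: ops@example.com            # placeholder recipient
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
        auth_username: alertmanager@example.com
        auth_password: change-me       # placeholder; keep real secrets out of version control
        send_resolved: true
```
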

**Verification**: run `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml`. The web UI is at http://localhost:9093.
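
The verification command assumes a container named `alertmanager` with the config mounted at `/etc/alertmanager/alertmanager.yml` and port 9093 published. A Compose service along the following lines would satisfy those assumptions. It is a sketch only: the image tag, the relative volume path, and the service layout are guesses, not this repository's actual Compose file. The `extra_hosts` entry is what makes `host.docker.internal` resolve on Linux hosts (Docker 20.10 or later); Docker Desktop provides that name by default.

```yaml
services:
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager       # name used by `docker exec alertmanager ...`
    volumes:
      # Assumed relative path; mounted read-only at the image's default config location
      - ./config/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"                    # web UI at http://localhost:9093
    extra_hosts:
      - "host.docker.internal:host-gateway"
```
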
---
## Custom alert rules
Append or modify rules in `config/prometheus/alert_rules.yml`, for example:
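```yaml
# Add under the `rules:` list of an existing group
- alert: ExampleAlert
  expr: your_metric > threshold   # placeholder: substitute a real PromQL expression
  for: 5m                         # condition must hold this long before the alert fires
  labels:
    severity: warning
  annotations:
    summary: "Example alert"
```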
After editing, if Prometheus was started with `--web.enable-lifecycle`, reload the configuration with `curl -X POST http://localhost:9091/-/reload`.
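
The reload endpoint only responds when the lifecycle flag is set at startup. In a Compose deployment that could be done as sketched below. The service name, image, config path, and the `9091:9090` port mapping are assumptions, the last one inferred from the reload URL above. Note that setting `command` replaces the image's default arguments, so `--config.file` has to be repeated.

```yaml
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-lifecycle         # enables POST /-/reload and /-/quit
    ports:
      - "9091:9090"                    # assumed mapping: host 9091 to container 9090
```
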