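05-05-Prometheus and Grafana

Use kube-prometheus-stack to set up baseline observability.

Contract and source of truth

Helm values example: ansible/files/05-05/kube-prometheus-stack-values.example.yaml (see README.md in the same directory).
Manual: run the helm commands below; pass -f ansible/files/05-05/kube-prometheus-stack-values.example.yaml to helm upgrade --install.
Automated: ./ansible/bin/verify.sh run 05-05 (uses the same source of truth as the path above, so the two are easy to compare; currently a noop plus the cluster baseline).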

TL;DR

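Automated acceptance: ./ansible/bin/verify.sh run 05-05
Key prerequisites: prepare the environment variables, Secrets, and ingress IP as described in the prerequisites section of this document
Success criteria: the expected result described in this document is reached and the playbook assertions pass
Troubleshooting: see the troubleshooting section of this document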

Prerequisites

The cluster is up and running
Helm is installed
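
The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the repository: check_tools is a hypothetical helper name, and it only confirms that the named commands are on PATH.

```shell
#!/bin/sh
# Hypothetical helper (not from this repository).
# Returns 0 when every named command is on PATH; otherwise lists
# the missing ones on stderr and returns 1.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (needs a reachable cluster):
#   check_tools kubectl helm && kubectl get nodes
```

If check_tools succeeds, kubectl get nodes showing every node as Ready is a reasonable sign that the cluster is running normally.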

Steps

Add the repository and update the index
Create the monitoring namespace
Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack -n monitoring \
  -f ansible/files/05-05/kube-prometheus-stack-values.example.yaml
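
The install step can also be wrapped so that it is safe to re-run. This is a sketch under stated assumptions: install_stack and the HELM override variable are hypothetical names, and the 10m timeout is an arbitrary choice. It uses helm's --create-namespace flag in place of the separate kubectl create namespace call, and --wait so the command returns only after the release's resources are ready.

```shell
#!/bin/sh
# HELM can be overridden, e.g. HELM=echo to print the command instead of running it.
HELM="${HELM:-helm}"

# Hypothetical wrapper: installs or upgrades the release named "monitoring".
# $1 is the path to the values file.
install_stack() {
  values_file="$1"
  "$HELM" upgrade --install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --wait --timeout 10m \
    -f "$values_file"
}

# Example usage:
#   install_stack ansible/files/05-05/kube-prometheus-stack-values.example.yaml
```

Folding namespace creation into helm avoids the AlreadyExists error that kubectl create namespace returns on a second run.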

Verification commands

kubectl -n monitoring get pods
kubectl -n monitoring get svc

Expected result

Core component Pods are in the Running state
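
The expectation above can be checked mechanically. The sketch below assumes the default kubectl get pods column order (NAME, READY, STATUS, ...); not_ready_pods is a hypothetical helper name. Treating Completed as acceptable is also an assumption, made because short-lived job Pods may finish rather than stay Running.

```shell
#!/bin/sh
# Hypothetical filter: reads `kubectl get pods --no-headers` output on stdin
# and prints "NAME STATUS" for every Pod that is neither Running nor Completed.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Example usage (needs a reachable cluster):
#   kubectl -n monitoring get pods --no-headers | not_ready_pods
```

Empty output means no Pod is in an unexpected state. Note that a Pod can report Running while its containers are not yet ready, so the READY column is still worth a look.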

Next step

06-02-运维小结.md

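Troubleshooting

Start with the playbook output: when a run fails, first identify which step failed: deploy, wait, or http_check.
Cluster overview: kubectl get nodes -o wide, kubectl -n kube-system get pods -o wide.
Events and logs: kubectl -n <ns> describe ..., kubectl -n <ns> logs ... --tail=200.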
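
The event and log commands above can be bundled for a single Pod. This is a sketch only: diagnose_pod and the KUBECTL override variable are hypothetical names, and the namespace and Pod name are supplied by the caller.

```shell
#!/bin/sh
# KUBECTL can be overridden, e.g. KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: $1 is the namespace, $2 is the Pod name.
diagnose_pod() {
  ns="$1"
  pod="$2"
  # Spec, status, and recent events for the Pod
  "$KUBECTL" describe pod "$pod" --namespace "$ns"
  # Last 200 log lines from every container in the Pod
  "$KUBECTL" logs "$pod" --namespace "$ns" --all-containers --tail=200
  # Namespace events, oldest first
  "$KUBECTL" get events --namespace "$ns" --sort-by=.lastTimestamp
}

# Example usage:
#   diagnose_pod monitoring <pod-name>
```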
