# entrypath diagnostic script
`entrypath.sh` troubleshoots problems along the full path `client -> worker:80 -> kube-proxy DNAT -> Traefik Pod`.
## Commands
```bash
./scripts/diag/entrypath/entrypath.sh <command> [options]
```
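- `run`: full check (default)
- `preflight`: checks only the dependencies and the parameter environment
- `capture`: forces packet capture and trace on, then executes `run`
- `analyze --log <path>`: analyzes a log offline
## Key options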
- `--worker-host` / `--client-host`
- `--worker-ssh-key` / `--client-ssh-key`
- `--client-ip` / `--lb-ip`
- `--remote-check y|n`
- `--capture-mode y|n`
- `--nft-trace-mode y|n`
- `--return-trace-mode y|n`
- `--pod-netns-trace-mode y|n`
- `--non-interactive`
## Logs
- Running as root: `/root/netpol-diag-logs/entrypath-*.log`
- Running as non-root: `~/netpol-diag-logs/entrypath-*.log`
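
To make the path rule above concrete, here is a minimal sketch that resolves the log directory for a given UID and picks the most recently modified log. The helper names `entrypath_log_dir` and `latest_entrypath_log` are hypothetical and are not part of `entrypath.sh`; the sketch only assumes the two directories listed above.

```bash
# Hypothetical helpers (not provided by entrypath.sh).

entrypath_log_dir() {
  # $1: UID to evaluate; defaults to the current user.
  local uid="${1:-$(id -u)}"
  if [ "$uid" -eq 0 ]; then
    printf '%s\n' "/root/netpol-diag-logs"
  else
    printf '%s\n' "$HOME/netpol-diag-logs"
  fi
}

latest_entrypath_log() {
  # $1: directory to search.
  # Prints the entrypath-*.log file with the newest modification time.
  local dir="$1" f newest=""
  for f in "$dir"/entrypath-*.log; do
    [ -e "$f" ] || continue          # glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  if [ -n "$newest" ]; then
    printf '%s\n' "$newest"
  else
    return 1                         # no log found
  fi
}
```

The result can then be passed to the offline analyzer, for example `./scripts/diag/entrypath/entrypath.sh analyze --log "$(latest_entrypath_log "$(entrypath_log_dir)")"`.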
## Typical usage
### 1) Preflight check
```bash
./scripts/diag/entrypath/entrypath.sh preflight --non-interactive
```
### 2) Full-featured online diagnosis (example using the default values)
```bash
./scripts/diag/entrypath/entrypath.sh run \
--worker-host root@192.168.2.62 \
--client-host root@192.168.2.63 \
--worker-ssh-key ~/.ssh/id_ed25519_k3s_diag_worker \
--client-ssh-key ~/.ssh/id_ed25519_k3s_diag_client \
--client-ip 192.168.2.63 \
--lb-ip 192.168.2.62 \
--remote-check y \
--capture-mode y \
--capture-seconds 15 \
--nft-trace-mode y \
--nft-trace-seconds 10 \
--return-trace-mode y \
--return-trace-seconds 12 \
--pod-netns-trace-mode y \
--pod-netns-trace-seconds 12 \
--non-interactive
```
### 3) Offline log analysis
```bash
./scripts/diag/entrypath/entrypath.sh analyze \
--log ~/netpol-diag-logs/entrypath-20260310-195812.log
```
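## Common pitfalls and fixes
### 1) `62:80` is unreachable even though the worker has already DNATed to Traefik
If the log shows all of the following at the same time:
- `nft 观测到 KUBE-EXT DNAT: yes` (nft observed the KUBE-EXT DNAT)
- `ylc61(any) SYN/SYN-ACK: N/0`
- `filter_FORWARD_POLICIES ... reject with icmpx admin-prohibited`

the usual cause is that the firewalld forwarding policy on `ylc61` is blocking `flannel.1 -> cni0`.
Fix (recommended):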
```bash
sudo firewall-cmd --zone=trusted --add-interface=flannel.1
sudo firewall-cmd --zone=trusted --add-interface=cni0
sudo firewall-cmd --permanent --zone=trusted --add-interface=flannel.1
sudo firewall-cmd --permanent --zone=trusted --add-interface=cni0
sudo firewall-cmd --reload
```
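
After applying the fix, it can help to confirm that both interfaces really ended up in the `trusted` zone. The following is a minimal sketch: `missing_trusted_ifaces` is a hypothetical helper (not part of `entrypath.sh`) that takes the space-separated interface list printed by `firewall-cmd --zone=trusted --list-interfaces` and reports which of the two interfaces are still missing.

```bash
# Hypothetical helper (not provided by entrypath.sh).
missing_trusted_ifaces() {
  # $1: space-separated interface list, as printed by
  #     `firewall-cmd --zone=trusted --list-interfaces`
  local have=" $1 " want missing=""
  for want in flannel.1 cni0; do
    case "$have" in
      *" $want "*) ;;                      # already in the trusted zone
      *) missing="$missing $want" ;;       # still missing
    esac
  done
  printf '%s\n' "${missing# }"             # empty output means nothing is missing
}

# Usage on the worker (requires firewalld):
#   missing_trusted_ifaces "$(sudo firewall-cmd --zone=trusted --list-interfaces)"
#   missing_trusted_ifaces "$(sudo firewall-cmd --permanent --zone=trusted --list-interfaces)"
```

Checking both the runtime and the permanent configuration mirrors the fix above, which sets each of them separately.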
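### 2) Is `Worker CNI hostport DNAT 计数未增长` (counter did not increase) a fault?
Not necessarily. If the nft trace clearly shows the packet taking `KUBE-EXT -> KUBE-SVC -> KUBE-SEP`, a CNI hostport counter that does not increase is a normal path difference and should not be treated as the root cause.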
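
The path check described above can be sketched as a small filter over trace text. This is an illustration only: `trace_uses_kube_proxy_path` is a hypothetical helper, and it assumes nothing about the trace format beyond the three chain-name prefixes appearing somewhere in the text.

```bash
# Hypothetical helper (not provided by entrypath.sh).
trace_uses_kube_proxy_path() {
  # Reads trace text on stdin. Succeeds only if KUBE-EXT, then KUBE-SVC,
  # then KUBE-SEP are seen in that order (on successive lines, or together
  # on a single line).
  awk '
    stage == 0 && /KUBE-EXT/ { stage = 1 }
    stage == 1 && /KUBE-SVC/ { stage = 2 }
    stage == 2 && /KUBE-SEP/ { stage = 3 }
    END { exit (stage == 3 ? 0 : 1) }
  '
}

# Usage (the log path is whatever the run produced):
#   trace_uses_kube_proxy_path < "$log" && echo "kube-proxy path taken; hostport counter is not relevant"
```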
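### 3) Success criteria
At least one of the following groups must hold:
- The client's request to `http://<lb-ip>:80` returns `404/200/...` (anything other than a connection failure)
- In the automatic analysis:
  - `ylc62(ens18) SYN/SYN-ACK`: `N/N`
  - `ylc61(any) SYN/SYN-ACK`: `N/N`
  - `ylc61(cni0) SYN/SYN-ACK`: `N/N`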
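
The first criterion can be checked by hand with `curl`. A minimal sketch follows; `lb_http_ok` is a hypothetical helper (not part of `entrypath.sh`) that relies on curl's `%{http_code}` write-out variable, which prints `000` when no HTTP response was received.

```bash
# Hypothetical helper (not provided by entrypath.sh).
lb_http_ok() {
  # $1: status code as printed by curl's %{http_code}.
  # Any real HTTP status (including 404) counts as success; 000 means
  # curl never received an HTTP response, i.e. a connection failure.
  case "$1" in
    000|"") return 1 ;;
    [1-5][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage from the client, with the --lb-ip value from the example above:
#   code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://192.168.2.62:80")
#   lb_http_ok "$code" && echo "entry path reachable (HTTP $code)"
```

Treating `404` as success is deliberate: it shows that the request reached Traefik, which is all this criterion asks for.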
## Modules
- `lib/common.sh`: shared utilities and parameter defaults
- `lib/k8s_checks.sh`: local K8s baseline sampling
- `lib/remote_checks.sh`: remote worker sampling and retesting
- `lib/capture.sh`: tcpdump / nft / conntrack / pod netns
- `lib/analyze.sh`: live and offline analysis