Cilium Native HostGateway 模式使用说明
模式介绍
项目文档:https://docs.cilium.io/en/stable/network/concepts/routing/#native-routing
Native Routing 原生路由可以简单理解为 Host Gateway 模式;注意它与 Cilium 中的 Host-Routing 是两个不同的概念。
在 Native 原生路由模式下,Cilium 会把发往非本机 Pod 的数据包交给 Linux 内核路由处理,也就是说,这个包会被当成本机发出的包来转发。正因如此,集群节点之间的底层网络必须具备路由 Pod IP 网段的能力。所以使用 Native 模式时,网络必须满足以下要求:
- Cilium 节点之间的网络必须能转发 Pod IP;
- Cilium 节点的 Linux 内核必须知道怎么把包转给其他节点的 Pod。这可以通过两种方式实现:
  - 节点本身不知道如何路由 Pod IP,但路由器知道。这种情况下,把流量都交给路由器即可。参见 Google Cloud、AWS ENI 和 Azure IPAM;
  - 所有节点都学习到 Pod IP 并将相应的路由信息添加到内核路由表中:
    - 如果所有节点处于同一个 L2 网络,则可以通过启用 auto-direct-node-routes: true 实现(见下方验证示例);
    - 否则,需要运行额外的系统组件(例如 BGP 守护进程)来分发路由:
      - kube-router 与 BIRD 方式在 1.16 版本后已弃用;
      - BGP Control Plane 方式在新版本转正,它是 Cilium 内置功能,无需安装第三方程序。
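开启 autoDirectNodeRoutes=true 后,可以在任一节点上检查是否已经下发了指向对端节点 PodCIDR 的直连路由,作为一个简单的验证思路(以下命令基于本文后面的 kind 环境,节点名与网段以实际环境为准):
docker exec -it cilium-kubeproxy-control-plane ip route show | grep '10.244.1.0/24'
# 预期输出类似:10.244.1.0/24 via 172.18.0.2 dev eth0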


简单来说,虽然本文环境中 Cilium 没有替代集群的 kube-proxy,但 Pod 之间对 ClusterIP 的访问仍然由 Cilium 在 tc 层处理(对照分层图:Ingress 入口方向自下而上,Egress 出口方向自上而下)。
(此处原文为一张 Cilium 数据路径网络分层图,即后文所说的"文章开头的网络分层图",图片及出处链接略。)
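也可以在节点上用 bpftool 粗略确认 Cilium 在各网卡 tc/tcx 钩子上挂载的 eBPF 程序(本文后面也会用到这个命令,程序名以实际输出为准):
docker exec -it cilium-kubeproxy-control-plane bpftool net show | grep -E 'lxc|cilium_host|eth0'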
部署流程
通过 Kind 快速生成集群并部署 Cilium Native 模式
Cilium Helm Chart 中 ipam 不同值的作用参考官网文档
#!/bin/bash
set -v
# 1. Prepare NoCNI kubernetes environment
cat <<EOF | HTTP_PROXY= HTTPS_PROXY= http_proxy= https_proxy= kind create cluster --name=cilium-kubeproxy --image=kindest/node:v1.27.3 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
nodes:
- role: control-plane
- role: worker
EOF
# 2. Remove kubernetes node taints
controller_node_ip=`kubectl get node -o wide --no-headers | grep -E "control-plane|bpf1" | awk -F " " '{print $6}'`
kubectl taint nodes $(kubectl get nodes -o name | grep control-plane) node-role.kubernetes.io/control-plane:NoSchedule-
# 3. Install CNI[Cilium 1.17.15]
cilium_version=v1.17.15
docker pull quay.io/cilium/cilium:$cilium_version && docker pull quay.io/cilium/operator-generic:$cilium_version
kind load docker-image quay.io/cilium/cilium:$cilium_version quay.io/cilium/operator-generic:$cilium_version --name cilium-kubeproxy
{ helm repo add cilium https://helm.cilium.io ; helm repo update; } > /dev/null 2>&1
# Direct Routing Options(--set routingMode=native --set autoDirectNodeRoutes=true --set ipv4NativeRoutingCIDR="10.0.0.0/8")
# ipam 不同值的作用参考官网文档:
# https://docs.cilium.io/en/stable/network/concepts/ipam/
# ipv4NativeRoutingCIDR 对应的是 k8s 集群部署时 --pod-network-cidr 的值
helm install cilium cilium/cilium \
--set k8sServiceHost=$controller_node_ip \
--set k8sServicePort=6443 \
--version 1.17.15 \
--namespace kube-system \
--set image.pullPolicy=IfNotPresent \
--set debug.enabled=true \
--set debug.verbose="datapath flow kvstore envoy policy" \
--set bpf.monitorAggregation=none \
--set monitor.enabled=true \
--set ipam.mode=kubernetes \
--set cluster.name=cilium-kubeproxy \
--set routingMode=native \
--set kubeProxyReplacement=false \
--set autoDirectNodeRoutes=true \
--set ipv4NativeRoutingCIDR="10.0.0.0/8"
# 4. Verify separate cgroup namespace and cgroup v2 [https://github.com/cilium/cilium/pull/16259 && https://docs.cilium.io/en/stable/installation/kind/#install-cilium]
#for container in $(docker ps -a --format "table {{.Names}}" | grep cilium-kubeproxy);do docker exec $container ls -al /proc/self/ns/cgroup;done
#mount -l | grep cgroup && docker info | grep "Cgroup Version" | awk '$1=$1'
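安装完成后,可以先确认 Cilium DaemonSet 已经全部就绪再继续后面的实验(示例命令,label 为 Cilium Chart 默认的 k8s-app=cilium):
kubectl -n kube-system rollout status ds/cilium --timeout=300s
kubectl -n kube-system get pods -l k8s-app=cilium -o wide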
创建测试 Pod
本质上是 Nginx,用于后续抓包请求测试、iptables 规则查询
#!/bin/bash
controller_node_name=`kubectl get nodes -o wide | grep control-plane | awk -F " " '{print $1}'`
worker_node_name=`kubectl get nodes -o wide | awk -F " " '{print $1}' | grep 'worker$'`
# client pod and service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: client
  name: client
spec:
  containers:
  - name: client
    image: burlyluo/nettool:9494
    imagePullPolicy: Always
  restartPolicy: Always
  nodeName: ${controller_node_name}
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    run: client
  name: clientsvc
spec:
  type: NodePort
  clusterIP: 10.96.94.94
  ports:
  - port: 9494
    protocol: TCP
    targetPort: 9494
    nodePort: 30494
  selector:
    run: client
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: server
  name: server
spec:
  containers:
  - name: server
    image: burlyluo/nettool:9494
    imagePullPolicy: Always
  restartPolicy: Always
  nodeName: ${worker_node_name}
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    run: server
  name: serversvc
spec:
  type: NodePort
  clusterIP: 10.96.94.95
  ports:
  - port: 9494
    protocol: TCP
    targetPort: 9494
    nodePort: 30495
  selector:
    run: server
EOF
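创建完成后,可以先等待 Pod 就绪并在集群内做一次快速连通性自检(命令仅为示例,Service 名称与端口对应上面的清单):
kubectl wait --for=condition=Ready pod/client pod/server --timeout=120s
kubectl exec client -- curl -s serversvc:9494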
查看部署结果
root@network-demo:~# kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default client 1/1 Running 0 3m20s 10.244.0.49 cilium-kubeproxy-control-plane
default server 1/1 Running 0 3m20s 10.244.1.124 cilium-kubeproxy-worker
kube-system cilium-2bkqk 2/2 Running 0 9m29s 172.18.0.2 cilium-kubeproxy-worker
kube-system cilium-envoy-8qz9j 1/1 Running 0 9m29s 172.18.0.3 cilium-kubeproxy-control-plane
kube-system cilium-envoy-vb8n4 1/1 Running 0 9m29s 172.18.0.2 cilium-kubeproxy-worker
kube-system cilium-operator-86bc7ff44-627mg 1/1 Running 0 9m29s 172.18.0.2 cilium-kubeproxy-worker
kube-system cilium-operator-86bc7ff44-dg5c6 1/1 Running 0 9m29s 172.18.0.3 cilium-kubeproxy-control-plane
kube-system cilium-tjsld 2/2 Running 0 9m29s 172.18.0.3 cilium-kubeproxy-control-plane
kube-system coredns-5d78c9869d-28zfr 1/1 Running 0 12m 10.244.0.165 cilium-kubeproxy-control-plane
kube-system coredns-5d78c9869d-zpsfh 1/1 Running 0 12m 10.244.0.224 cilium-kubeproxy-control-plane
kube-system etcd-cilium-kubeproxy-control-plane 1/1 Running 0 12m 172.18.0.3 cilium-kubeproxy-control-plane
kube-system kube-apiserver-cilium-kubeproxy 1/1 Running 0 12m 172.18.0.3 cilium-kubeproxy-control-plane
kube-system kube-controller-manager-cilium-kubeproxy 1/1 Running 0 12m 172.18.0.3 cilium-kubeproxy-control-plane
kube-system kube-proxy-5fqdk 1/1 Running 0 12m 172.18.0.3 cilium-kubeproxy-control-plane
kube-system kube-proxy-d26xm 1/1 Running 0 11m 172.18.0.2 cilium-kubeproxy-worker
kube-system kube-scheduler-cilium-kubeproxy 1/1 Running 0 12m 172.18.0.3 cilium-kubeproxy-control-plane
验证效果
查看 Cilium 详细信息
1.查询 Cilium 运行状态
root@network-demo:~# kubectl exec -it -n kube-system cilium-tjsld -- cilium status
KVStore: Disabled
Kubernetes: Ok 1.27 (v1.27.3) [linux/amd64]
Kubernetes APIs: ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
## Cilium 未替代 kube-proxy
KubeProxyReplacement: False
Host firewall: Disabled
SRv6: Disabled
CNI Chaining: none
CNI Config file: successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
Cilium: Ok 1.17.15 (v1.17.15-4206eaa5)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 6/254 allocated from 10.244.0.0/24,
IPv4 BIG TCP: Disabled
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
## 使用 Native 原生路由模式
Routing: Network: Native Host: Legacy
Attach Mode: TCX
Device Mode: veth
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status: 41/41 healthy
Proxy Status: OK, ip 10.244.0.26, 0 redirects active on ports 10000-20000, Envoy: external
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 4095/4095 (100.00%), Flows/s: 58.42 Metrics: Disabled
Encryption: Disabled
Cluster health: 2/2 reachable (2026-05-03T03:11:57Z)
Name IP Node Endpoints
Modules Health: Stopped(0) Degraded(0) OK(60)
2.查询 Cilium Endpoint 信息
root@network-demo:~# kubectl exec -it -n kube-system cilium-tjsld -- cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv4 STATUS
ENFORCEMENT ENFORCEMENT
44 Disabled Disabled 1 k8s:node-role.kubernetes.io/control-plane ready
k8s:node.kubernetes.io/exclude-from-external-load-balancers
reserved:host
224 Disabled Disabled 4 reserved:health 10.244.0.22 ready
786 Disabled Disabled 59785 k8s:app=local-path-provisioner 10.244.0.213 ready
2112 Disabled Disabled 6602 k8s:io.cilium.k8s.namespace/metadata.name=kube-system 10.244.0.224 ready
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy
k8s:io.cilium.k8s.policy.serviceaccount=coredns
k8s:io.kubernetes.pod.namespace=kube-system
k8s:k8s-app=kube-dns
2119 Disabled Disabled 6602 k8s:io.cilium.k8s.namespace/metadata.name=kube-system 10.244.0.165 ready
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy
k8s:io.cilium.k8s.policy.serviceaccount=coredns
k8s:io.kubernetes.pod.namespace=kube-system
k8s:k8s-app=kube-dns
2234 Disabled Disabled 17248 k8s:io.cilium.k8s.namespace/metadata.name=default 10.244.0.49 ready
k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=default
k8s:run=client
3. 查询 Cilium Service 信息
kubeProxyReplacement=false时,Cilium 仅启用对 ClusterIP 的集群内负载均衡。
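这一配置也可以直接从 cilium-config ConfigMap 中再确认一次(键名 kube-proxy-replacement 以实际版本为准):
kubectl -n kube-system get configmap cilium-config -o jsonpath='{.data.kube-proxy-replacement}'
# 预期输出:false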
root@network-demo:~# kubectl exec -it -n kube-system cilium-tjsld -- cilium service list
ID Frontend Service Type Backend
1 10.96.0.1:443/TCP ClusterIP 1 => 172.18.0.3:6443/TCP (active)
2 10.96.238.242:443/TCP ClusterIP 1 => 172.18.0.3:4244/TCP (active)
3 10.96.0.10:53/UDP ClusterIP 1 => 10.244.0.224:53/UDP (active)
2 => 10.244.0.165:53/UDP (active)
4 10.96.0.10:53/TCP ClusterIP 1 => 10.244.0.224:53/TCP (active)
2 => 10.244.0.165:53/TCP (active)
5 10.96.0.10:9153/TCP ClusterIP 1 => 10.244.0.224:9153/TCP (active)
2 => 10.244.0.165:9153/TCP (active)
6 10.96.94.94:9494/TCP ClusterIP 1 => 10.244.0.49:9494/TCP (active)
7 10.96.94.95:9494/TCP ClusterIP 1 => 10.244.1.124:9494/TCP (active)
查询 iptables 规则
1.查询部署后 Cilium 使用的 iptables 规则
当前未请求测试 Pod,可以看到对应的 svc 与 pod 规则 pkts 命中次数均为 0
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL PREROUTING
Chain PREROUTING (policy ACCEPT 259 packets, 14196 bytes)
pkts bytes target prot opt in out source destination
58 3104 CILIUM_PRE_nat all -- * * 0.0.0.0/0 0.0.0.0/0 /* cilium-feeder: CILIUM_PRE_nat */
262 14426 KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
2 170 DOCKER_OUTPUT all -- * * 0.0.0.0/0 172.18.0.1
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SERVICES | grep -E "NODEPORTS|SWRT7AX63WBUEU6W|KCJY4KYQ6LVWE356"
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SVC-KCJY4KYQ6LVWE356 tcp -- * * 0.0.0.0/0 10.96.94.94 /* default/clientsvc cluster IP */ tcp dpt:9494
0 0 KUBE-SVC-SWRT7AX63WBUEU6W tcp -- * * 0.0.0.0/0 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
1760 105K KUBE-NODEPORTS all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SVC-SWRT7AX63WBUEU6W
Chain KUBE-SVC-SWRT7AX63WBUEU6W (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
0 0 KUBE-SEP-4EXVZ54ZVZAC4Q5C all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc -> 10.244.1.124:9494 */
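如果希望后续请求的命中次数从 0 开始、便于前后对比,也可以先把 nat 表的计数器清零(可选步骤):
iptables -t nat -Z
iptables -t nat -nvL KUBE-SVC-SWRT7AX63WBUEU6W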
2.通过 k8s node 访问 svc 后查询 iptables 规则
2.1.在 node 通过 Cluster IP + Port 方式访问
请求两次后,serversvc 相关规则的命中次数增加了 2 次。
root@cilium-kubeproxy-control-plane:/# curl -s 10.96.94.95:9494
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# curl -s 10.96.94.95:9494
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SERVICES | grep 'KUBE-SVC-SWRT7AX63WBUEU6W'
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-SVC-SWRT7AX63WBUEU6W tcp -- * * 0.0.0.0/0 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SVC-SWRT7AX63WBUEU6W
Chain KUBE-SVC-SWRT7AX63WBUEU6W (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
2 120 KUBE-SEP-4EXVZ54ZVZAC4Q5C all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc -> 10.244.1.124:9494 */
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SEP-4EXVZ54ZVZAC4Q5C
Chain KUBE-SEP-4EXVZ54ZVZAC4Q5C (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.244.1.124 0.0.0.0/0 /* default/serversvc */
2 120 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc */ tcp to:10.244.1.124:9494
2.2.在 node 通过 Node IP + NodePort 方式访问
此时只有 svc --> pod(KUBE-SEP)规则的命中次数增加了 2 次,KUBE-SERVICES 中匹配 ClusterIP 的那条规则计数没有变化
root@cilium-kubeproxy-control-plane:/# curl -s 172.18.0.3:30495
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# curl -s 172.18.0.3:30495
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SERVICES | grep 'KUBE-SVC-SWRT7AX63WBUEU6W'
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-SVC-SWRT7AX63WBUEU6W tcp -- * * 0.0.0.0/0 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SVC-SWRT7AX63WBUEU6W
Chain KUBE-SVC-SWRT7AX63WBUEU6W (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
4 240 KUBE-SEP-4EXVZ54ZVZAC4Q5C all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc -> 10.244.1.124:9494 */
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SEP-4EXVZ54ZVZAC4Q5C
Chain KUBE-SEP-4EXVZ54ZVZAC4Q5C (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.244.1.124 0.0.0.0/0 /* default/serversvc */
4 240 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc */ tcp to:10.244.1.124:9494
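NodePort 请求不会命中 KUBE-SERVICES 中匹配 ClusterIP 的那条规则,而是经由 KUBE-NODEPORTS 链进入,可以顺带确认该链的计数也在增长:
iptables -t nat -nvL KUBE-NODEPORTS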
3.通过 k8s pod 访问 svc 后查询 iptables 规则
3.1.在 Client Pod 通过 Node IP + NodePort 方式访问
本节点 iptables 命中次数依然增加
💡 请求其他 Node IP 时应该查看对应节点的 iptables 规则,否则命中次数自然不会增加 😂。这一点可以从 KUBE-SERVICES 链中 KUBE-NODEPORTS 目标规则上的注释 NOTE: this must be the last rule in this chain 以及 ADDRTYPE match dst-type LOCAL 匹配条件看出:只有目的地址是本机 IP 时,才会进入 NodePort 处理链。
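例如,若请求的是 worker 节点的 Node IP,则应到 worker 节点上查看对应链的计数(kind 环境下容器名以实际为准):
docker exec -it cilium-kubeproxy-worker iptables -t nat -nvL KUBE-NODEPORTS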
root@cilium-kubeproxy-control-plane:/# kubectl exec -it client -- curl -s 172.18.0.3:30495
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# kubectl exec -it client -- curl -s 172.18.0.3:30495
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-SVC-SWRT7AX63WBUEU6W tcp -- * * 0.0.0.0/0 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SVC-SWRT7AX63WBUEU6W
Chain KUBE-SVC-SWRT7AX63WBUEU6W (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
6 360 KUBE-SEP-4EXVZ54ZVZAC4Q5C all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc -> 10.244.1.124:9494 */
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SEP-4EXVZ54ZVZAC4Q5C
Chain KUBE-SEP-4EXVZ54ZVZAC4Q5C (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.244.1.124 0.0.0.0/0 /* default/serversvc */
6 360 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc */ tcp to:10.244.1.124:9494
3.2.在 Client Pod 通过 Cluster IP + Port 方式访问
此时会发现,无论访问多少次,iptables 命中数都不会增加,这就是官网文档最后提到的如果不使用 Cilium 代替 kube-proxy,则只会启用 ClusterIP services 的负载:
By default, Helm sets kubeProxyReplacement=false, which only enables per-packet in-cluster load-balancing of ClusterIP services.
root@network-demo:~# kubectl get svc serversvc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
serversvc NodePort 10.96.94.95 <none> 9494:30495/TCP 139m
root@network-demo:~# kubectl exec -it client -- curl -s 10.96.94.95:9494
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@network-demo:~# kubectl exec -it client -- curl -s 10.96.94.95:9494
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-SVC-SWRT7AX63WBUEU6W tcp -- * * 0.0.0.0/0 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SVC-SWRT7AX63WBUEU6W
Chain KUBE-SVC-SWRT7AX63WBUEU6W (2 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.94.95 /* default/serversvc cluster IP */ tcp dpt:9494
6 360 KUBE-SEP-4EXVZ54ZVZAC4Q5C all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc -> 10.244.1.124:9494 */
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SEP-4EXVZ54ZVZAC4Q5C
Chain KUBE-SEP-4EXVZ54ZVZAC4Q5C (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.244.1.124 0.0.0.0/0 /* default/serversvc */
6 360 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/serversvc */ tcp to:10.244.1.124:9494
Pod 网卡处抓包
root@network-demo:~# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
client 1/1 Running 0 145m 10.244.0.49 cilium-kubeproxy-control-plane
server 1/1 Running 0 145m 10.244.1.124 cilium-kubeproxy-worker
1.请求 NodePort 查看效果
可以看出,目的地址(Node IP:NodePort)在 Pod 网卡处并没有被 Cilium 更改,与常规 CNI 的表现一样,DNAT 由节点上的 iptables 完成。
root@network-demo:~# kubectl exec -it client -- curl -s 172.18.0.3:30495
PodName: server | PodIP: eth0 10.244.1.124/32 eth0 fe80::d8d4:71ff:fe3d:9e3a/64
root@network-demo:~# kubectl exec -it client -- tcpdump -pnei eth0
05:34:25.576890 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 74: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [S], seq 4266727014, win 64240, options [mss 1460,sackOK,TS val 1585845161 ecr 0,nop,wscale 7], length 0
05:34:25.577154 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 74: 172.18.0.3.30495 > 10.244.0.49.43868: Flags [S.], seq 2053570008, ack 4266727015, win 65160, options [mss 1460,sackOK,TS val 289666935 ecr 1585845161,nop,wscale 7], length 0
05:34:25.577166 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [.], ack 1, win 502, options [nop,nop,TS val 1585845162 ecr 289666935], length 0
05:34:25.577256 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 146: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [P.], seq 1:81, ack 1, win 502, options [nop,nop,TS val 1585845162 ecr 289666935], length 80
05:34:25.577331 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 66: 172.18.0.3.30495 > 10.244.0.49.43868: Flags [.], ack 81, win 509, options [nop,nop,TS val 289666935 ecr 1585845162], length 0
05:34:25.577498 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 302: 172.18.0.3.30495 > 10.244.0.49.43868: Flags [P.], seq 1:237, ack 81, win 509, options [nop,nop,TS val 289666935 ecr 1585845162], length 236
05:34:25.577519 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [.], ack 237, win 501, options [nop,nop,TS val 1585845162 ecr 289666935], length 0
05:34:25.577750 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 146: 172.18.0.3.30495 > 10.244.0.49.43868: Flags [P.], seq 237:317, ack 81, win 509, options [nop,nop,TS val 289666935 ecr 1585845162], length 80
05:34:25.577763 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [.], ack 317, win 501, options [nop,nop,TS val 1585845162 ecr 289666935], length 0
05:34:25.577905 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [F.], seq 81, ack 317, win 501, options [nop,nop,TS val 1585845162 ecr 289666935], length 0
05:34:25.578122 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 66: 172.18.0.3.30495 > 10.244.0.49.43868: Flags [F.], seq 317, ack 82, win 509, options [nop,nop,TS val 289666935 ecr 1585845162], length 0
05:34:25.578131 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.43868 > 172.18.0.3.30495: Flags [.], ack 318, win 501, options [nop,nop,TS val 1585845163 ecr 289666935], length 0
2.请求 ClusterIP 查看效果
需要先明确两点:
- 同一个数据包不会在同一个接口上既走 ingress 又走 egress(lo 回环接口除外),对某一个包来说,每个接口只经过一个方向;
- 实际上 lxc 和宿主机 eth0 之间是经由内核 IP 层衔接的:内核做完路由决策后,直接把包放到 eth0 的 egress 路径上,中间不存在 eth0 的 ingress 步骤。
可以发现,无论在 Pod eth0 还是其 veth pair 对端网卡 lxceaefdfbf8f50 上抓包,dst ip 都没有被 Cilium 更改为 Pod IP,仍然是 svc ClusterIP。
可以对比文章开头的网络分层图来看:DNAT 是在 Pod eth0 的 veth pair 对端,也就是 node 节点上的 lxceaefdfbf8f50 设备处完成的,这一点可以用下方代码块中的 cilium monitor 证明。而对 lxceaefdfbf8f50 来说,这是一次 Ingress 入站流量,tcpdump 的抓包点位于 tc 之前,所以看到的仍是原始的 dst ip:
- --> Netdevice/Drivers(tcpdump) --> Traffic Shaping(tc) --> IP 层查内核路由表
而到了 IP 层,看到的 dst ip 已经是被 tc 处 eBPF 程序改写后的 Pod IP。lxceaefdfbf8f50 和 eth0 位于同一个内核中,路由转发是内核内部的操作,包从 IP 层直接被放到 eth0 的 egress 路径上。可以把内核类比为一台交换机:lxceaefdfbf8f50 收到包后交给内核,内核查路由表把包送到宿主机 eth0 的 egress 出口:
- --> Netdevice/Drivers(tcpdump) --> Traffic Shaping(tc) --> IP 层查内核路由表 --> 宿主机 eth0
改写发生在 lxceaefdfbf8f50 抓包点之后,包随后不会再次经过该设备的 Netdevice/Drivers(tcpdump)抓包点,自然抓不到被 tc 更改后的 Pod IP。不过可以在宿主机 eth0 上抓包来验证 tc 的改写(见后文"Node 网卡处抓包"一节)。
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane tcpdump -pnei lxceaefdfbf8f50
05:50:26.215635 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 74: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [S], seq 1259547284, win 64240, options [mss 1460,sackOK,TS val 1736199449 ecr 0,nop,wscale 7], length 0
05:50:26.215856 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 74: 10.96.94.95.9494 > 10.244.0.49.57444: Flags [S.], seq 2548819030, ack 1259547285, win 65160, options [mss 1460,sackOK,TS val 3988848119 ecr 1736199449,nop,wscale 7], length 0
05:50:26.215872 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [.], ack 1, win 502, options [nop,nop,TS val 1736199449 ecr 3988848119], length 0
05:50:26.215955 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 146: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [P.], seq 1:81, ack 1, win 502, options [nop,nop,TS val 1736199449 ecr 3988848119], length 80
05:50:26.216071 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 66: 10.96.94.95.9494 > 10.244.0.49.57444: Flags [.], ack 81, win 509, options [nop,nop,TS val 3988848119 ecr 1736199449], length 0
05:50:26.216212 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 302: 10.96.94.95.9494 > 10.244.0.49.57444: Flags [P.], seq 1:237, ack 81, win 509, options [nop,nop,TS val 3988848120 ecr 1736199449], length 236
05:50:26.216234 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [.], ack 237, win 501, options [nop,nop,TS val 1736199450 ecr 3988848120], length 0
05:50:26.216361 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 146: 10.96.94.95.9494 > 10.244.0.49.57444: Flags [P.], seq 237:317, ack 81, win 509, options [nop,nop,TS val 3988848120 ecr 1736199450], length 80
05:50:26.216374 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [.], ack 317, win 501, options [nop,nop,TS val 1736199450 ecr 3988848120], length 0
05:50:26.216660 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [F.], seq 81, ack 317, win 501, options [nop,nop,TS val 1736199450 ecr 3988848120], length 0
05:50:26.216928 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype IPv4 (0x0800), length 66: 10.96.94.95.9494 > 10.244.0.49.57444: Flags [F.], seq 317, ack 82, win 509, options [nop,nop,TS val 3988848120 ecr 1736199450], length 0
05:50:26.216945 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype IPv4 (0x0800), length 66: 10.244.0.49.57444 > 10.96.94.95.9494: Flags [.], ack 318, win 501, options [nop,nop,TS val 1736199450 ecr 3988848120], length 0
在 Client Pod 节点对应的 Cilium Pod 中通过 cilium monitor 验证 tc BPF 效果:
tc 将 svc ip 改为 pod ip,到达 iptables 规则时自然无法命中
root@network-demo:~# kubectl exec -it -n kube-system cilium-tjsld -- cilium monitor --type trace --from 2234 -v
time="2026-05-03T06:18:55.085420215Z" level=info msg="Initializing dissection cache..." subsys=monitor
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp SYN
## 此处由 17248->unknown 变为 17248->64959,到达 iptables 时已经不是 ClusterIP 了
-> stack flow 0x2add59e2 , identity 17248->64959 state new ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp SYN
-> endpoint 2234 flow 0x90110859 , identity 64959->17248 state reply ifindex lxceaefdfbf8f50 orig-ip 10.244.1.124: 10.96.94.95:9494 -> 10.244.0.49:56798 tcp SYN, ACK
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp ACK
-> stack flow 0x2add59e2 , identity 17248->64959 state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp ACK
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp ACK
-> stack flow 0x2add59e2 , identity 17248->64959 state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp ACK
-> endpoint 2234 flow 0x90110859 , identity 64959->17248 state reply ifindex lxceaefdfbf8f50 orig-ip 10.244.1.124: 10.96.94.95:9494 -> 10.244.0.49:56798 tcp ACK
-> endpoint 2234 flow 0x90110859 , identity 64959->17248 state reply ifindex lxceaefdfbf8f50 orig-ip 10.244.1.124: 10.96.94.95:9494 -> 10.244.0.49:56798 tcp ACK
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp ACK
-> stack flow 0x2add59e2 , identity 17248->64959 state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp ACK
-> endpoint 2234 flow 0x90110859 , identity 64959->17248 state reply ifindex lxceaefdfbf8f50 orig-ip 10.244.1.124: 10.96.94.95:9494 -> 10.244.0.49:56798 tcp ACK
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp ACK
-> stack flow 0x2add59e2 , identity 17248->64959 state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp ACK
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp ACK, FIN
-> stack flow 0x2add59e2 , identity 17248->64959 state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp ACK, FIN
-> endpoint 2234 flow 0x90110859 , identity 64959->17248 state reply ifindex lxceaefdfbf8f50 orig-ip 10.244.1.124: 10.96.94.95:9494 -> 10.244.0.49:56798 tcp ACK, FIN
<- endpoint 2234 flow 0x2add59e2 , identity 17248->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.96.94.95:9494 tcp ACK
-> stack flow 0x2add59e2 , identity 17248->64959 state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.49:56798 -> 10.244.1.124:9494 tcp ACK
Node 网卡处抓包
1.Client Pod 请求 ClusterIP 时在 Node eth0 抓包
上一流程中,在 Client Pod 处请求 Server Pod ClusterIP,发现无论是 Client Pod eth0 还是 veth pair 设备抓包,看到的结果都是 Client Pod IP --> ClusterIP,此时在 Client Pod 宿主机 eth0 处抓包,就可以发现 dst ip 已经是被 tc 更改后的 Pod IP 了:
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane tcpdump -pnei eth0 host 10.244.1.124
08:36:15.371232 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 74: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [S], seq 949473642, win 64240, options [mss 1460,sackOK,TS val 1746148605 ecr 0,nop,wscale 7], length 0
08:36:15.371346 ae:d7:63:54:c6:7d > 32:29:77:be:87:c6, ethertype IPv4 (0x0800), length 74: 10.244.1.124.9494 > 10.244.0.49.40014: Flags [S.], seq 1377165028, ack 949473643, win 65160, options [mss 1460,sackOK,TS val 3998797275 ecr 1746148605,nop,wscale 7], length 0
08:36:15.371400 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 66: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [.], ack 1, win 502, options [nop,nop,TS val 1746148605 ecr 3998797275], length 0
08:36:15.371559 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 146: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [P.], seq 1:81, ack 1, win 502, options [nop,nop,TS val 1746148605 ecr 3998797275], length 80
08:36:15.371611 ae:d7:63:54:c6:7d > 32:29:77:be:87:c6, ethertype IPv4 (0x0800), length 66: 10.244.1.124.9494 > 10.244.0.49.40014: Flags [.], ack 81, win 509, options [nop,nop,TS val 3998797275 ecr 1746148605], length 0
08:36:15.371793 ae:d7:63:54:c6:7d > 32:29:77:be:87:c6, ethertype IPv4 (0x0800), length 302: 10.244.1.124.9494 > 10.244.0.49.40014: Flags [P.], seq 1:237, ack 81, win 509, options [nop,nop,TS val 3998797275 ecr 1746148605], length 236
08:36:15.371885 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 66: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [.], ack 237, win 501, options [nop,nop,TS val 1746148605 ecr 3998797275], length 0
08:36:15.371983 ae:d7:63:54:c6:7d > 32:29:77:be:87:c6, ethertype IPv4 (0x0800), length 146: 10.244.1.124.9494 > 10.244.0.49.40014: Flags [P.], seq 237:317, ack 81, win 509, options [nop,nop,TS val 3998797275 ecr 1746148605], length 80
08:36:15.372035 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 66: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [.], ack 317, win 501, options [nop,nop,TS val 1746148605 ecr 3998797275], length 0
08:36:15.372367 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 66: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [F.], seq 81, ack 317, win 501, options [nop,nop,TS val 1746148606 ecr 3998797275], length 0
08:36:15.372539 ae:d7:63:54:c6:7d > 32:29:77:be:87:c6, ethertype IPv4 (0x0800), length 66: 10.244.1.124.9494 > 10.244.0.49.40014: Flags [F.], seq 317, ack 82, win 509, options [nop,nop,TS val 3998797276 ecr 1746148606], length 0
08:36:15.372618 32:29:77:be:87:c6 > ae:d7:63:54:c6:7d, ethertype IPv4 (0x0800), length 66: 10.244.0.49.40014 > 10.244.1.124.9494: Flags [.], ack 318, win 501, options [nop,nop,TS val 1746148606 ecr 3998797276], length 0
2.Node 节点请求 ClusterIP 时在 cilium_host 处抓包
在 k8s node 上通过 curl 请求 ClusterIP。结合节点内核路由表可以看出:如果被 iptables 规则负载到了本地 Pod,那么数据包会经由本地的 cilium_host 设备转发。
同时在本地 Server Pod eth0 与 cilium_host 抓包:
root@cilium-kubeproxy-control-plane:/# curl -s 10.96.170.11
PodName: pod-1 | PodIP: eth0 10.244.0.215/32
Node cilium_host 设备处抓包,发现什么数据都没有:
root@cilium-kubeproxy-control-plane:/# ip address show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 12:19:e0:82:77:96 brd ff:ff:ff:ff:ff:ff
inet 10.244.0.26/32 scope global cilium_host
valid_lft forever preferred_lft forever
root@cilium-kubeproxy-control-plane:/# tcpdump -pnei cilium_host
## 空的
在 Server Pod eth0 设备处抓包,发现 Client IP 是 cilium_host 设备 IP:
root@cilium-kubeproxy-control-plane:/# kubectl exec -it pod-1 -- tcpdump -pnei eth0
02:59:47.275584 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 74: 10.244.0.26.28907 > 10.244.0.215.80: Flags [S], seq 3859345334, win 64240, options [mss 1460,sackOK,TS val 4144267874 ecr 0,nop,wscale 7], length 0
02:59:47.275600 f2:ee:5b:48:5c:d5 > 82:b7:7f:40:b0:59, ethertype IPv4 (0x0800), length 74: 10.244.0.215.80 > 10.244.0.26.28907: Flags [S.], seq 936865112, ack 3859345335, win 65160, options [mss 1460,sackOK,TS val 704717770 ecr 4144267874,nop,wscale 7], length 0
02:59:47.275657 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 66: 10.244.0.26.28907 > 10.244.0.215.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 4144267874 ecr 704717770], length 0
02:59:47.275732 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 142: 10.244.0.26.28907 > 10.244.0.215.80: Flags [P.], seq 1:77, ack 1, win 502, options [nop,nop,TS val 4144267874 ecr 704717770], length 76: HTTP: GET / HTTP/1.1
02:59:47.275751 f2:ee:5b:48:5c:d5 > 82:b7:7f:40:b0:59, ethertype IPv4 (0x0800), length 66: 10.244.0.215.80 > 10.244.0.26.28907: Flags [.], ack 77, win 509, options [nop,nop,TS val 704717770 ecr 4144267874], length 0
02:59:47.275859 f2:ee:5b:48:5c:d5 > 82:b7:7f:40:b0:59, ethertype IPv4 (0x0800), length 302: 10.244.0.215.80 > 10.244.0.26.28907: Flags [P.], seq 1:237, ack 77, win 509, options [nop,nop,TS val 704717770 ecr 4144267874], length 236: HTTP: HTTP/1.1 200 OK
02:59:47.275912 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 66: 10.244.0.26.28907 > 10.244.0.215.80: Flags [.], ack 237, win 501, options [nop,nop,TS val 4144267874 ecr 704717770], length 0
02:59:47.275928 f2:ee:5b:48:5c:d5 > 82:b7:7f:40:b0:59, ethertype IPv4 (0x0800), length 111: 10.244.0.215.80 > 10.244.0.26.28907: Flags [P.], seq 237:282, ack 77, win 509, options [nop,nop,TS val 704717770 ecr 4144267874], length 45: HTTP
02:59:47.275966 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 66: 10.244.0.26.28907 > 10.244.0.215.80: Flags [.], ack 282, win 501, options [nop,nop,TS val 4144267874 ecr 704717770], length 0
02:59:47.276280 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 66: 10.244.0.26.28907 > 10.244.0.215.80: Flags [F.], seq 77, ack 282, win 501, options [nop,nop,TS val 4144267875 ecr 704717770], length 0
02:59:47.276375 f2:ee:5b:48:5c:d5 > 82:b7:7f:40:b0:59, ethertype IPv4 (0x0800), length 66: 10.244.0.215.80 > 10.244.0.26.28907: Flags [F.], seq 282, ack 78, win 509, options [nop,nop,TS val 704717771 ecr 4144267875], length 0
02:59:47.276440 82:b7:7f:40:b0:59 > f2:ee:5b:48:5c:d5, ethertype IPv4 (0x0800), length 66: 10.244.0.26.28907 > 10.244.0.215.80: Flags [.], ack 283, win 501, options [nop,nop,TS val 4144267875 ecr 704717771], length 0
其实具体原因与上文在 Client Pod 处请求 ClusterIP 的逻辑差不多,从下面代码块中的路由走向可以看出:
1. 请求 ClusterIP 时,数据包会从 eth0 发出,交给网关 172.18.0.1 转发,并使用 172.18.0.3 作为源 IP;
   1.1 其实并没有真正走 eth0,对比文章开头的分层图来看,此时 iptables 已经把 ClusterIP 10.96.170.11 转换为 Pod IP 10.244.0.215 了;
2. 如果请求的 ClusterIP 被负载到了本机 Pod IP 10.244.0.215,那就从 cilium_host 网卡发出,网关和源 IP 都是 10.244.0.26,也就是 cilium_host 自身;
3. 在 ip route show 中看到通往 Pod 10.244.0.215 的是一条三层路由,而由于 10.244.0.26 就是 cilium_host 自身的 IP,内核在 route get 做路由查找时发现下一跳就在本机接口上,于是直接简化为"通过 cilium_host 发出"。
📝 重要总结:
- Client Pod 访问 ClusterIP 时,veth pair 的 lxc 设备能抓到流量,是因为抓到的是从 Client Pod 过来的 Ingress 入站流量,抓包点位于 tc 之前,所以 dst ip 仍然是 ClusterIP;
- k8s Node 访问 ClusterIP 时,cilium_host 上什么都抓不到,是因为:路由查找时发现 via 10.244.0.26 dev cilium_host 这条路由的 via IP 就是本机自己,于是省去了转发这一步,直接进入 cilium_host 的 egress 出站路径;而在这一步中,cilium_host 上挂载的 eBPF 程序已经提前改好 MAC,并用 bpf_redirect() 把包直接送进 Server Pod 的 lxc 设备,跳过了 tcpdump 的 AF_PACKET 捕获点,所以抓不到流量。
## 1.
## 因为本环境是通过 kind + docker 形式部署的,所以此处 via 是 docker 生成的 bridge 设备地址
## 正常来说应该是本机上的某个设备
root@cilium-kubeproxy-control-plane:/# ip route get 10.96.170.11
10.96.170.11 via 172.18.0.1 dev eth0 src 172.18.0.3 uid 0
cache
## 1.1
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SERVICES | grep 'KUBE-SVC-PQCIGBIECMLVBHFY'
pkts bytes target prot opt in out source destination
5 300 KUBE-SVC-PQCIGBIECMLVBHFY tcp -- * * 0.0.0.0/0 10.96.170.11 /* default/pod:http cluster IP */ tcp dpt:80
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SVC-PQCIGBIECMLVBHFY
pkts bytes target prot opt in out source destination
18 1080 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.170.11 /* default/pod:http cluster IP */ tcp dpt:80
7 420 KUBE-SEP-2ZKMHUCYGIQ55D4S all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/pod:http -> 10.244.0.215:80 */ statistic mode random probability 0.50000000000
14 840 KUBE-SEP-7D54XOF4R7QCT7KQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/pod:http -> 10.244.1.51:80 */
root@cilium-kubeproxy-control-plane:/# iptables -t nat -nvL KUBE-SEP-2ZKMHUCYGIQ55D4S
Chain KUBE-SEP-2ZKMHUCYGIQ55D4S (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.244.0.215 0.0.0.0/0 /* default/pod:http */
7 420 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/pod:http */ tcp to:10.244.0.215:80
## 2.
root@cilium-kubeproxy-control-plane:/# ip route show | grep '10.244.0.0'
10.244.0.0/24 via 10.244.0.26 dev cilium_host proto kernel src 10.244.0.26
## 3.
root@cilium-kubeproxy-control-plane:/# ip route get 10.244.0.215
10.244.0.215 dev cilium_host src 10.244.0.26 uid 0
cache
通过同节点 cilium pod 查询 monitor 信息,可以看到具体的转换过程
root@cilium-kubeproxy-control-plane:/# bpftool net show | grep 'cilium_host'
xdp:
tc:
cilium_host(5) tcx/ingress cil_to_host prog_id 5346 link_id 130
cilium_host(5) tcx/egress cil_from_host prog_id 5349 link_id 131
flow_dissector:
netfilter:
root@cilium-kubeproxy-control-plane:/# kubectl exec -it -n kube-system cilium-tjsld -- cilium monitor --type trace -v --from 1693
## 这里 identity host 已经可以看到是从主机过来的
## orig-ip 10.244.0.26
## cilium_host 通过 BPF 程序 cil_from_host redirect 到 Server Pod lxc490b726473cb 上
## 具体的转发逻辑在下一代码块中验证
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state new ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp SYN
<- endpoint 1693 flow 0x5d0bfe02 , identity 30501->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp SYN, ACK
-> stack flow 0x5d0bfe02 , identity 30501->host state reply ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp SYN, ACK
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state established ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp ACK
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state established ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp ACK
<- endpoint 1693 flow 0x5d0bfe02 , identity 30501->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK
-> stack flow 0x5d0bfe02 , identity 30501->host state reply ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK
<- endpoint 1693 flow 0x5d0bfe02 , identity 30501->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK
-> stack flow 0x5d0bfe02 , identity 30501->host state reply ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state established ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp ACK
<- endpoint 1693 flow 0x5d0bfe02 , identity 30501->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK
-> stack flow 0x5d0bfe02 , identity 30501->host state reply ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state established ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp ACK
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state established ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp ACK, FIN
<- endpoint 1693 flow 0x5d0bfe02 , identity 30501->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK, FIN
-> stack flow 0x5d0bfe02 , identity 30501->host state reply ifindex 0 orig-ip 0.0.0.0: 10.244.0.215:80 -> 10.244.0.26:37056 tcp ACK, FIN
-> endpoint 1693 flow 0xaa48e037 , identity host->30501 state established ifindex lxc490b726473cb orig-ip 10.244.0.26: 10.244.0.26:37056 -> 10.244.0.215:80 tcp ACK
上一代码块中描述提到的 redirect 并非 bpf_redirect_peer() 函数,而是 bpf_redirect():
cilium_host 通过 BPF 程序 cil_from_host redirect 到 Server Pod lxc490b726473cb 上
通过上面 bpftool net show 可以看到,cilium_host egress 上的 cil_from_host eBPF 程序在内核中的 ID 为 5349,下面以人类可读的反汇编格式转储(dump)出它的 eBPF 指令。
从输出结果来看,cil_from_host 程序中没有直接做 redirect,而是通过 bpf_tail_call 跳转到另一个 BPF 程序去做的:
## bpftool net show 输出:
cilium_host(5) tcx/egress cil_from_host prog_id 5349 link_id 131
root@cilium-kubeproxy-control-plane:/# bpftool prog dump xlated id 5349
; tail_call_static(ctx, CALLS_MAP, index);
368: (bf) r1 = r6
369: (18) r2 = map[id:1169] # ← CALLS_MAP: 1169
371: (b7) r3 = 1 # ← 跳转到 index 1 的程序
372: (85) call bpf_tail_call#12
373: (b4) w7 = 2
374: (05) goto pc+44
; tail_call_static(ctx, CALLS_MAP, index);
468: (bf) r1 = r6
469: (18) r2 = map[id:1169] # CALLS_MAP: 1169
471: (b7) r3 = 22 # 跳转到 index 22 的程序
472: (85) call bpf_tail_call#12
473: (b4) w1 = 79560960
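顺带可以确认 CALLS_MAP 本身是一个供尾调用使用的 prog_array 类型 map(map 名称以实际输出为准):
bpftool map show id 1169
# 预期输出中 type 为 prog_array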
查看 CALLS_MAP 1169 里的程序列表:
root@cilium-kubeproxy-control-plane:/# bpftool map dump id 1169
key: 01 00 00 00 value: df 14 00 00
key: 07 00 00 00 value: e0 14 00 00
key: 16 00 00 00 value: e4 14 00 00
value 按小端字节序换算为十进制即为 prog_id:e4 14 00 00 = 0x000014e4 = 1×4096 + 4×256 + 14×16 + 4 = 5348。
另外两个 value 按同样方式换算并验证后,都不是本次转发相关的程序,这里精简输出,只保留了正确的一条。
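这种小端字节序十六进制到十进制的换算,也可以直接交给 shell 计算,例如:
printf '%d\n' 0x14e4   # 5348
printf '%d\n' 0x069d   # 1693
printf '%d\n' 0x15e2   # 5602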
从查询结果来看,5348 又通过 map 1039 做了一次 tail call,跳转到具体 endpoint 的策略程序:
root@cilium-kubeproxy-control-plane:/# bpftool prog dump xlated id 5348
; tail_call(ctx, map, slot);
245: (bf) r1 = r6
246: (18) r2 = map[id:1039] # ← 策略程序 map
248: (85) call bpf_tail_call#12
249: (b4) w0 = -203
查看 map 1039(即上面注释中的策略程序 map)里的程序列表,发现其中 key 9d 06 00 00 = 0x0000069d = 6×256 + 9×16 + 13 = 1693,正好是 pod-1 的 endpoint ID:
root@cilium-kubeproxy-control-plane:/# bpftool map dump id 1039
key: 2c 00 00 00 value: e3 14 00 00
key: e0 00 00 00 value: 04 15 00 00
key: 12 03 00 00 value: 25 15 00 00
key: 9d 06 00 00 value: e2 15 00 00
key: 40 08 00 00 value: 32 15 00 00
key: 47 08 00 00 value: 1a 15 00 00
key: ba 08 00 00 value: 59 15 00 00
root@network-demo:~# kubectl exec -it -n kube-system cilium-tjsld -- cilium endpoint list | grep 1693
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv4 STATUS
ENFORCEMENT ENFORCEMENT
1693 Disabled Disabled 30501 k8s:app=nginx 10.244.0.215 ready
查询 key: 9d 06 00 00 对应的 value: e2 15 00 00 = 0x000015e2 = 1×4096 + 5×256 + 14×16 + 2 = 5602,是一个 sched_cls 类型的 BPF 程序,负责 endpoint 1693(pod-1)的策略检查和最终 bpf_redirect 转发:
root@cilium-kubeproxy-control-plane:/# bpftool prog dump xlated id 5602
; return redirect(ifindex, flags);
2005: (b4) w2 = 0
2006: (85) call bpf_redirect#12800944
root@cilium-kubeproxy-control-plane:/# bpftool prog show id 5602
5602: sched_cls name handle_policy tag 1a9b43b8794e80f1 gpl
loaded_at 2026-05-03T12:11:28+0000 uid 0
xlated 17840B jited 10602B memlock 20480B map_ids 1229,1033,1227,1041,1042,1030,1228,1038,1032,1007,1027,1028,1029,1226,1043
btf_id 1858
分析 Pod/Node 路由表、ARP 表
1.查看 Pod 路由表
从 Pod 路由表看,所有出去的流量都要经过网关 10.244.0.26,也就是 cilium_host,但实则不然:Pod eth0 与宿主机上的 lxc 设备是通过 veth pair 直连的,Pod 发出的包物理上只能走这一条链路,所谓的"网关"实际由 lxc 一侧应答。可以通过下方步骤查询 ARP 表验证:
root@network-demo:~# kubectl exec -it client -- ip route show
default via 10.244.0.26 dev eth0 mtu 1500
10.244.0.26 dev eth0 scope link
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane ip address show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 12:19:e0:82:77:96 brd ff:ff:ff:ff:ff:ff
inet 10.244.0.26/32 scope global cilium_host
valid_lft forever preferred_lft forever
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane ip -d link show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 12:19:e0:82:77:96 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
veth addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
2.查询 Pod ARP 表
可以看出,网关 IP 10.244.0.26 对应的 MAC 地址其实是 veth pair 中 lxc 设备的 MAC 地址:
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane ip address show lxceaefdfbf8f50
15: lxceaefdfbf8f50@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ea:b4:11:10:09:5b brd ff:ff:ff:ff:ff:ff link-netns cni-6afc5666-963b-3b06-7209-3a1ef6c28288
root@network-demo:~# kubectl exec -it client -- ip neighbor show
10.244.0.26 dev eth0 lladdr ea:b4:11:10:09:5b STALE
root@network-demo:~# kubectl exec -it client -- arp -n
Address HWtype HWaddress Flags Mask Iface
10.244.0.26 ether ea:b4:11:10:09:5b C eth0
重新请求一个外部 IP 以触发对网关的 ARP 请求,并同时抓包。可以看出,ARP 回复中的 MAC 就是 lxc 设备的:Pod 以为自己在跟网关 cilium_host 通信,实际对端是 lxc。
root@network-demo:~# kubectl exec -it client -- tcpdump -pnei eth0 arp
09:22:20.518104 4a:de:3b:33:f8:4b > ea:b4:11:10:09:5b, ethertype ARP (0x0806), length 42: Request who-has 10.244.0.26 tell 10.244.0.49, length 28
09:22:20.518161 ea:b4:11:10:09:5b > 4a:de:3b:33:f8:4b, ethertype ARP (0x0806), length 42: Reply 10.244.0.26 is-at ea:b4:11:10:09:5b, length 28
3.查询 Node 路由表
结合 Client Pod 请求 ClusterIP 的场景来看,在 lxc 处由 tc 的 eBPF 程序将 10.96.94.95 改为 10.244.1.124 后,再依据内核路由表从宿主机 eth0 网口发给 172.18.0.2。
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane ip route show | grep '10.244.1.0/24'
10.244.1.0/24 via 172.18.0.2 dev eth0 proto kernel
4.查询 Node ARP 表
通过 ARP 表确认对端 172.18.0.2 节点 MAC 地址
root@network-demo:~# docker exec -it cilium-kubeproxy-control-plane ip neighbor show | grep '172.18.0.2'
172.18.0.2 dev eth0 lladdr ae:d7:63:54:c6:7d REACHABLE
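也可以用 ip route get 直接验证:发往对端 Pod IP 的包会从宿主机 eth0 经 172.18.0.2 转发(示例命令,IP 以实际环境为准):
docker exec -it cilium-kubeproxy-control-plane ip route get 10.244.1.124
# 预期输出类似:10.244.1.124 via 172.18.0.2 dev eth0 src 172.18.0.3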
