Background:

After deploying metrics-server on a Kubernetes v1.23 cluster, the pod stayed at 0/1 Running. Its logs showed the following errors:

E0902 09:18:17.917559    1 scraper.go:140] "Failed to scrape node" err="Get \"https://30.xx.xx.91:10250/metrics/resource\": x509: certificate signed by unknown authority" node="30.xx.xx.91"
E0902 09:18:17.917559    1 scraper.go:140] "Failed to scrape node" err="Get \"https://30.xx.xx.96:10250/metrics/resource\": x509: certificate signed by unknown authority" node="30.xx.xx.96"
E0902 09:18:17.917559    1 scraper.go:140] "Failed to scrape node" err="Get \"https://30.xx.xx.97:10250/metrics/resource\": x509: certificate signed by unknown authority" node="30.xx.xx.97"
E0902 09:18:17.917559    1 scraper.go:140] "Failed to scrape node" err="Get \"https://30.xx.xx.221:10250/metrics/resource\": x509: certificate signed by unknown authority" node="30.xx.xx.221"
...omit...
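
To see at a glance which nodes are affected, the log can be filtered for the x509 error. The following is a minimal sketch: the function name failed_x509_nodes is made up for illustration, and the namespace and Deployment name in the usage comment are assumptions about a typical install.

```shell
#!/usr/bin/env bash
# List the nodes whose kubelet certificate metrics-server rejected.
# Reads metrics-server log text on stdin and prints one node per line.
# (failed_x509_nodes is a hypothetical helper name.)
failed_x509_nodes() {
  grep 'x509: certificate signed by unknown authority' \
    | sed -n 's/.* node="\([^"]*\)".*/\1/p' \
    | sort -u
}

# Assumed usage; adjust the namespace and Deployment name to your install:
# kubectl -n kube-system logs deploy/metrics-server | failed_x509_nodes
```

If every node in the cluster shows up, the problem is most likely the way kubelet serving certificates are issued rather than a single misconfigured node.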

Solution:

A Google search turned up the related GitHub issue:
https://github.com/kubernetes-sigs/metrics-server/issues/146
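As in most Chinese-language articles, the top-voted answer there is to add --kubelet-insecure-tls to the metrics-server startup arguments, which skips TLS verification. For security reasons I first tried the approach from the issue that creates a CSR and configures certificates instead: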
https://github.com/kubernetes-sigs/metrics-server/issues/146#issuecomment-459239615

#1. Create a Certificate Signing Request
cat <<EOF | cfssl genkey - | cfssljson -bare kubelet-server
{
  "hosts": [
    "node-name-1",
    "node-name-2",
    "...",
    "node-ip-1",
    "node-ip-2",
    "..."
  ],
  "CN": "kubelet-server",
  "key": {
    "algo": "ecdsa",
    "size": 256
  }
}
EOF

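# Note: certificates.k8s.io/v1beta1 was removed in Kubernetes v1.22; on v1.23 the
# CSR must use certificates.k8s.io/v1, which also requires spec.signerName.
cat <<EOF | kubectl create -f -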
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: kubelet-server
spec:
  groups:
  - system:nodes
  - system:authenticated
  request: $(cat kubelet-server.csr | base64 | tr -d '\n')
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF
#2. kubectl describe csr kubelet-server
#3. kubectl certificate approve kubelet-server
#4. kubectl get csr kubelet-server -o jsonpath='{.status.certificate}' | base64 --decode > kubelet-server.pem
#5. Copy to all nodes in /var/lib/kubelet/pki/
#6. Add to kubelet config.yaml tls cert & key
# see: https://godoc.org/k8s.io/kubernetes/pkg/kubelet/apis/config#KubeletConfiguration
echo "tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet-server-key.pem" >> /var/lib/kubelet/config.yaml
echo "tlsCertFile: /var/lib/kubelet/pki/kubelet-server.pem" >> /var/lib/kubelet/config.yaml
#7. restart kubelet
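
The hosts list in step 1 has to contain every node name and node IP, which is tedious to type by hand. As a sketch only, the request JSON could be generated from "NAME IP" pairs; csr_request_json is a hypothetical helper, and the kubectl jsonpath expression in the usage comment assumes each node reports an InternalIP address.

```shell
#!/usr/bin/env bash
# Build the cfssl request JSON from "NAME IP" pairs read on stdin.
# (csr_request_json is a hypothetical helper; CN and key settings are
# copied from the steps above.)
csr_request_json() {
  awk '
    NF >= 2 { hosts[++n] = $1; hosts[++n] = $2 }
    END {
      printf "{\n  \"hosts\": [\n"
      for (i = 1; i <= n; i++)
        printf "    \"%s\"%s\n", hosts[i], (i < n ? "," : "")
      printf "  ],\n  \"CN\": \"kubelet-server\",\n"
      printf "  \"key\": { \"algo\": \"ecdsa\", \"size\": 256 }\n}\n"
    }'
}

# Assumed usage:
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
#   | csr_request_json | cfssl genkey - | cfssljson -bare kubelet-server
```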

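After a lot of fiddling this still failed. If you are interested, you can try the certificate setup yourself using the comment above together with the official Kubernetes documentation: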
https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/


Scrolling further down the GitHub issue, I found a second solution:
https://github.com/kubernetes-sigs/metrics-server/issues/146#issuecomment-472655656
Related official Kubernetes documentation:
https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/


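This approach is meant to be applied through the kubeadm config when the cluster is created. Because my cluster was already running, the parameter below had to be added both to /var/lib/kubelet/config.yaml on every node and to the ConfigMap named kubelet-config-1.23: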

#cat /var/lib/kubelet/config.yaml
...omit...
serverTLSBootstrap: true
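
Since the same line has to be added on every node, an idempotent helper avoids duplicating the key when it is run twice. This is a sketch under assumptions: ensure_server_tls_bootstrap is a hypothetical name, and it only handles a top-level serverTLSBootstrap key written on a single line.

```shell
#!/usr/bin/env bash
# Append "serverTLSBootstrap: true" to a kubelet config file unless the
# key is already there. (ensure_server_tls_bootstrap is a hypothetical helper.)
ensure_server_tls_bootstrap() {
  local cfg="$1"
  # Already set to true: nothing to do.
  if grep -Eq '^serverTLSBootstrap:[[:space:]]*true[[:space:]]*$' "$cfg"; then
    return 0
  fi
  # Present with another value: do not guess, let the operator decide.
  if grep -Eq '^serverTLSBootstrap:' "$cfg"; then
    echo "serverTLSBootstrap has another value in $cfg; edit it by hand" >&2
    return 1
  fi
  # Make sure the file ends with a newline before appending.
  if [ -n "$(tail -c1 "$cfg")" ]; then
    echo >> "$cfg"
  fi
  printf 'serverTLSBootstrap: true\n' >> "$cfg"
}

# Assumed usage on each node:
# ensure_server_tls_bootstrap /var/lib/kubelet/config.yaml
```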

# After editing config.yaml on each node, restart kubelet so that it picks up the change
systemctl daemon-reload && systemctl restart kubelet 

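# With this parameter set, the kubelet serving CSRs are not approved automatically,
# so approve them by hand (note: this command approves every CSR in the list)
kubectl get csr | awk 'NR>1{print $1}' | xargs -n 1 kubectl certificate approve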
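
The one-liner above approves every CSR that kubectl lists, including ones unrelated to kubelet serving certificates. A narrower variant, sketched here, keeps only rows that are still Pending and belong to the kubernetes.io/kubelet-serving signer. The helper name pending_kubelet_serving_csrs is made up, and the sketch assumes the default table output of kubectl get csr, where CONDITION is the last column. It is still worth checking the requestor of each CSR before approving it.

```shell
#!/usr/bin/env bash
# Filter the table printed by `kubectl get csr` (read on stdin) down to the
# names of Pending CSRs for the kubelet-serving signer.
# (pending_kubelet_serving_csrs is a hypothetical helper.)
pending_kubelet_serving_csrs() {
  awk 'NR > 1 && $NF == "Pending" && index($0, "kubernetes.io/kubelet-serving") { print $1 }'
}

# Assumed usage (xargs -r is a GNU extension that skips the run on empty input):
# kubectl get csr | pending_kubelet_serving_csrs \
#   | xargs -r -n 1 kubectl certificate approve
```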

# Edit the kubelet-config-1.23 ConfigMap and add the same parameter
kubectl edit cm kubelet-config-1.23 -n kube-system 
...omit...
  serverTLSBootstrap: true

After restarting the metrics-server pod, it came back as 1/1 Running, and commands such as kubectl top nodes work normally.


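Summary:

Comparison of the approaches:
  1. Add the --kubelet-insecure-tls flag to the metrics-server startup arguments
    Pros: clusters built with kubeadm use self-signed certificates inside the cluster by default, and this flag is a convenient way around TLS verification for components such as metrics-server that initiate TLS connections to Kubernetes from outside.
    Cons: because TLS verification is skipped, the metrics traffic is a security risk, and cluster security then depends on security policy and infrastructure outside the cluster. The metrics-server project says this option is not recommended for production.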

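  2. Add the serverTLSBootstrap: true parameter
    Pros: metrics traffic inside the cluster uses verified TLS, so security is preserved.
    Cons: the resulting CSRs (certificate signing requests) cannot be approved automatically by the default signer in kube-controller-manager, kubernetes.io/kubelet-serving. For example, when the cluster is scaled out later, each new CSR has to be approved by hand with kubectl certificate approve <new-csr-id>.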

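  3. Manually configure the TLS certificates that metrics-server needs
    I could not get this approach to work. It is not friendly to anyone unfamiliar with certificates and the related mechanisms; if you are interested, try it with the issue and the Kubernetes documentation referenced above.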
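
For reference, approach 1 amounts to appending a single argument to the metrics-server container. One way to do that without editing the manifest is a JSON Patch, sketched below. The namespace, the Deployment name, and the assumption that metrics-server is the first container and already has an args list are not from this post; and as noted above, this flag is not recommended for production.

```shell
#!/usr/bin/env bash
# JSON Patch that appends one argument to the args list of the first
# container in a Deployment's pod template.
insecure_tls_patch='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Assumed usage; adjust the namespace and Deployment name to your install:
# kubectl -n kube-system patch deployment metrics-server \
#   --type=json -p "$insecure_tls_patch"
```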

Addendum, the best solution:

Having the CA issue the kubelet server certificates resolves the metrics-server TLS verification problem for good:
https://www.cnblogs.com/ki11-9/articles/18083565

Appendix:
KubeletConfiguration parameters:
https://kubernetes.io/zh-cn/docs/reference/config-api/kubelet-config.v1beta1/

 posted on 2022-09-05 22:50  shelterCJJ