EKS 1.20: ALB creation fails, troubleshooting and fix

Problem summary

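I created an EKS cluster named ekstest as a test environment. AWS recently released EKS versions 1.20 and 1.21, and this test uses 1.20. Previously I had used 1.15 and 1.17.
When deploying Rancher, the ALB could not be brought up. Earlier versions did not have this problem, but 1.20 did.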

Troubleshooting approach

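First, EKS is Kubernetes managed by AWS, so a problem can come from one of two layers: the Kubernetes cluster itself, or the AWS integration.
Let's take them one at a time.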

# Check the Kubernetes side
# Since it is the ingress that fails to bring up the ALB, start by describing the ingress. The 2048 game is used as the example here.
ec2-user:~/environment $ kubectl -n game-2048 describe ingress
Name:             ingress-2048
Namespace:        game-2048
Address:          
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
  Host        Path  Backends
  ----        ----  --------
  *           
              /*   service-2048:80 (10.20.37.169:80,10.20.61.206:80,10.20.67.98:80 + 2 more...)
Annotations:  alb.ingress.kubernetes.io/scheme: internet-facing
              alb.ingress.kubernetes.io/target-type: ip
              kubernetes.io/ingress.class: alb
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedBuildModel  39m   ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 0926f7fc-9407-4cf4-bea5-51b2d7367813
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 128d32d1-1156-448b-b118-95f036db3ce7
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: dcd373c0-68c7-4b52-b945-cc7c249a3999
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 039c7caa-2202-4706-ad8f-7b20a843b3e0
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 87cc1a20-fb1a-4537-b433-789a482464bc
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: a1a17e0a-9c4f-4856-a272-5a42457d0734
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 515f107e-577f-4da0-a20c-a1fe4c280f4a
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: f1477058-2a90-4084-9fcf-331ff50b7f4c
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 528c987d-5a6e-4b06-b289-a9678d372343
  Warning  FailedBuildModel  31m (x9 over 39m)  ingress  (combined from similar events): Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 5ffebf17-f4d7-46f3-8cc4-1a77a0cf3fc2
		   

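⚠️: The events above show "UnauthorizedOperation: You are not authorized to perform this operation." This means the ingress controller is trying to perform an operation but is not authorized to do so.
That reminded me that EKS needs IAM permissions in order to bring up an ALB.
See: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/installation/#iam-permissions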
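
When the event list is long, it can help to reduce it to the distinct error codes and the request IDs, which can then be looked up in the API call logs. The following is only a minimal sketch: the helper name summarize_events and both regular expressions are assumptions derived from the event text shown above, not part of kubectl or the controller.

```python
import re

# Assumed patterns, derived only from the event text shown above.
ERROR_CODE = re.compile(r"Failed build model due to .*?: (\w+): ")
REQUEST_ID = re.compile(r"request id: ([0-9a-f-]{36})")


def summarize_events(describe_output: str) -> dict:
    """Return the distinct AWS error codes and all request IDs in the text."""
    return {
        "error_codes": sorted(set(ERROR_CODE.findall(describe_output))),
        "request_ids": REQUEST_ID.findall(describe_output),
    }


# Two events copied from the describe output above.
sample = """\
  Warning  FailedBuildModel  39m   ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 0926f7fc-9407-4cf4-bea5-51b2d7367813
  Warning  FailedBuildModel  39m  ingress  Failed build model due to couldn't auto-discover subnets: UnauthorizedOperation: You are not authorized to perform this operation.
           status code: 403, request id: 128d32d1-1156-448b-b118-95f036db3ce7
"""
print(summarize_events(sample))
```

The input could also be the saved output of kubectl -n game-2048 describe ingress read from a file.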

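That page has a section on IAM permissions:
The controller runs on the worker nodes, so it needs IAM permissions to access AWS ALB/NLB resources. The IAM permissions can be set up through IAM roles for ServiceAccount, or attached directly to the worker node IAM role.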

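Our earlier EKS 1.15 cluster never failed to bring up an ALB, so the new failure is probably related to these permissions. We downloaded this iam-policy.json and compared it with our existing IAM policy named AWSLoadBalancerControllerIAMPolicy.
The comparison showed that the older policy was missing one action: ec2:DescribeAvailabilityZones
Command to download iam-policy.json: curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.2.1/docs/install/iam_policy.json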
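
The comparison can be done by eye, but a small script makes it harder to overlook a single action. The sketch below is one possible way to do it, not the method used in this post: the function names are made up for illustration, and the two inline policies are trimmed, hypothetical stand-ins for the existing policy and the downloaded iam-policy.json. It compares action strings literally and does not expand wildcards such as ec2:Describe*.

```python
import json


def policy_actions(policy: dict) -> set:
    """Collect every action named in the Allow statements of an IAM policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        if isinstance(action, str):  # Action may be a string or a list
            action = [action]
        actions.update(action)
    return actions


def missing_actions(current: dict, reference: dict) -> list:
    """Actions allowed by the reference policy but absent from the current one."""
    return sorted(policy_actions(reference) - policy_actions(current))


# Hypothetical, trimmed policies; real ones would be loaded with json.load().
current = json.loads("""{"Version": "2012-10-17", "Statement": [
  {"Effect": "Allow", "Action": ["ec2:DescribeSubnets", "ec2:DescribeVpcs"], "Resource": "*"}]}""")
reference = json.loads("""{"Version": "2012-10-17", "Statement": [
  {"Effect": "Allow",
   "Action": ["ec2:DescribeAvailabilityZones", "ec2:DescribeSubnets", "ec2:DescribeVpcs"],
   "Resource": "*"}]}""")

print(missing_actions(current, reference))  # ['ec2:DescribeAvailabilityZones']
```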

Fix

We located the IAM policy named AWSLoadBalancerControllerIAMPolicy
and added the missing action.
We then redeployed the test project, and the ALB came up successfully. Problem solved.

ec2-user:~/environment $ kubectl -n game-2048 get ingress
NAME           CLASS    HOSTS   ADDRESS                                                                   PORTS   AGE
ingress-2048   <none>   *       k8s-game2048-ingress2-8e72ecc-1439841.us-west-2.elb.amazonaws.com   80      3h39m

Summary

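The real troubleshooting was far less straightforward than this write-up suggests. We inspected the related ingress, svc, and pod resources in EKS before finding the 403 error, and then used the API call logs in AWS CloudTrail to pin down the cause.
It is easy to see that the official AWS documentation is updated somewhat slowly, but a Google search still turns up useful references, for example: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/installation/#iam-permissions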
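
The CloudTrail step can be sketched roughly as follows. This assumes a CloudTrail log file has already been downloaded and parsed into a dict with a top-level "Records" list. The helper name denied_calls and the sample records are hypothetical and only illustrate the shape of the data, they are not the actual logs from this incident.

```python
def denied_calls(trail: dict) -> list:
    """Return (eventSource, eventName, errorCode) for calls rejected by IAM."""
    denied = []
    for record in trail.get("Records", []):
        code = record.get("errorCode", "")
        # Authorization failures carry an errorCode; successful calls have none.
        if "Unauthorized" in code or "AccessDenied" in code:
            denied.append((record.get("eventSource"), record.get("eventName"), code))
    return denied


# Hypothetical sample records, for illustration only.
trail = {
    "Records": [
        {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeSubnets"},
        {
            "eventSource": "ec2.amazonaws.com",
            "eventName": "DescribeAvailabilityZones",
            "errorCode": "Client.UnauthorizedOperation",
            "errorMessage": "You are not authorized to perform this operation.",
        },
    ]
}

for source, name, code in denied_calls(trail):
    print(source, name, code)
```

The eventName of a rejected call points to the IAM action that the policy is missing.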

Working through a problem like this takes calm analysis and familiarity with the overall EKS workflow.

Onward. Another small problem solved.

posted @ 2021-09-18 15:06  Star-Hitian