AWS ECS service updates and task restarts: scenarios, events, and recorded API calls

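API calls involved in creating an ECS service

Note: the API calls below are numbered in the order they occurred. Most lists run top to bottom; the lists that include timestamps are shown newest first, so read them from the bottom up.

CloudTrail records for manually creating a service: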

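  1. Create the service and tag the network interface

    Note: for the network interface (ENI) that Fargate creates

    • the description is arn:aws-cn:ecs:cn-north-1:xxxxxxx:attachment/64195c73-bba9-4ba3-8ef1-f97bf7c251f6

    • the tags carry the cluster name and the service name

    Oddly, the trail contains no API call that creates the ENI, i.e. no CreateNetworkInterface

    1. CreateTags,ecs-eni-provisioning,ec2.amazonaws.com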
    2. CreateService,username,ecs.amazonaws.com
    
  2. Create the Application Auto Scaling target tracking policy

    1. UpdateService,username,ecs.amazonaws.com
    2. PutMetricAlarm,username,monitoring.amazonaws.com 
    3. PutMetricAlarm,username,monitoring.amazonaws.com 
    4. PutScalingPolicy,username,autoscaling.amazonaws.com
    5. RegisterScalableTarget,username,autoscaling.amazonaws.com
    
  3. Register the tasks with the ELB

    1. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
    
  4. Create the log stream

    1. CreateLogStream,ECS_STS_Session,logs.amazonaws.com
    

API calls when scaling the service's task count from 0 to 1

1. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
2. CreateLogStream,61c787040c6140be8f4c0c70c0068d6d,logs.amazonaws.com
3. CreateTags,ecs-eni-provisioning,ec2.amazonaws.com
4. UpdateInstanceInformation,i-072d61382ddbe7ba3,ssm.amazonaws.com
5. UpdateService,username,ecs.amazonaws.com

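There are three deployment controller types, each with its own update type

  • ECS: rolling update, triggered manually or through CodePipeline
  • CODE_DEPLOY: blue/green deployment
  • EXTERNAL: deployment logic supplied by a third-party deployment controller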

ELB health check failures

https://aws.amazon.com/cn/premiumsupport/knowledge-center/ecs-fargate-health-check-failures/

The following messages in the ECS service events indicate that tasks were restarted because they failed health checks

(service AWS-service) (port 8080) is unhealthy in (target-grouparn:uxyztargetgroup/aws-targetgroup/123456789) due to (reason Health checks failed with these codes: [502]) or [request timeout]
(service AWS-Service) (port 8080) is unhealthy in target-group tf-20190411170 due to (reason Health checks failed)

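Possible causes of the failed health checks include

  • Network configuration between the ELB and the ECS tasks: the subnets and security groups must allow the traffic
  • Whether the health check port and path are configured correctly
  • The ELB health check grace period, i.e. the time to wait before the task is able to respond
  • CPU or memory usage of the service that is too high
  • Application errors, which require looking at the specific error logs
  • Whether the application's dependencies are running normally (for example, a backend database service)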
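
When many services are involved, it can help to pull the port, target group, and reason out of these event messages automatically. The sketch below is only an illustration: the helper name parse_unhealthy_event and the regular expression are assumptions written against the two sample messages quoted above, and other message variants may not match.

```python
import re

# Hypothetical helper: the pattern is written against the two sample
# "is unhealthy in" messages quoted in this post and is not an official format.
UNHEALTHY = re.compile(
    r"\(port (?P<port>\d+)\) is unhealthy in "
    r"\(?(?P<target>target-group\s*[^\s()]+)\)? "
    r"due to \(reason (?P<reason>[^)]*)\)"
)


def parse_unhealthy_event(message):
    """Return a dict with port, target and reason, or None if no match."""
    match = UNHEALTHY.search(message)
    if match is None:
        return None
    return {
        "port": int(match.group("port")),
        "target": match.group("target"),
        "reason": match.group("reason"),
    }
```

The messages themselves could come from the events list of a DescribeServices response; that step is left out here.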

Fargate infrastructure maintenance

https://docs.amazonaws.cn/en_us/AmazonECS/latest/userguide/task-maintenance.html

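For ECS services launched on Fargate, tasks may be replaced because of maintenance on the underlying infrastructure

The ECS events then show the following stopped reason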

stoppedReason:ECS is performing maintenance on the underlying infrastructure hosting the task

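Fargate maintenance is usually caused by host problems or security vulnerability patching, and it is handled differently for standalone tasks and service tasks

  • Standalone tasks: an email notification is sent for both host problems and vulnerability patching
  • Service tasks: no email is sent for host problems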
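
Since service tasks get no email for host problems, one way to notice these replacements is to scan stopped tasks for the maintenance reason. The following is a minimal sketch under assumptions: the function name stopped_for_maintenance is made up, the input is assumed to look like the "tasks" entries of a DescribeTasks response, and the match is a plain substring test on the stoppedReason quoted above.

```python
# The phrase is copied from the stoppedReason quoted in this post; matching on
# a substring is an assumption and may need adjusting if the wording changes.
MAINTENANCE_PHRASE = "performing maintenance on the underlying infrastructure"


def stopped_for_maintenance(tasks):
    """Return the ARNs of tasks whose stoppedReason mentions maintenance.

    `tasks` is a list of dicts shaped like the "tasks" entries of an ECS
    DescribeTasks response; only taskArn and stoppedReason are read.
    """
    return [
        task.get("taskArn", "")
        for task in tasks
        if MAINTENANCE_PHRASE in task.get("stoppedReason", "")
    ]
```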

Service Auto Scaling updates

Create a scaling policy that holds the tasks' average CPU utilization at a target value; to make testing easier, a low value of 20% is used
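
The policy in this test was created in the console. As a rough sketch of the same settings expressed as parameters for the Application Auto Scaling PutScalingPolicy API, the helper below builds the keyword arguments. The function name, the policy name, and the cluster and service names in the usage note are placeholders rather than values confirmed for this test.

```python
def target_tracking_policy(cluster, service, target_cpu=20.0):
    """Build keyword arguments for Application Auto Scaling PutScalingPolicy."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Hold the service's average CPU utilization near this percentage.
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    }
```

With boto3 installed, the result could be passed to the application-autoscaling client as put_scaling_policy(**target_tracking_policy("my-cluster", "my-service")) once the service has been registered as a scalable target.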

New alarms are created automatically in CloudWatch

Apply load and observe the scaling behavior

$ webbench -c 10 -t 3600 http://main-alb-1897344746.cn-north-1.elb.amazonaws.com.cn:8097/

Scale-out

Check the service events in the ECS console.

CloudTrail recorded the following calls (newest first)

4.RegisterTargets	March 07, 2023, 22:16:03 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
3.CreateLogStream	March 07, 2023, 22:15:47 (UTC+08:00)	500589909f0a49119ff2ea8cf4e6aa9c	logs.amazonaws.com
2.CreateTags	March 07, 2023, 22:15:38 (UTC+08:00)	ecs-eni-provisioning	ec2.amazonaws.com
1.UpdateService	March 07, 2023, 22:15:18 (UTC+08:00)	AutoScaling-UpdateDesiredCapacity	ecs.amazonaws.com
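
Because the console export lists the newest call first, a small script can put the rows back into chronological order. This is a sketch under assumptions: the helper names parse_row and chronological are made up, and the rows are assumed to be tab-separated exactly as in the listing above, with timestamps in the console's "March 07, 2023, 22:16:03 (UTC+08:00)" format.

```python
from datetime import datetime

# Rows copied from the scale-out listing above; columns are tab-separated.
ROWS = [
    "4.RegisterTargets\tMarch 07, 2023, 22:16:03 (UTC+08:00)\tecs-service-scheduler\telasticloadbalancing.amazonaws.com",
    "3.CreateLogStream\tMarch 07, 2023, 22:15:47 (UTC+08:00)\t500589909f0a49119ff2ea8cf4e6aa9c\tlogs.amazonaws.com",
    "2.CreateTags\tMarch 07, 2023, 22:15:38 (UTC+08:00)\tecs-eni-provisioning\tec2.amazonaws.com",
    "1.UpdateService\tMarch 07, 2023, 22:15:18 (UTC+08:00)\tAutoScaling-UpdateDesiredCapacity\tecs.amazonaws.com",
]


def parse_row(row):
    """Split one exported row into event name, time, user and source."""
    numbered, when, user, source = row.split("\t")
    return {
        "event": numbered.split(".", 1)[1],  # drop the "4." style prefix
        "time": datetime.strptime(when, "%B %d, %Y, %H:%M:%S (UTC%z)"),
        "user": user,
        "source": source,
    }


def chronological(rows):
    """Return the parsed rows sorted from oldest to newest."""
    return sorted((parse_row(row) for row in rows), key=lambda e: e["time"])


if __name__ == "__main__":
    for entry in chronological(ROWS):
        print(entry["time"].isoformat(), entry["event"], entry["user"])
```

Sorting by timestamp rather than by the hand-written numbers also works for trails where several calls share the same second, although their relative order is then not guaranteed.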

Scale-in

Check the service events in the ECS console.

CloudTrail recorded the following calls

1. UpdateService,AutoScaling-UpdateDesiredCapacity,ecs.amazonaws.com
2. DeregisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
3. DeleteNetworkInterface,ecs-eni-provisioning,ec2.amazonaws.com

Rolling update

Manually triggered rolling update

Check the service events in the ECS console.

For a service update on Fargate, the trail contains the following calls

1. UpdateService,username,ecs.amazonaws.com
2. CreateTags,ecs-eni-provisioning,ec2.amazonaws.com
3. CreateLogStream,53b51680f06640bd9e63a4272607d60b,logs.amazonaws.com
4. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
5. DeregisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
6. DeleteNetworkInterface,ecs-eni-provisioning,ec2.amazonaws.com

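Because my Fargate tasks are attached to an ELB, extra events appear. The calls above correspond to

  • Updating the ECS service
  • Tagging the network interface; oddly, there is no event for creating the network interface
  • Creating a new log stream, whose name corresponds to the task ID
  • Registering the new targets
  • Deregistering the old targets
  • Deleting the old task's ENI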

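Rolling update triggered by CodePipeline

Create a new pipeline with an S3 bucket as the source, skip the build stage, and in the deploy stage choose Amazon ECS as the provider (specifying the cluster name and service name)

Create imagedefinitions.json, package it into a zip file, and upload it to the source S3 bucket, where

  • name is the name of the container
  • imageUri is the URI of the image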
[
  {
    "name": "nginx",
    "imageUri": "xxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/nginx:1-alpine-perl"
  }
]
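
The packaging step can also be scripted. The following minimal sketch uses only the Python standard library; the function name build_imagedefinitions_zip is made up, and placing the JSON file at the root of the archive is an assumption about the expected layout.

```python
import io
import json
import zipfile


def build_imagedefinitions_zip(container_name, image_uri):
    """Return the bytes of a zip archive holding imagedefinitions.json."""
    definitions = [{"name": container_name, "imageUri": image_uri}]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Assumption: the file sits at the root of the archive.
        archive.writestr(
            "imagedefinitions.json", json.dumps(definitions, indent=2)
        )
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_imagedefinitions_zip(
        "nginx",
        "xxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/nginx:1-alpine-perl",
    )
    # Written locally; uploading it to the source bucket is a separate step.
    with open("imagedefinitions.zip", "wb") as handle:
        handle.write(data)
```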

The resulting pipeline

$ aws codepipeline get-pipeline --name test-ecs-inplace
{
    "pipeline": {
        "name": "test-ecs-inplace",
        "roleArn": "arn:aws-cn:iam::xxxxxxxx:role/service-role/AWSCodePipelineServiceRole-cn-north-1-test-ecs-inplace",
        "artifactStore": {
            "type": "S3",
            "location": "codepipeline-cn-north-1-482183469511"
        },
        "stages": [
            {
                "name": "Source",
                "actions": [
                    {
                        "name": "Source",
                        "actionTypeId": {
                            "category": "Source",
                            "owner": "AWS",
                            "provider": "S3",
                            "version": "1"
                        },
                        "runOrder": 1,
                        "configuration": {
                            "S3Bucket": "xxxxxxx",
                            "PollForSourceChanges": "false",
                            "S3ObjectKey": "imagedefinitions.zip"
                        },
                        "outputArtifacts": [
                            {
                                "name": "SourceArtifact"
                            }
                        ],
                        "inputArtifacts": [],
                        "region": "cn-north-1",
                        "namespace": "SourceVariables"
                    }
                ]
            },
            {
                "name": "Deploy",
                "actions": [
                    {
                        "name": "Deploy",
                        "actionTypeId": {
                            "category": "Deploy",
                            "owner": "AWS",
                            "provider": "ECS",
                            "version": "1"
                        },
                        "runOrder": 1,
                        "configuration": {
                            "ServiceName": "testtemp",
                            "ClusterName": "workfargate"
                        },
                        "outputArtifacts": [],
                        "inputArtifacts": [
                            {
                                "name": "SourceArtifact"
                            }
                        ],
                        "region": "cn-north-1",
                        "namespace": "DeployVariables"
                    }
                ]
            }
        ],
        "version": 1
    }
}

Trigger a deployment manually and check the ECS service events.

The related API calls (newest first) show that the logic is in fact the same as a manual update, apart from the additional RegisterTaskDefinition call

8.DeleteNetworkInterface	March 07, 2023, 22:51:54 (UTC+08:00)	ecs-eni-provisioning	ec2.amazonaws.com
7.DeregisterTargets	March 07, 2023, 22:51:06 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
6.RegisterTargets	March 07, 2023, 22:50:09 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
5.CreateLogStream	March 07, 2023, 22:50:00 (UTC+08:00)	f7daf86824ff40d5a30722c5eec741e5	logs.amazonaws.com
4.CreateTags	March 07, 2023, 22:49:53 (UTC+08:00)	ecs-eni-provisioning	ec2.amazonaws.com
3.RegisterTaskDefinition	March 07, 2023, 22:49:14 (UTC+08:00)	1678200554416	ecs.amazonaws.com
2.UpdateService	March 07, 2023, 22:49:14 (UTC+08:00)	1678200554416	ecs.amazonaws.com
1.CreatePipeline	March 07, 2023, 22:49:12 (UTC+08:00)	zhaojie	codepipeline.amazonaws.com

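CodeDeploy blue/green deployment

Note: no test listener was configured for the blue/green deployments in this test

Blue/green deployment on the EC2 launch type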

https://docs.amazonaws.cn/zh_cn/AmazonECS/latest/developerguide/deployment-type-bluegreen.htm

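An ECS blue/green deployment can be set up in two ways

  • Integrate CodeDeploy when creating the service in the ECS console, which automatically creates the application (AppECS prefix) and the deployment group (DgpECS prefix)

  • Create the CodeDeploy application manually, then create a deployment group that targets the ECS service

Set the deployment configuration to CodeDeployDefault.ECSAllAtOnce (to speed up the deployment) and enter the revision manually as follows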

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: arn:aws-cn:ecs:cn-north-1:xxxxxxxx:task-definition/hello-server-prod:9
        LoadBalancerInfo:
          ContainerName: "hello-server"
          ContainerPort: 80

To make testing easier, shorten the deployment settings and wait times

Trigger the blue/green deployment of the ECS service manually

Unlike a rolling update, ECS creates task sets to switch the traffic, and the deployment ID shown for them is the CodeDeploy deployment ID

The ECS console event records show that a task set is likewise an independent unit for managing tasks

In the related CloudTrail events (newest first), the task set calls are made under the deployment ID, d092O68729 here

7.DeregisterTargets	March 07, 2023, 23:17:26 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
6.DeleteTaskSet	March 07, 2023, 23:16:43 (UTC+08:00)	d092O68729	ecs.amazonaws.com
5.ModifyListener	March 07, 2023, 23:11:41 (UTC+08:00)	d092O68729	elasticloadbalancing.amazonaws.com
4.UpdateServicePrimaryTaskSet	March 07, 2023, 23:11:41 (UTC+08:00)	d092O68729	ecs.amazonaws.com
3.RegisterTargets	March 07, 2023, 23:10:15 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
2.CreateTaskSet	March 07, 2023, 23:09:36 (UTC+08:00)	d092O68729	ecs.amazonaws.com
1.CreateDeployment	March 07, 2023, 23:09:33 (UTC+08:00)	zhaojie	codedeploy.amazonaws.com

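Blue/green deployment on Fargate

Following the same approach as for EC2, choose the blue/green deployment integration when creating the ECS service so that the CodeDeploy application is created automatically

Create a deployment for the deployment group with the following revision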

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: arn:aws-cn:ecs:cn-north-1:xxxxxxxx:task-definition/nginx-fargate:3
        LoadBalancerInfo:
          ContainerName: "nginx"
          ContainerPort: 80

The deployment ID is d-WPJC7NDCL. Check the ECS service events.

CloudTrail records (newest first)

9.DeleteNetworkInterface	March 07, 2023, 23:35:12 (UTC+08:00)	ecs-eni-provisioning	ec2.amazonaws.com
8.DeregisterTargets	March 07, 2023, 23:34:24 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
7.ModifyListener	March 07, 2023, 23:28:36 (UTC+08:00)	dWPJC7NDCL	elasticloadbalancing.amazonaws.com
6.UpdateServicePrimaryTaskSet	March 07, 2023, 23:28:36 (UTC+08:00)	dWPJC7NDCL	ecs.amazonaws.com
5.RegisterTargets	March 07, 2023, 23:27:03 (UTC+08:00)	ecs-service-scheduler	elasticloadbalancing.amazonaws.com
4.CreateLogStream	March 07, 2023, 23:26:54 (UTC+08:00)	536c5073054e48dc8249551a8df999f0	logs.amazonaws.com
3.CreateTags	March 07, 2023, 23:26:47 (UTC+08:00)	ecs-eni-provisioning	ec2.amazonaws.com
2.CreateTaskSet	March 07, 2023, 23:26:31 (UTC+08:00)	dWPJC7NDCL	ecs.amazonaws.com
1.CreateDeployment	March 07, 2023, 23:26:28 (UTC+08:00)	zhaojie	codedeploy.amazonaws.com
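
The trails collected in this post suggest a rough heuristic for telling the update types apart from CloudTrail alone. The sketch below encodes it; the function name classify_update and the returned labels are made up, and the rules only reflect what was observed in these tests. An EXTERNAL deployment, for example, would also create task sets and could not be told apart from CodeDeploy this way.

```python
def classify_update(events):
    """Guess what changed a service from (event name, user name) pairs.

    A heuristic based only on the trails recorded in this post.
    """
    names = {name for name, _ in events}
    users = {user for _, user in events}
    if {"CreateTaskSet", "UpdateServicePrimaryTaskSet"} <= names:
        # Task sets appeared only in the blue/green trails.
        return "task-set deployment"
    if "AutoScaling-UpdateDesiredCapacity" in users:
        # Service Auto Scaling calls UpdateService under this name.
        return "auto-scaling"
    if "UpdateService" in names:
        # Rolling update or a manual desired-count change.
        return "update-service"
    return "unknown"


if __name__ == "__main__":
    blue_green = [
        ("CreateDeployment", "zhaojie"),
        ("CreateTaskSet", "dWPJC7NDCL"),
        ("UpdateServicePrimaryTaskSet", "dWPJC7NDCL"),
    ]
    print(classify_update(blue_green))
```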

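External deployment

With the EXTERNAL deployment type, users bring a third-party deployment controller to take full control of the ECS service deployment process

The service is managed through the service management API operations (CreateService, UpdateService, DeleteService) or the task set management API operations (CreateTaskSet, UpdateTaskSet, UpdateServicePrimaryTaskSet, DeleteTaskSet).

  • UpdateService updates the desired count and health check grace period parameters of the service. To change the launch type, platform version, load balancer details, network configuration, or task definition, you must create a new task set.
  • UpdateTaskSet updates only the scale parameter of a task set.
  • UpdateServicePrimaryTaskSet changes which task set in the service is the primary task set. When you call DescribeServices, it returns all fields specified for the primary task set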
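
As an illustration of how an external controller might chain the task set operations, the sketch below creates a new task set, promotes it to primary, and optionally deletes the old one. It was not part of this test. The function name shift_to_new_task_set is made up, the ecs argument is assumed to be a boto3 ECS client or anything with the same three methods, and waiting for the new task set to stabilize before switching is deliberately left out.

```python
def shift_to_new_task_set(ecs, cluster, service, task_definition,
                          old_task_set=None, **task_set_options):
    """Create a task set, make it primary, then optionally drop the old one.

    `task_set_options` is passed through to create_task_set, for example
    launchType, networkConfiguration or loadBalancers when they are needed.
    """
    created = ecs.create_task_set(
        cluster=cluster,
        service=service,
        taskDefinition=task_definition,
        # Run the new task set at 100% of the service's desired count.
        scale={"value": 100.0, "unit": "PERCENT"},
        **task_set_options,
    )
    new_id = created["taskSet"]["id"]
    # A real controller would wait here until the new tasks are healthy.
    ecs.update_service_primary_task_set(
        cluster=cluster, service=service, primaryTaskSet=new_id
    )
    if old_task_set is not None:
        ecs.delete_task_set(
            cluster=cluster, service=service, taskSet=old_task_set
        )
    return new_id
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real cluster.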

Side note

None of the ECS services in this test used the Cloud Map service discovery integration. With Cloud Map, the following two API calls might also appear

RegisterInstance
DeregisterInstance
posted @ 2023-03-07 23:41  zhaojie10