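AWS ECS service updates and task restarts: scenarios, events, and API call records
References
API calls related to creating an ECS service
Note: all API calls below are listed from top to bottom in the order the events occurred, except for the timestamped lists, which are shown newest first and numbered in chronological order
The CloudTrail records for manually creating a service are as follows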
- Create the service and the tags on the network interface (ENI)
  Note: for the ENI that Fargate creates
  - the description is arn:aws-cn:ecs:cn-north-1:xxxxxxx:attachment/64195c73-bba9-4ba3-8ef1-f97bf7c251f6
  - the tags carry the cluster name and the service name
  The odd part is that there is no API call that creates the ENI, i.e. no CreateNetworkInterface
  1. CreateTags,ecs-eni-provisioning,ec2.amazonaws.com
  2. CreateService,username,ecs.amazonaws.com
- Create the Application Auto Scaling target tracking policy
  1. UpdateService,username,ecs.amazonaws.com
  2. PutMetricAlarm,username,monitoring.amazonaws.com
  3. PutMetricAlarm,username,monitoring.amazonaws.com
  4. PutScalingPolicy,username,autoscaling.amazonaws.com
  5. RegisterScalableTarget,username,autoscaling.amazonaws.com
- Register the task with the ELB
  1. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
- Create the log stream
  1. CreateLogStream,ECS_STS_Session,logs.amazonaws.com
API calls when the service's task count is increased from 0 to 1
1. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
2. CreateLogStream,61c787040c6140be8f4c0c70c0068d6d,logs.amazonaws.com
3. CreateTags,ecs-eni-provisioning,ec2.amazonaws.com
4. UpdateInstanceInformation,i-072d61382ddbe7ba3,ssm.amazonaws.com
5. UpdateService,username,ecs.amazonaws.com
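The comma-separated records in these notes follow the pattern event name, user name, event source. As a minimal sketch, they can be parsed into structured records so that different scenarios are easier to compare. `TrailRecord` and `parse_record` are hypothetical names made up for this illustration, not part of any AWS SDK.

```python
from typing import NamedTuple


class TrailRecord(NamedTuple):
    event_name: str
    user_name: str
    event_source: str


def parse_record(line: str) -> TrailRecord:
    """Parse one 'N. EventName,userName,eventSource' line from these notes."""
    # Drop the leading list number ("3. ") when it is present.
    head, sep, rest = line.partition(". ")
    body = rest if sep and head.strip().isdigit() else line
    event_name, user_name, event_source = (part.strip() for part in body.split(","))
    return TrailRecord(event_name, user_name, event_source)


records = [
    parse_record(line)
    for line in [
        "1. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com",
        "5. UpdateService,username,ecs.amazonaws.com",
    ]
]
print(records[1].event_name)
```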
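There are three deployment controller types, each corresponding to its own update type
- ECS: rolling update, triggered manually or through CodePipeline
- CODE_DEPLOY: blue/green deployment
- EXTERNAL: the deployment logic of a third-party deployment controller
ELB health check failures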
https://aws.amazon.com/cn/premiumsupport/knowledge-center/ecs-fargate-health-check-failures/
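The following log entries appear on the events page of the ECS service. They show that the tasks were restarted because they failed the health check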
(service AWS-service) (port 8080) is unhealthy in (target-grouparn:uxyztargetgroup/aws-targetgroup/123456789) due to (reason Health checks failed with these codes: [502]) or [request timeout]
(service AWS-Service) (port 8080) is unhealthy in target-group tf-20190411170 due to (reason Health checks failed)
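Possible causes of the failed task health checks include
- Network configuration between the ELB and the ECS tasks: the subnets and security groups must allow the traffic
- Whether the health check port and path are configured correctly
- The ELB health check grace period, i.e. how long to wait before the task is able to respond
- CPU and memory usage of the service being too high
- Application errors, which require looking at the specific error logs
- Whether the dependencies of the task's application are running normally (for example, a backend database service)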
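When scanning many service events for this failure, it can help to pull out the port, target group, and reason. The following is a rough sketch that only targets the two sample messages quoted above; the regular expression and the `parse_unhealthy_event` helper are assumptions made for illustration, and other message wordings may not match.

```python
import re

# Matches both sample formats: with and without parentheses around the target group.
UNHEALTHY = re.compile(
    r"\(service (?P<service>[^)]+)\) \(port (?P<port>\d+)\) is unhealthy in "
    r"\(?target-group ?(?P<target_group>[^\s)]+)\)? due to "
    r"\(reason (?P<reason>[^)]+)\)"
)


def parse_unhealthy_event(message):
    """Return service, port, target group, and reason, or None if no match."""
    match = UNHEALTHY.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["port"] = int(info["port"])
    return info


sample = (
    "(service AWS-Service) (port 8080) is unhealthy in target-group "
    "tf-20190411170 due to (reason Health checks failed)"
)
print(parse_unhealthy_event(sample))
```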
Fargate infrastructure maintenance
https://docs.amazonaws.cn/en_us/AmazonECS/latest/userguide/task-maintenance.html
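For ECS services launched on the Fargate platform, tasks may be replaced because of maintenance on the underlying infrastructure
The following log entry appears in the ECS service events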
stoppedReason:ECS is performing maintenance on the underlying infrastructure hosting the task
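Fargate maintenance is usually caused by host issues or security vulnerability patching, and it is handled differently for standalone tasks and service tasks
- Standalone tasks: an email notification is sent for both host issues and vulnerability patching
- Service tasks: no email is sent for host issues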
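Since service tasks may be replaced without an email, one option is to filter stopped tasks by their `stoppedReason`. This is a small sketch that assumes the task descriptions have already been fetched (for example from `DescribeTasks` output, which carries `taskArn` and `stoppedReason`); the sample ARNs and the helper name are hypothetical.

```python
# Substring taken from the stoppedReason quoted above, lowercased for matching.
MAINTENANCE_MARKER = "performing maintenance on the underlying infrastructure"


def maintenance_stops(tasks):
    """Return the ARNs of tasks whose stoppedReason points to maintenance."""
    return [
        task["taskArn"]
        for task in tasks
        if MAINTENANCE_MARKER in task.get("stoppedReason", "").lower()
    ]


# Hypothetical sample data shaped like DescribeTasks output.
stopped = [
    {
        "taskArn": "task-a",
        "stoppedReason": "ECS is performing maintenance on the underlying "
        "infrastructure hosting the task",
    },
    {"taskArn": "task-b", "stoppedReason": "Scaling activity initiated by deployment"},
    {"taskArn": "task-c"},
]
print(maintenance_stops(stopped))
```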
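Service Auto Scaling updates
Create a target tracking scaling policy that holds the average CPU utilization of the tasks at a target value. For ease of testing, a low value of 20% is used
New alarms are created automatically in CloudWatch
Put load on the service and observe the scaling behavior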
$ webbench -c 10 -t 3600 http://main-alb-1897344746.cn-north-1.elb.amazonaws.com.cn:8097/
Scale-out behavior
Check the service events in the ECS console
The recorded CloudTrail calls are as follows
4.RegisterTargets March 07, 2023, 22:16:03 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
3.CreateLogStream March 07, 2023, 22:15:47 (UTC+08:00) 500589909f0a49119ff2ea8cf4e6aa9c logs.amazonaws.com
2.CreateTags March 07, 2023, 22:15:38 (UTC+08:00) ecs-eni-provisioning ec2.amazonaws.com
1.UpdateService March 07, 2023, 22:15:18 (UTC+08:00) AutoScaling-UpdateDesiredCapacity ecs.amazonaws.com
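The timestamped rows are shown newest first. As a sketch of how to read them chronologically and measure how long the scale-out took, the rows can be parsed and sorted by time. The regular expression is an assumption fitted to the row layout above, and `%B` relies on English month names in the current locale.

```python
import re
from datetime import datetime, timedelta, timezone

ROW = re.compile(
    r"(?P<seq>\d+)\.\s*(?P<event>\w+) "
    r"(?P<stamp>\w+ \d{2}, \d{4}, \d{2}:\d{2}:\d{2}) "
    r"\(UTC(?P<sign>[+-])(?P<hh>\d{2}):(?P<mm>\d{2})\) "
    r"(?P<user>\S+) (?P<source>\S+)"
)


def parse_row(line):
    """Return (timestamp, event name, user name, event source) for one row."""
    m = ROW.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    offset = timedelta(hours=int(m["hh"]), minutes=int(m["mm"]))
    if m["sign"] == "-":
        offset = -offset
    when = datetime.strptime(m["stamp"], "%B %d, %Y, %H:%M:%S")
    return when.replace(tzinfo=timezone(offset)), m["event"], m["user"], m["source"]


rows = [
    "4.RegisterTargets March 07, 2023, 22:16:03 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com",
    "3.CreateLogStream March 07, 2023, 22:15:47 (UTC+08:00) 500589909f0a49119ff2ea8cf4e6aa9c logs.amazonaws.com",
    "2.CreateTags March 07, 2023, 22:15:38 (UTC+08:00) ecs-eni-provisioning ec2.amazonaws.com",
    "1.UpdateService March 07, 2023, 22:15:18 (UTC+08:00) AutoScaling-UpdateDesiredCapacity ecs.amazonaws.com",
]

# Oldest event first, then the time from the first call to the last one.
timeline = sorted((parse_row(row) for row in rows), key=lambda item: item[0])
elapsed = timeline[-1][0] - timeline[0][0]
print([event for _, event, _, _ in timeline])
print(elapsed.total_seconds())
```

Here the interval runs from the `UpdateService` call issued by Application Auto Scaling to the `RegisterTargets` call for the new task.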
Scale-in behavior
Check the service events in the ECS console
The recorded CloudTrail calls are as follows
1. UpdateService,AutoScaling-UpdateDesiredCapacity,ecs.amazonaws.com
2. DeregisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
3. DeleteNetworkInterface,ecs-eni-provisioning,ec2.amazonaws.com
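Rolling update
Manually triggered rolling update
The events in the ECS console are as follows
For a service update on the Fargate platform, the following calls appear in the CloudTrail records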
1. UpdateService,username,ecs.amazonaws.com
2. CreateTags,ecs-eni-provisioning,ec2.amazonaws.com
3. CreateLogStream,53b51680f06640bd9e63a4272607d60b,logs.amazonaws.com
4. RegisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
5. DeregisterTargets,ecs-service-scheduler,elasticloadbalancing.amazonaws.com
6. DeleteNetworkInterface,ecs-eni-provisioning,ec2.amazonaws.com
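Because my Fargate tasks are attached to an ELB, additional events appear. In order, the calls do the following
- Update the ECS service
- Create the tags on the ENI. The odd part is that there is no event for creating the ENI itself
- Create a new log stream, which corresponds to the task ID
- Register the new target
- Deregister the old target
- Delete the ENI of the old task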
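Rolling update triggered by CodePipeline
Create a new pipeline with an S3 bucket as the source, skip the build stage, and choose the Amazon ECS deploy provider for the deploy stage (specifying the cluster name and the service name)
Create imagedefinitions.json
Package it as a zip file and upload it to the source S3 bucket, where
- name is the name of the container
- imageUri is the URI of the image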
[
{
"name": "nginx",
"imageUri": "xxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/nginx:1-alpine-perl"
}
]
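The packaging step can be sketched in a few lines. This example assumes the ECS deploy action reads a file named imagedefinitions.json at the root of the archive, as the pipeline in these notes does; the `build_imagedefinitions_zip` helper is a hypothetical name used for illustration.

```python
import io
import json
import zipfile


def build_imagedefinitions_zip(container_name: str, image_uri: str) -> bytes:
    """Return the bytes of a zip archive holding imagedefinitions.json."""
    definitions = [{"name": container_name, "imageUri": image_uri}]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The file is written at the root of the archive, not in a subfolder.
        archive.writestr("imagedefinitions.json", json.dumps(definitions, indent=2))
    return buffer.getvalue()


payload = build_imagedefinitions_zip(
    "nginx", "xxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/nginx:1-alpine-perl"
)
print(len(payload) > 0)
```

Writing `payload` to a file named imagedefinitions.zip yields the object referenced by `S3ObjectKey` in the source stage.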
The final pipeline is as follows
$ aws codepipeline get-pipeline --name test-ecs-inplace
{
"pipeline": {
"name": "test-ecs-inplace",
"roleArn": "arn:aws-cn:iam::xxxxxxxx:role/service-role/AWSCodePipelineServiceRole-cn-north-1-test-ecs-inplace",
"artifactStore": {
"type": "S3",
"location": "codepipeline-cn-north-1-482183469511"
},
"stages": [
{
"name": "Source",
"actions": [
{
"name": "Source",
"actionTypeId": {
"category": "Source",
"owner": "AWS",
"provider": "S3",
"version": "1"
},
"runOrder": 1,
"configuration": {
"S3Bucket": "xxxxxxx",
"PollForSourceChanges": "false",
"S3ObjectKey": "imagedefinitions.zip"
},
"outputArtifacts": [
{
"name": "SourceArtifact"
}
],
"inputArtifacts": [],
"region": "cn-north-1",
"namespace": "SourceVariables"
}
]
},
{
"name": "Deploy",
"actions": [
{
"name": "Deploy",
"actionTypeId": {
"category": "Deploy",
"owner": "AWS",
"provider": "ECS",
"version": "1"
},
"runOrder": 1,
"configuration": {
"ServiceName": "testtemp",
"ClusterName": "workfargate"
},
"outputArtifacts": [],
"inputArtifacts": [
{
"name": "SourceArtifact"
}
],
"region": "cn-north-1",
"namespace": "DeployVariables"
}
]
}
],
"version": 1
}
}
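Trigger the deployment manually and check the events of the ECS service
Looking at the related API calls shows that the logic is in fact the same as for a manual update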
8.DeleteNetworkInterface March 07, 2023, 22:51:54 (UTC+08:00) ecs-eni-provisioning ec2.amazonaws.com
7.DeregisterTargets March 07, 2023, 22:51:06 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
6.RegisterTargets March 07, 2023, 22:50:09 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
5.CreateLogStream March 07, 2023, 22:50:00 (UTC+08:00) f7daf86824ff40d5a30722c5eec741e5 logs.amazonaws.com
4.CreateTags March 07, 2023, 22:49:53 (UTC+08:00) ecs-eni-provisioning ec2.amazonaws.com
3.RegisterTaskDefinition March 07, 2023, 22:49:14 (UTC+08:00) 1678200554416 ecs.amazonaws.com
2.UpdateService March 07, 2023, 22:49:14 (UTC+08:00) 1678200554416 ecs.amazonaws.com
1.CreatePipeline March 07, 2023, 22:49:12 (UTC+08:00) zhaojie codepipeline.amazonaws.com
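Blue/green deployment with CodeDeploy
Note: the blue/green deployments in this test did not configure a test listener
Blue/green deployment on the EC2 platform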
https://docs.amazonaws.cn/zh_cn/AmazonECS/latest/developerguide/deployment-type-bluegreen.htm
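There are two ways to set up an ECS blue/green deployment
- When creating the service in the ECS console, integrate with CodeDeploy, which automatically creates an application (AppECS prefix) and a deployment group (DgpECS prefix)
- Create the CodeDeploy application manually, then create a deployment group that targets the ECS service
Set the deployment configuration to CodeDeployDefault.ECSAllAtOnce (to speed up the deployment) and fill in the revision manually as follows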
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: arn:aws-cn:ecs:cn-north-1:xxxxxxxx:task-definition/hello-server-prod:9
        LoadBalancerInfo:
          ContainerName: "hello-server"
          ContainerPort: 80
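CodeDeploy also accepts the AppSpec content in JSON. As a sketch, the same revision can be generated from its three variable parts, which avoids indentation mistakes when it is filled in by hand. The `build_ecs_appspec` helper is a hypothetical name, and the values are the ones used in the revision above.

```python
import json


def build_ecs_appspec(task_definition_arn, container_name, container_port):
    """Build the AppSpec structure for an ECS blue/green deployment."""
    return {
        "version": 0.0,
        "Resources": [
            {
                "TargetService": {
                    "Type": "AWS::ECS::Service",
                    "Properties": {
                        "TaskDefinition": task_definition_arn,
                        "LoadBalancerInfo": {
                            "ContainerName": container_name,
                            "ContainerPort": container_port,
                        },
                    },
                }
            }
        ],
    }


appspec = build_ecs_appspec(
    "arn:aws-cn:ecs:cn-north-1:xxxxxxxx:task-definition/hello-server-prod:9",
    "hello-server",
    80,
)
print(json.dumps(appspec, indent=2))
```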
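For ease of testing, choose a fast deployment configuration and shorten the deployment wait times
Trigger the blue/green deployment of the ECS service manually
Unlike a rolling update, ECS creates a task set to switch the traffic, and the deployment ID on it is the CodeDeploy deployment ID
Check the event records in the ECS console. A task set is likewise an independent unit for controlling tasks
Look at the related CloudTrail events, in which the task set calls are identified by the deployment ID d092O68729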
7.DeregisterTargets March 07, 2023, 23:17:26 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
6.DeleteTaskSet March 07, 2023, 23:16:43 (UTC+08:00) d092O68729 ecs.amazonaws.com
5.ModifyListener March 07, 2023, 23:11:41 (UTC+08:00) d092O68729 elasticloadbalancing.amazonaws.com
4.UpdateServicePrimaryTaskSet March 07, 2023, 23:11:41 (UTC+08:00) d092O68729 ecs.amazonaws.com
3.RegisterTargets March 07, 2023, 23:10:15 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
2.CreateTaskSet March 07, 2023, 23:09:36 (UTC+08:00) d092O68729 ecs.amazonaws.com
1.CreateDeployment March 07, 2023, 23:09:33 (UTC+08:00) zhaojie codedeploy.amazonaws.com
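Blue/green deployment on the Fargate platform
Following the same logic as on EC2, choose the blue/green deployment integration when creating the ECS service, which automatically creates the CodeDeploy application
Create a deployment for the deployment group with the following revision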
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: arn:aws-cn:ecs:cn-north-1:xxxxxxxx:task-definition/nginx-fargate:3
        LoadBalancerInfo:
          ContainerName: "nginx"
          ContainerPort: 80
The deployment ID is d-WPJC7NDCL. Check the ECS events
Check the CloudTrail records
9.DeleteNetworkInterface March 07, 2023, 23:35:12 (UTC+08:00) ecs-eni-provisioning ec2.amazonaws.com
8.DeregisterTargets March 07, 2023, 23:34:24 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
7.ModifyListener March 07, 2023, 23:28:36 (UTC+08:00) dWPJC7NDCL elasticloadbalancing.amazonaws.com
6.UpdateServicePrimaryTaskSet March 07, 2023, 23:28:36 (UTC+08:00) dWPJC7NDCL ecs.amazonaws.com
5.RegisterTargets March 07, 2023, 23:27:03 (UTC+08:00) ecs-service-scheduler elasticloadbalancing.amazonaws.com
4.CreateLogStream March 07, 2023, 23:26:54 (UTC+08:00) 536c5073054e48dc8249551a8df999f0 logs.amazonaws.com
3.CreateTags March 07, 2023, 23:26:47 (UTC+08:00) ecs-eni-provisioning ec2.amazonaws.com
2.CreateTaskSet March 07, 2023, 23:26:31 (UTC+08:00) dWPJC7NDCL ecs.amazonaws.com
1.CreateDeployment March 07, 2023, 23:26:28 (UTC+08:00) zhaojie codedeploy.amazonaws.com
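Comparing the traces recorded so far suggests a rough way to tell the scenarios apart: task set calls appear only in the blue/green traces, and the scaling traces carry the user name `AutoScaling-UpdateDesiredCapacity` on `UpdateService`. The sketch below encodes that observation. It is a heuristic drawn from these test runs only, not an official rule, and `classify_update` is a hypothetical helper.

```python
def classify_update(events):
    """Guess what kind of change produced a list of (event name, user name) pairs."""
    events = list(events)
    names = {name for name, _ in events}
    if "CreateTaskSet" in names:
        # Only the blue/green traces in these notes contain task set calls.
        return "blue/green deployment"
    for name, user in events:
        if name == "UpdateService":
            if user == "AutoScaling-UpdateDesiredCapacity":
                return "service auto scaling"
            return "rolling update"
    return "unknown"


fargate_blue_green = [
    ("CreateDeployment", "zhaojie"),
    ("CreateTaskSet", "dWPJC7NDCL"),
    ("UpdateServicePrimaryTaskSet", "dWPJC7NDCL"),
]
print(classify_update(fargate_blue_green))
```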
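External deployment
The user brings a third-party deployment controller to take full control of the deployment process of the ECS service
The details of the service are managed either by the service management API actions (CreateService, UpdateService, and DeleteService) or by the task set management API actions (CreateTaskSet, UpdateTaskSet, UpdateServicePrimaryTaskSet, and DeleteTaskSet).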
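The UpdateService API action updates the desired count and the health check grace period of the service. If the launch type, platform version, load balancer details, network configuration, or task definition needs to be updated, a new task set must be created. The UpdateTaskSet API action updates only the scale parameter of a task set. The UpdateServicePrimaryTaskSet API action changes which task set in the service is the primary task set. When you call the DescribeServices API action, it returns all the fields specified for the primary task set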
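The rules above amount to a lookup from the setting being changed to the API action that has to be called. A minimal sketch follows, using the request parameter names of the ECS API for each setting; the `api_for_change` helper itself is hypothetical.

```python
# Settings that UpdateService can change on a service with an EXTERNAL controller.
UPDATE_SERVICE_FIELDS = {"desiredCount", "healthCheckGracePeriodSeconds"}
# UpdateTaskSet changes only the scale of an existing task set.
UPDATE_TASK_SET_FIELDS = {"scale"}
# Changing any of these requires creating a new task set.
NEW_TASK_SET_FIELDS = {
    "launchType",
    "platformVersion",
    "loadBalancers",
    "networkConfiguration",
    "taskDefinition",
}


def api_for_change(field):
    """Return the API action needed to change the given setting, or None."""
    if field in UPDATE_SERVICE_FIELDS:
        return "UpdateService"
    if field in UPDATE_TASK_SET_FIELDS:
        return "UpdateTaskSet"
    if field in NEW_TASK_SET_FIELDS:
        return "CreateTaskSet"
    return None


print(api_for_change("taskDefinition"))
```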
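Side note
None of the ECS services in these tests used the Cloud Map service discovery integration. With Cloud Map enabled, the following two API calls may also appear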
RegisterInstance
DeregisterInstance