Argo Rollouts
1. Introduction
A complement to the native K8s Deployment functionality, implemented on top of CRDs:
1. Supports classic deployment patterns such as blue-green and canary
2. Provides canary analysis and progressive delivery
3. Integrates with Istio and ingress controllers for advanced traffic management
4. Fine-grained, weighted traffic shifting
5. Automated rollbacks and promotions
6. Manual judgement (approval gates)
7. Customizable metric queries and analysis of business KPIs
2. Basic concepts
2.1 Rollout
A CRD analogous to a Deployment and intended as a drop-in superset of it, implementing more sophisticated rollout functionality.
2.2 Progressive Delivery
Progressive delivery is an evolution of CI/CD: product updates are released in a controlled, gradual way to reduce release risk. It is usually combined with automation and metric analysis to drive automatic promotion or rollback of an update, e.g. by watching exposed Prometheus metrics to decide whether to continue or roll back, achieving a high degree of automation.
2.3 Deployment Strategies
1. Rolling Update
Gradually replaces the old version with the new one; this is the familiar rolling upgrade and the default strategy of a Deployment.
2. Recreate
Deletes the old version before deploying the new one.
3. Blue-Green
Runs two environments (old and new) side by side and switches traffic between them.
4. Canary
Deploys the new and old versions side by side at a fixed replica ratio; the new version is called the canary. With Istio (or another traffic router) the traffic split can be controlled precisely, so both versions run at once with controllable traffic, and progressive delivery then gradually shifts everything over to the new version.
3. Architecture
3.1 Rollout controller
The CRD controller; it implements the logic defined by the CRDs.
3.2 Rollout
The CRD itself; like a Deployment, but enabling more complex deployment behaviour.
3.3 Ingress
Can integrate with service meshes such as Istio (or ingress controllers) to control service traffic at a fine granularity.
3.4 AnalysisTemplate and AnalysisRun
Analysis
Connects a Rollout to a metric provider such as Prometheus and defines thresholds for those metrics; the result determines whether the update is considered successful, and therefore whether the rollout continues or is rolled back.
AnalysisTemplate and ClusterAnalysisTemplate
The former is a namespaced resource, the latter is cluster-scoped and can be used by resources in any namespace.
Both mainly contain the metric queries to run, e.g. PromQL.
3.5 Metric providers
The systems that serve the metrics; Prometheus is the most common choice.
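To make 3.4 and 3.5 concrete, here is a minimal AnalysisTemplate sketch modelled on the upstream examples; the template name success-rate, the service-name argument, the Prometheus address and the PromQL query are all illustrative and must be adapted to your environment:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name # supplied by the Rollout via spec.strategy.canary.analysis.args
  metrics:
  - name: success-rate
    interval: 5m # how often to run the query
    successCondition: result[0] >= 0.95 # threshold that decides whether a measurement passes
    failureLimit: 3 # after this many failed measurements the rollout is aborted and rolled back
    provider:
      prometheus:
        address: http://prometheus.example.svc.cluster.local:9090 # assumed Prometheus endpoint
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))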
4. Installation
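This section was left empty in the notes; as a sketch, the standard installation from the upstream docs installs the controller into its own namespace and adds the kubectl plugin (pin a specific release if you need reproducibility):
# Install the controller and CRDs
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# Install the kubectl plugin (Linux example; pick the binary for your platform)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version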
5. Usage
5.1 Basic features
5.1.1 Deploying a Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  replicas: 5 # desired number of replicas
  strategy: # deployment strategy
    canary: # use the canary strategy
      steps: # the canary steps
      - setWeight: 20 # percentage of traffic sent to the canary version
      - pause: {} # pause the release for observation; an empty pause means pause indefinitely (on the very first deployment the steps are skipped and 100% of traffic goes to the new version)
      - setWeight: 40
      - pause: {duration: 10}
      - setWeight: 60
      - pause: {duration: 10}
      - setWeight: 80
      - pause: {duration: 10}
  revisionHistoryLimit: 2 # how many old revisions to keep, mainly used for rollback
  selector: # label selector, same mechanism as a Deployment
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:blue
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        resources:
          requests:
            memory: 32Mi
            cpu: 5m
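To actually deploy it, apply the manifest; the getting-started guide also pairs the Rollout with an ordinary Service so the demo can be reached. Both pieces below are a sketch: the file name is arbitrary and the Service simply assumes the pod label and named port declared above.
kubectl apply -f rollout.yaml # the Rollout manifest shown above
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo
spec:
  selector:
    app: rollouts-demo # matches the Rollout's pod labels
  ports:
  - port: 80
    targetPort: http # the named containerPort (8080) declared above
    protocol: TCP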
kubectl argo rollouts get rollout rollouts-demo --watch # watch the rollout as it progresses
5.1.2 Update
kubectl argo rollouts set image rollouts-demo rollouts-demo=argoproj/rollouts-demo:yellow
# The update proceeds according to the strategy defined in 5.1.1
5.1.3 Promote
# As seen above, the strategy pauses the release once 20% of traffic reaches the canary;
# this command is the manual confirmation that lets the remaining steps continue
kubectl argo rollouts promote rollouts-demo
5.1.4 Abort
# Like promote (5.1.3), a manual control command: it abandons the current update and rolls back
# to the stable version, i.e. the old version
kubectl argo rollouts abort rollouts-demo
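After an abort the Rollout keeps serving the stable version but reports a Degraded status; in the getting-started flow the usual way to return it to Healthy is to point the spec back at the stable image (the blue tag from the earlier example):
kubectl argo rollouts get rollout rollouts-demo # status shows Degraded after an abort
kubectl argo rollouts set image rollouts-demo rollouts-demo=argoproj/rollouts-demo:blue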
5.2 Integrating with Istio
5.2.1 Configuring the Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  strategy: # deployment strategy
    canary: # canary settings
      canaryService: rollouts-demo-canary # Service used for canary traffic
      stableService: rollouts-demo-stable # Service used for stable (old version) traffic
      trafficRouting: # traffic routing settings
        istio: # use Istio
          virtualServices: # the VirtualServices to manage
          - name: rollouts-demo-vsvc1
            routes: # HTTP routes
            - http-primary
            tlsRoutes: # HTTPS/TLS routes
            - port: 443 # port to match
              sniHosts: # SNI hosts allowed to match
              - reviews.bookinfo.com
              - localhost
          # same structure as above
          - name: rollouts-demo-vsvc2
            routes:
            - http-secondary
            tlsRoutes:
            - port: 443
              sniHosts:
              - reviews.bookinfo.com
              - localhost
            tcpRoutes:
            - port: 8020
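The canaryService and stableService referenced above must exist as ordinary Services selecting the Rollout's pods; at runtime the controller adds a rollouts-pod-template-hash selector to pin each one to the right ReplicaSet. A minimal sketch, assuming the app: rollouts-demo label and port 8080 from the earlier example:
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo-stable
spec:
  selector:
    app: rollouts-demo # selector is narrowed by the controller at runtime
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo-canary
spec:
  selector:
    app: rollouts-demo
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP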
5.2.2 Creating the VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rollouts-demo-vsvc1 # one of the VirtualServices referenced above
spec:
  gateways:
  - rollouts-demo-gateway # the Gateway this VirtualService is bound to
  hosts:
  - rollouts-demo-vsvc1.local # hosts allowed to access the service
  http:
  - name: http-primary # named HTTP route referenced by the Rollout
    route: # routing rules (a list)
    - destination:
        host: rollouts-demo-stable # Service name
        port:
          number: 15372
      weight: 100 # share of traffic forwarded to this Service
    - destination:
        host: rollouts-demo-canary # Service name
        port:
          number: 15372
      weight: 0 # share of traffic forwarded to this Service
  tls: # TLS routes
  - match:
    - port: 443
      sniHosts:
      - reviews.bookinfo.com
      - localhost
    route:
    - destination:
        host: rollouts-demo-stable
      weight: 100
    - destination:
        host: rollouts-demo-canary
      weight: 0
  tcp: # TCP routes
  - match:
    - port: 8020
    route:
    - destination:
        host: rollouts-demo-stable
      weight: 100
    - destination:
        host: rollouts-demo-canary
      weight: 0
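The VirtualService binds to rollouts-demo-gateway, which is not shown in this note; a minimal Istio Gateway sketch (selector, port and hosts are assumptions for a default Istio install):
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: rollouts-demo-gateway
spec:
  selector:
    istio: ingressgateway # use Istio's default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"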
5.3 Testing
Initially all traffic goes to the old (stable) version. When we modify the Rollout, the Istio configuration is updated along with it so that the traffic split follows the current canary step.
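One way to observe this (a sketch reusing the demo image tags from 5.1): trigger an update, then read back the weights that the controller writes into the VirtualService at each setWeight step:
kubectl argo rollouts set image rollouts-demo rollouts-demo=argoproj/rollouts-demo:yellow
kubectl get virtualservice rollouts-demo-vsvc1 -o yaml # the http/tls/tcp route weights now follow the current canary step
kubectl argo rollouts get rollout rollouts-demo --watch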
6. Resource reference
6.1 Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout-canary
spec:
  # Number of desired pods.
  # Defaults to 1.
  replicas: 5
  analysis:
    # Number of successful AnalysisRuns to keep in history. Defaults to 5.
    successfulRunHistoryLimit: 10
    # Number of unsuccessful AnalysisRuns to keep in history. Defaults to 5.
    unsuccessfulRunHistoryLimit: 10
  # Alternative to the inline template below: reference an existing Deployment
  # and reuse its pod template.
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rollout-ref-deployment
    # "never": the Deployment is never scaled down
    # "onsuccess": the Deployment is scaled down once the Rollout becomes healthy
    # "progressively": the Deployment is scaled down as the Rollout is scaled up
    scaleDown: never|onsuccess|progressively
  # Label selector for pods, just like in a Deployment;
  # it must match the labels of the pod template.
  selector:
    matchLabels:
      app: guestbook
  template:
    spec:
      containers:
      - name: guestbook
        image: argoproj/rollouts-demo:blue
  # How long a new pod must be Running (and ready) before it is considered
  # available and starts receiving traffic.
  minReadySeconds: 30
  # Number of old ReplicaSets to retain. Defaults to 10.
  revisionHistoryLimit: 3
  # Allows a user to manually pause the rollout at any time (e.g. via the CLI).
  paused: true
  # Maximum time in seconds for the update to make progress;
  # after this deadline (default 600s) the rollout is considered failed.
  progressDeadlineSeconds: 600
  # Whether to abort the update once the progress deadline is exceeded.
  # Defaults to false.
  progressDeadlineAbort: false
  # Timestamp at which all pods should be sequentially restarted;
  # the controller ensures every pod has a creation time greater than this value.
  restartAt: "2020-03-30T21:19:35Z"
  # Window of previous revisions that are eligible for fast rollback.
  rollbackWindow:
    revisions: 3
  # Update strategy
  strategy:
    # Blue-green deployment
    blueGreen:
      # Reference to service that the rollout modifies as the active service.
      # Required.
      activeService: active-service
      # Pre-promotion analysis run which performs analysis before the service
      # cutover. +optional
      prePromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local
      # Post-promotion analysis run which performs analysis after the service
      # cutover. +optional
      postPromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local
      # Name of the service that the rollout modifies as the preview service.
      # +optional
      previewService: preview-service
      # The number of replicas to run under the preview service before the
      # switchover. Once the rollout is resumed the new ReplicaSet will be fully
      # scaled up before the switch occurs +optional
      previewReplicaCount: 1
      # Indicates if the rollout should automatically promote the new ReplicaSet
      # to the active service or enter a paused state. If not specified, the
      # default value is true. +optional
      autoPromotionEnabled: false
      # Automatically promotes the current ReplicaSet to active after the
      # specified pause delay in seconds after the ReplicaSet becomes ready.
      # If omitted, the Rollout enters and remains in a paused state until
      # manually resumed by resetting spec.Paused to false. +optional
      autoPromotionSeconds: 30
      # Adds a delay before scaling down the previous ReplicaSet. If omitted,
      # the Rollout waits 30 seconds before scaling down the previous ReplicaSet.
      # A minimum of 30 seconds is recommended to ensure IP table propagation
      # across the nodes in a cluster.
      scaleDownDelaySeconds: 30
      # Limits the number of old RS that can run at once before getting scaled
      # down. Defaults to nil
      scaleDownDelayRevisionLimit: 2
      # Add a delay in second before scaling down the preview replicaset
      # if update is aborted. 0 means not to scale down. Default is 30 second
      abortScaleDownDelaySeconds: 30
      # Anti Affinity configuration between desired and previous ReplicaSet.
      # Only one must be specified
      antiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution: {}
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 1 # Between 1 - 100
      # activeMetadata will be merged and updated in-place into the ReplicaSet's spec.template.metadata
      # of the active pods. +optional
      activeMetadata:
        labels:
          role: active
      # Metadata which will be attached to the preview pods only during their preview phase.
      # +optional
      previewMetadata:
        labels:
          role: preview
    # Canary deployment
    canary:
      # Service that receives the canary traffic
      canaryService: canary-service
      # Service that receives the stable (old version) traffic
      stableService: stable-service
      # Metadata attached to the canary pods (similar to webhook injection)
      canaryMetadata:
        annotations:
          role: canary
        labels:
          role: canary
      # Metadata attached to the stable (old version) pods
      stableMetadata:
        annotations:
          role: stable
        labels:
          role: stable
      # Maximum number of pods that may be unavailable during the update;
      # the rollout pauses if this cannot be satisfied
      maxUnavailable: 1
      # Maximum number (or percentage) of extra pods that may be created
      # above the desired replica count during the update
      maxSurge: "20%"
      # Delay before scaling down the old ReplicaSet
      scaleDownDelaySeconds: 30
      # Optional: minimum number of pods in the ReplicaSet that receives canary
      # traffic, mainly to guarantee availability; defaults to 1
      minPodsPerReplicaSet: 2
      # Limits the number of old ReplicaSets that may keep running before being
      # scaled down
      scaleDownDelayRevisionLimit: 2
      # Background analysis: which template to use and which arguments to pass,
      # e.g. running a PromQL query against the service
      analysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local
        # Pass the pod-template-hash of the stable or the latest ReplicaSet
        - name: stable-hash
          valueFrom:
            podTemplateHashValue: Stable
        - name: latest-hash
          valueFrom:
            podTemplateHashValue: Latest
        # A label from the Rollout's metadata can also be passed to the analysis
        - name: region
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['region']
      # List of steps executed during a canary update
      steps:
      # Set the canary traffic weight to 20%
      - setWeight: 20
      # Pause the update; supported units: s, m, h
      - pause:
          duration: 1h
      # Pause indefinitely until manually promoted
      - pause: {}
      # Scale the canary to an explicit replica count instead of deriving it
      # from the traffic weight (scaling only; traffic is handled separately)
      - setCanaryScale:
          replicas: 3
      # As above, but expressed as a percentage of spec.replicas
      # (again, scaling only; the traffic weight is handled separately)
      - setCanaryScale:
          weight: 25
      # Scale the canary to match the canary traffic weight (the default)
      - setCanaryScale:
          matchTrafficWeight: true
      # The following step is only supported with Istio traffic routing
      - setHeaderRoute:
          # Must also be declared under spec.strategy.canary.trafficRouting.managedRoutes
          name: "set-header"
          # Header matching rules for this route
          match:
            # Header name
          - headerName: "version"
            # Must contain exactly one of: exact, regex or prefix
            headerValue:
              # Exact match
              exact: "2"
              # Regular expression
              regex: "2.0.(.*)"
              # Prefix match
              prefix: "2.0"
      # The following step is only supported with Istio traffic routing
      - setMirrorRoute:
          # Must also be declared under spec.strategy.canary.trafficRouting.managedRoutes
          name: "mirror-route"
          # Percentage of the matched traffic to mirror to the canary
          percentage: 100
          # Matching rules; if omitted, the route is removed.
          # Conditions inside a single match block are AND'd together,
          # while multiple match blocks are OR'd.
          match:
          - method: # match on the HTTP method: exact, regex or prefix
              exact: "GET"
              regex: "P.*"
              prefix: "POST"
            path: # match on the request path: exact, regex or prefix
              exact: "/test"
              regex: "/test/.*"
              prefix: "/"
            headers: # match on request headers: exact, regex or prefix
              agent-1b:
                exact: "firefox"
                regex: "firefox2(.*)"
                prefix: "firefox"
      # An inline analysis step
      - analysis:
          templates:
          - templateName: success-rate
      # An experiment step
      - experiment:
          duration: 1h
          templates:
          - name: baseline
            specRef: stable
            # optional, creates a service for the experiment if set
            service:
              # optional, service: {} is also acceptable if name is not included
              name: test-service
          - name: canary
            specRef: canary
            # optional, set the weight of traffic routed to this version
            weight: 10
          analyses:
          - name: mann-whitney
            templateName: mann-whitney
            # Metadata which will be attached to the AnalysisRun.
            analysisRunMetadata:
              labels:
                app.service.io/analysisType: smoke-test
              annotations:
                link.argocd.argoproj.io/external-link: http://my-loggin-platform.com/pre-generated-link
      # Anti-affinity between the new and the previous ReplicaSet;
      # only one of the two fields may be specified
      antiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution: {}
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 1 # Between 1 - 100
      # Traffic routing via an ingress controller or a service mesh such as Istio
      trafficRouting:
        # Total traffic weight (only supported by the nginx traffic router)
        maxTrafficWeight: 1000
        # Routes that steps such as setHeaderRoute/setMirrorRoute may manage;
        # they are created in this order, earlier entries taking precedence
        managedRoutes:
        - name: set-header
        - name: mirror-route
        # Istio traffic settings
        istio:
          # A single VirtualService
          virtualService:
            name: rollout-vsvc # required: name of the VirtualService
            routes:
            - primary # optional if the VirtualService has a single route, required otherwise
          # Or multiple VirtualServices
          virtualServices:
          - name: rollouts-vsvc1 # name
            routes:
            - primary
          - name: rollouts-vsvc2
            routes:
            - secondary
        # NGINX Ingress Controller routing configuration
        nginx:
          # Either stableIngress or stableIngresses must be configured, but not both.
          stableIngress: primary-ingress
          stableIngresses:
          - primary-ingress
          - secondary-ingress
          - tertiary-ingress
          annotationPrefix: customingress.nginx.ingress.kubernetes.io # optional
          additionalIngressAnnotations: # optional
            canary-by-header: X-Canary
            canary-by-header-value: iwantsit
        # ALB Ingress Controller routing configuration
        alb:
          ingress: ingress # required
          servicePort: 443 # required
          annotationPrefix: custom.alb.ingress.kubernetes.io # optional
        # Service Mesh Interface routing configuration
        smi:
          rootService: root-svc # optional
          trafficSplitName: rollout-example-traffic-split # optional
      # Add a delay in second before scaling down the canary pods when update
      # is aborted for canary strategy with traffic routing (not applicable for basic canary).
      # 0 means canary pods are not scaled down. Default is 30 seconds.
      abortScaleDownDelaySeconds: 30
status:
  pauseConditions:
  - reason: StepPause
    startTime: 2019-10-00T1234
  - reason: BlueGreenPause
    startTime: 2019-10-00T1234
  - reason: AnalysisRunInconclusive
    startTime: 2019-10-00T1234
6.1.1 Canary
A canary release means that while updating a workload, part of the traffic is switched to the new version for observation; once the required conditions are met, the rolling update continues until everything has been moved to the new version. In a Rollout, changing the template field triggers the canary rules we defined.
# How it works
The setWeight field specifies the percentage of traffic that should be sent to the canary, and the pause step tells the rollout to pause. When the controller reaches a pause step, it adds a PauseCondition entry to the .status.pauseConditions field. If the pause step has a duration set, the rollout does not progress to the next step until it has waited for that duration; otherwise it waits indefinitely until the pause condition is removed.
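A quick way to see this mechanism on a live object (a sketch using the rollouts-demo Rollout from 5.1; plain kubectl is enough for reading the status field):
kubectl get rollout rollouts-demo -o jsonpath='{.status.pauseConditions}' # non-empty while a pause step is active
kubectl argo rollouts promote rollouts-demo # clears the pause condition and moves on to the next step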
6.1.1.1 Basic mode
In basic mode the traffic share grows together with the ratio of canary replicas.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15.4
        ports:
        - containerPort: 80
  minReadySeconds: 30
  revisionHistoryLimit: 3
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
      - setWeight: 10 # percentage of traffic for the canary version
      - pause: # pause
          duration: 1h # 1 hour; seconds (the default unit, s) and minutes (m) are also supported; not meant for keeping two versions around long-term
      - setWeight: 20
      - pause: {} # pause until manually promoted
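Because the last step above is pause: {}, the rollout stays at the 20% step until it is promoted manually, with the same command as in 5.1.3:
kubectl argo rollouts promote example-rollout # resume a rollout paused by pause: {}
kubectl argo rollouts promote example-rollout --full # or skip all remaining steps at once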
6.1.1.2 Dynamic canary scale mode
spec:
  strategy:
    canary:
      steps:
      - setCanaryScale: # control scale and weight independently
          replicas: 3 # explicit canary replica count, without changing the traffic weight
      - setCanaryScale:
          weight: 25 # canary scale as a percentage of spec.replicas, without changing the traffic weight
      - setCanaryScale:
          matchTrafficWeight: true # scale the canary to match the traffic weight again
      - setWeight: 90 # without matchTrafficWeight, 90% of traffic would hit only the canary pods created above; with it, subsequent setWeight steps also scale the canary to match the traffic weight
6.1.1.3 Dynamic stable scale mode
By default, during a canary update (with traffic routing) the old version keeps 100% of its replicas. The advantage is that on a failed release traffic can be switched back immediately without any start-up delay, at the cost of running extra replicas. This option scales the stable version down dynamically so the total pod count stays roughly constant, which matters when several applications scaling at once would leave the nodes without spare capacity; for example, with 10 replicas and a canary weight of 60%, the stable ReplicaSet is scaled down to roughly 4 pods while the canary runs 6.
spec:
  strategy:
    canary:
      dynamicStableScale: true # enable this mode
      abortScaleDownDelaySeconds: 600 # how long (in seconds) the canary pods are kept after an abort while traffic shifts back to the stable version
6.1.1.4 Rolling update mode
As with a Deployment, maxSurge and maxUnavailable control how many extra pods may be created per step and how many pods may be unavailable; for example, with 10 replicas, maxSurge: "25%" allows a few extra pods during each step while maxUnavailable: 0 keeps all old pods serving until new ones are ready.
strategy:
  canary:
    maxSurge: "25%"
    maxUnavailable: 0
6.1.1.5 analysis
# Optional: configure analysis so that metrics are evaluated during the rollout; if the conditions are not met the rollout is rolled back
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout-canary
spec:
  replicas: 5 # number of replicas, defaults to 1
  analysis:
    successfulRunHistoryLimit: 10 # how many successful AnalysisRuns to keep
    unsuccessfulRunHistoryLimit: 10 # how many unsuccessful ones to keep; both default to 5
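The block above only tunes how many AnalysisRuns are kept. The analysis itself is wired in under the canary strategy, typically as a background run started from a given step; a sketch reusing the success-rate template and service-name argument from 6.1 (startingStep is optional):
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2 # begin the background analysis only after step 2
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local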
6.1.1.6 antiAffinity
# Anti-affinity configuration between the new and old ReplicaSets
6.1.1.7 canaryService
# References a Service that sends traffic only to the canary version
6.1.1.8 stableService
# Likewise, a Service that sends traffic only to the stable (old) version, effectively keeping the two versions separated
6.1.1.9 maxSurge
# Maximum number or percentage of extra pods per rolling-update step, defaults to 25%
6.1.1.10 maxUnavailable
# Maximum number or percentage of pods that may be unavailable during the update, defaults to 25%
6.1.1.11 trafficRouting
Traffic routing configuration: selects the traffic router (ingress controller or service mesh) that the canary weights are applied to.
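A minimal sketch of this field for the NGINX ingress controller, using the same placeholder names as 6.1 (the Istio variant shown in 5.2 works the same way):
spec:
  strategy:
    canary:
      canaryService: canary-service
      stableService: stable-service
      trafficRouting:
        nginx:
          stableIngress: primary-ingress # existing Ingress that points at the stable Service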