Kubernetes Scheduling in Detail

This post covers how Pods get scheduled onto nodes.

Kubernetes scheduling

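1. Scheduling basics and workflow

Scheduling decides which node a Pod runs on. The work is done by the kube-scheduler component.
Because scheduling applies to the Pod as a whole, the scheduling fields (nodeName, nodeSelector, tolerations, affinity and so on) sit in the Pod spec at the same level as containers.
The scheduling workflow:
- Filter out the nodes that do not meet the Pod's requirements
- Score and rank the nodes that passed filtering
- Pick the node with the highest score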

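2. Scheduling strategies

My test cluster has 3 machines but only one worker node, so I removed the taint from master02 and let it act as a second worker node.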

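1. NodeName

Schedules the Pod by node name: the Pod is bound directly to the named node.
This is rarely used, because workloads should normally be spread across nodes rather than all pinned to a single one.
If the resource is a Deployment, every replica ends up on that one node, which is not what you want.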

[root@master01 test]# cat n3.yml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: n3
  name: n3
spec:
  nodeName: "master02"   # nodeName: the node to schedule onto
  containers:
  - image: nginx
    name: n3
    resources: {}

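2. NodeSelector

Schedules the Pod by node labels.
Labels are only checked at scheduling time: if a label is removed after the Pod is already running on the node, the Pod stays where it is. If the Pod is deleted and created again, it is scheduled again against the current labels.
This is the commonly used approach.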

[root@master01 test]# cat n3.yml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: n3
  name: n3
spec:
  nodeSelector:
    aa: "123"
    bb: "567"  # the node must carry both labels
  containers:
  - image: nginx
    name: n3
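
For the selector above to match, some node has to carry both labels. As a rough sketch, this is what the relevant part of the Node object would look like, assuming the labels were added to the worker named node (for example with kubectl label node node aa=123 bb=567). Which node carries the labels is an assumption for illustration.

```yaml
# Excerpt of a Node object, as shown by `kubectl get node node -o yaml`.
# Assumption: labels aa=123 and bb=567 were added to the node named "node".
apiVersion: v1
kind: Node
metadata:
  name: node
  labels:
    aa: "123"   # matched by nodeSelector aa: "123"
    bb: "567"   # matched by nodeSelector bb: "567"
```

Label values are always strings, which is why the numbers are quoted both here and in the nodeSelector.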

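3. Taints and tolerations

Taints are set on nodes, in key=value:effect form.
- A taint has one of three effects
  - PreferNoSchedule: the scheduler tries to avoid the node, but Pods can still land on it
  - NoSchedule: new Pods are not scheduled onto the node; Pods already running are not evicted
  - NoExecute: new Pods are not scheduled, and Pods already running are evicted
- A Pod that does not tolerate a node's NoSchedule or NoExecute taint is not scheduled onto that node
Tolerations are set on Pods.
- If a node has several taints, a Pod must tolerate every one of them (strictly, every NoSchedule and NoExecute one) to be scheduled there
- A toleration only permits scheduling; if an untainted node is also available, the Pod may land on either node
- To make sure the Pod lands on the tainted node, combine the toleration with a node selector
Taints and tolerations are usually combined with node selectors to dedicate machines to particular workloads.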
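
To illustrate the multiple-taint case, here is a minimal sketch of a Pod that tolerates two taints and also pins itself to the tainted node with a node selector. The taints team=a:NoSchedule and disk=ssd:NoExecute, the node label team=a and the Pod name are all hypothetical.

```yaml
# Assumption: the target node carries the taints team=a:NoSchedule and
# disk=ssd:NoExecute, plus the label team=a. All of these are made up.
apiVersion: v1
kind: Pod
metadata:
  name: multi-toleration
spec:
  tolerations:            # one list item per taint to tolerate
  - key: "team"
    operator: "Equal"
    value: "a"
    effect: "NoSchedule"
  - key: "disk"
    operator: "Equal"
    value: "ssd"
    effect: "NoExecute"
  nodeSelector:           # the tolerations only permit the node; this selects it
    team: "a"
  containers:
  - name: nginx
    image: nginx
```

Without the nodeSelector the Pod could still be placed on any untainted node.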

# Add a taint

kubectl taint node node qq=13:NoExecute

# View taints
kubectl describe nodes node1 | grep Taints


# Remove a taint
kubectl taint nodes node1 key1:NoSchedule-

# Or give only the key followed by "-", which removes every taint with that key
kubectl taint node node qq-

Toleration test

# With the taint in place, a Pod without a toleration is not scheduled onto the node named node

[root@master01 test]# kubectl taint nodes node qq=123:NoSchedule
node/node tainted

[root@master01 test]# cat n3.yml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: n3
  name: n3
spec:
  nodeSelector:
    aa: "123"  # node selector; targets the node named node
  containers:
  - image: nginx
    name: n3

# The Pod is stuck in Pending
[root@master01 test]# kubectl get pod 
NAME   READY   STATUS    RESTARTS   AGE
n3     0/1     Pending   0          52s

# The Pod's events (kubectl describe pod n3) show it cannot be scheduled because of the taint

  Warning  FailedScheduling  6s    default-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {qq: 123}, that the pod didn't tolerate.

# Add a toleration for the taint

[root@master01 test]# cat n3.yml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: n3
  name: n3
spec:
  tolerations: # tolerations field
  - key: "qq"  # tolerates the taint qq=123:NoSchedule; tolerations is a list, so add more "-" items to tolerate more taints
    operator: "Equal"
    value: "123"
    effect: NoSchedule
  nodeSelector:
    aa: "123"
  containers:
  - image: nginx
    name: n3

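The operator field in detail
- Equal: the key and value must both match the taint
- Exists: only the key has to exist; the value is ignored and must be left out
Network plugins have to run on every node regardless of its taints, so they use the Exists operator. With operator: Exists and no key, value or effect, a toleration matches every taint.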
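
The two uses of Exists can be sketched as follows. The first Pod tolerates any NoSchedule taint whose key is qq, whatever its value; the second tolerates every taint, which is the pattern described for network plugins. The Pod names are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerate-key      # hypothetical name
spec:
  tolerations:
  - key: "qq"             # any value of qq is tolerated
    operator: "Exists"    # no value field is allowed with Exists
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: tolerate-all      # hypothetical name
spec:
  tolerations:
  - operator: "Exists"    # no key, value or effect: matches every taint
  containers:
  - name: nginx
    image: nginx
```

The second form also tolerates NoExecute taints, so it is best reserved for components that really must run everywhere.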

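4. Node Affinity

A more capable version of the node selector. A node selector can only AND exact label matches together; node affinity can express AND, OR and NOT.
It runs Pods on nodes that carry particular labels and supports richer label conditions.
Listing several labels looks like the node selector and behaves the same way: the conditions are ANDed.
- It can also express OR
- For example, disk=hdd|ssd (either value is acceptable), which the node selector cannot express because it only supports AND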
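
The disk example can be written with a single In expression that lists both values. This is a minimal sketch that assumes the intended values are hdd and ssd and that nodes carry a label named disk; the label and the Pod name are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-or                  # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "disk"          # hypothetical node label
            operator: In
            values:              # a node matches if its value is any one of these
            - "hdd"
            - "ssd"
  containers:
  - name: nginx
    image: nginx
```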

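1. Hard policy (must be satisfied)

Within a single matchExpressions list, multiple keys are ANDed: a node must carry all of the labels to be eligible. Separate entries under nodeSelectorTerms are ORed.
If no node satisfies the rule, the Pod stays Pending.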

[root@master01 test]# cat nodeaff.yml 
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: nodeaff
  name: nodeaff
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nodeaff
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nodeaff
    spec:
      affinity: # affinity
        nodeAffinity:  # node affinity
          requiredDuringSchedulingIgnoredDuringExecution:  # hard (required) affinity
            nodeSelectorTerms:
            - matchExpressions:
              - key: "app1"  # first label
                values: 
                - "l1"
                operator: In
              - key: "app2"  # second label; ANDed with the first because both sit under the same matchExpressions, so both must match
                values:
                - "l2"
                operator: In
      containers:
      - image: nginx
        name: nginx
        resources: {}

Logical OR


    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:  # separate nodeSelectorTerms entries are ORed
              - key: "app1"
                values: 
                - "l1"
                operator: In
            - matchExpressions:
              - key: "app2"
                values:
                - "l2"
                operator: In

# A node with app1=l1 or app2=l2 is eligible

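2. Soft policy (satisfied when possible)

The scheduler prefers matching nodes, but if no node carries the label the Pod is still scheduled somewhere, so placement is not guaranteed.
Experiment: no node has the label, and the Pods are still scheduled onto the available nodes.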

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: "app3"  # prefer nodes labeled app3=l3
                values:
                - "l3"
                operator: In 
            weight: 1


# None of my nodes has this label, but the Pods are still scheduled

[root@master01 test]# kubectl get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
nodeaff-656458468d-m42k6   1/1     Running   0          61s   10.246.73.135   node       <none>           <none>
nodeaff-656458468d-zhfx8   1/1     Running   0          58s   10.244.59.194   master02   <none>           <none>
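
The weight field matters once there is more than one preference. Here is a minimal sketch with two preferences, reusing the app3=l3 and app1=l1 labels from the examples above; the Pod name and the weights 80 and 20 are arbitrary choices for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: soft-weights             # hypothetical name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80               # allowed range is 1-100; a higher weight counts for more
        preference:
          matchExpressions:
          - key: "app3"
            operator: In
            values:
            - "l3"
      - weight: 20
        preference:
          matchExpressions:
          - key: "app1"
            operator: In
            values:
            - "l1"
  containers:
  - name: nginx
    image: nginx
```

A node matching both preferences gets both weights added to its score. A node matching neither is still eligible.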

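3. Operators in detail

- In: the label value is in the list; any one value matches (OR)
- NotIn: the label value is not in the list
- Exists: the label key exists; the value is ignored
- DoesNotExist: the label key does not exist
- Gt: the label value, parsed as an integer, is greater than the given number
- Lt: the label value, parsed as an integer, is less than the given number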
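
As a sketch of how the less common operators look in a manifest, the Pod below combines NotIn, Exists and Gt in one term, so all three must hold. The labels env, gpu and cpu-cores and the Pod name are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: operator-demo            # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:      # expressions in one list are ANDed
          - key: "env"           # hypothetical label: value must not be "test"
            operator: NotIn
            values:
            - "test"
          - key: "gpu"           # hypothetical label: only the key has to exist
            operator: Exists     # Exists and DoesNotExist take no values
          - key: "cpu-cores"     # hypothetical label holding an integer
            operator: Gt
            values:
            - "8"                # Gt and Lt take exactly one value, written as a string
  containers:
  - name: nginx
    image: nginx
```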

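5. Pod Affinity

The match is against Pod labels rather than node labels: the new Pod is scheduled into the same topology domain as the Pods that carry the given labels.
The topologyKey field names a node label used to group nodes: nodes with the same value for that label form one group (topology domain), and affinity and anti-affinity are evaluated per group.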

# Pod carrying the label

apiVersion: v1
kind: Pod
metadata:
  labels:
    app1: p1  # label app1=p1
  name: pod1
spec:
  containers:
  - image: nginx
    name: nginx

# Deployment scheduled next to it

    spec:
      affinity:
        podAffinity:
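          requiredDuringSchedulingIgnoredDuringExecution:  # hard (required) affinity
          - topologyKey: kubernetes.io/arch  # group nodes by CPU architecture; both nodes share one architecture, so they form a single group and the affinity rule is evaluated inside it
            labelSelector:
              matchExpressions:
              - key: "app1"  # schedule into the group that runs a Pod labeled app1=p1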
                values:
                - "p1"
                operator: In
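
With kubernetes.io/arch as the topology key, both nodes fall into one group, so the rule does not narrow placement much. A sketch of the stricter variant uses kubernetes.io/hostname, which makes every node its own group and so puts the new Pod on the same node as pod1. The name pod2 is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2                              # hypothetical name
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname   # one group per node
        labelSelector:                        # only Pods in this Pod's namespace are considered by default
          matchLabels:
            app1: p1                          # must share a node with a Pod labeled app1=p1
  containers:
  - name: nginx
    image: nginx
```

If no running Pod carries app1=p1, this Pod stays Pending because the rule is a hard one.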

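6. Anti-Affinity

Anti-affinity is usually used as a soft policy, in case a matching Pod is already running on every node.
Making a Pod anti-affine to its own label spreads the replicas out. With 4 nodes and 6 Pods you need the soft policy to spread them as far as possible; with the hard policy only 4 can be placed and the other 2 stay Pending.
kubernetes.io/hostname works as the topology key because every node has that label.

# 2 replicas, placed on different nodes

# Anti-affinity spreads them across nodes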

[root@master01 test]# cat podaff.yml 
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: podaff
  name: podaff
spec:
  replicas: 2
  selector:
    matchLabels:
      app: podaff
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:   # the Pod label is app=podaff
        app: podaff
    spec:
      affinity:
        podAntiAffinity:   # anti-affinity
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname  # group by hostname; every node has a different hostname, so the 2 nodes are 2 groups
            labelSelector:
              matchExpressions:
              - key: "app"  # anti-affine to its own label, so the 2 Pods never share a node
                values:
                - "podaff"
                operator: In
      containers:
      - image: nginx
        name: nginx

# They run on different nodes
[root@master01 test]# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP              NODE       NOMINATED NODE   READINESS GATES
podaff-6d54f6df74-c47tn   1/1     Running   0          3m9s   10.246.73.140   node       <none>           <none>
podaff-6d54f6df74-lrpq7   1/1     Running   0          3m9s   10.244.59.195   master02   <none>           <none>

This is the most common setup: spreading the replicas across nodes so that losing one node does not take the whole service down.
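
The example above uses the hard policy. For the case where there are more replicas than nodes, here is a minimal sketch of the soft variant. The name podaff-soft, the replica count of 6 and the weight of 100 are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podaff-soft                # hypothetical name
spec:
  replicas: 6                      # more replicas than nodes in a small cluster
  selector:
    matchLabels:
      app: podaff-soft
  template:
    metadata:
      labels:
        app: podaff-soft
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # soft: spread when possible
          - weight: 100                                      # 1-100
            podAffinityTerm:                                 # soft rules wrap the term in podAffinityTerm
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: podaff-soft                           # anti-affine to its own label
      containers:
      - name: nginx
        image: nginx
```

The scheduler still tries to keep replicas apart, but once every node holds one, the remaining replicas are placed anyway instead of staying Pending.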

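7. Topology spread constraints

Reportedly, this feature was open-sourced by Tencent Cloud and contributed to the Kubernetes community; it is built into Kubernetes as the topologySpreadConstraints field.
It also works through a node label: nodes are grouped by the value of that label, and Pods are spread across the groups.
The use case is disaster tolerance across availability zones. With zones a and b and 5 Pods, 2 run in zone a and 3 in zone b.
- When the constraint cannot be satisfied, the Pod is not scheduled (with whenUnsatisfiable: DoNotSchedule)
- maxSkew is the largest allowed difference in matching Pod count between groups
- With maxSkew 1 the counts in different zones may differ by at most 1: 2 in zone a and 3 in zone b gives 3-2=1
The result is a more even spread of Pods.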

[root@master01 test]# cat podaff.yml 
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: podaff
  name: podaff
spec:
  replicas: 2
  selector:      
    matchLabels:
      app: podaff 
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: podaff
    spec:
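      topologySpreadConstraints:  # topology spread constraints
      - maxSkew: 1     # Pod counts between groups may differ by at most 1
        topologyKey: kubernetes.io/hostname  # group by hostname; different hostnames, so 2 groups (2 nodes)
        whenUnsatisfiable: DoNotSchedule  # do not schedule if the constraint cannot be met; the alternative is ScheduleAnyway
        labelSelector:
          matchLabels:
            app: podaff  # Pods with this label are counted per group, and the counts may differ by at most 1
      - maxSkew: 1
        topologyKey: kubernetes.io/arch  # group by CPU architecture (in practice this is usually a zone label, not architecture); both nodes share one architecture, so they count as one group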
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: podaff
      containers:
      - image: nginx
        name: nginx

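# node1: 0 Pods, node2: 0 Pods
# amd64: 0 Pods

# Scheduling starts

# node1: 1 Pod, node2: 0 Pods
# amd64: 1 Pod

# One more Pod to place. Putting it on node1 would make the skew 2-0=2, which violates maxSkew, so it has to go to node2

# node1: 1 Pod, node2: 1 Pod, skew 1-1=0
# amd64: 2 Pods in the only group, skew 2-2=0

# Constraints satisfied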

In production, spread by both hostname and availability zone so that Pods are distributed evenly.
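
A sketch of that production layout, assuming the nodes carry the well-known topology.kubernetes.io/zone label. Cloud providers usually set it, but on a self-built cluster it may have to be added by hand. The name spread-zone and the replica count are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-zone                # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: spread-zone
  template:
    metadata:
      labels:
        app: spread-zone
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # assumption: nodes are labeled with their zone
        whenUnsatisfiable: DoNotSchedule           # hard: zones may differ by at most 1 Pod
        labelSelector:
          matchLabels:
            app: spread-zone
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway          # soft: prefer even nodes, but never block scheduling
        labelSelector:
          matchLabels:
            app: spread-zone
      containers:
      - name: nginx
        image: nginx
```

With two zones this gives the 2 and 3 split described earlier. Keeping the hostname constraint soft avoids Pending Pods when a zone has few nodes.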

posted @ 2026-03-28 16:21 乔的港口
