[K8S] Kubernetes Scheduler Deep Dive: Principles and Source Code Analysis

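1. Scheduler Architecture Overview
- 1.1 Core Architecture Design
- 1.2 Scheduler Workflow
2. Scheduling Queue Mechanism
- 2.1 Priority Queue Implementation
- 2.2 Pod Priority and Preemption
3. Scheduling Framework and Plugin System
- 3.1 Framework Extension Points
- 3.2 Plugin Registration and Execution
4. Detailed Analysis of the Scheduling Cycle
5. Binding Mechanism and Concurrency Control
- 5.1 The Binding Process in Detail
- 5.2 Cache Consistency Design
6. Source Code of Advanced Scheduling Features
- 6.1 Inter-Pod Affinity and Anti-Affinity
- 6.2 Topology Spread Constraints
7. Performance Optimization Mechanisms
- 7.1 Scheduler Performance Analysis
- 7.2 Extender Support
8. Scheduler Configuration and Extension
- 8.1 Scheduler Configuration in Detail
- 8.2 Custom Plugin Development
9. Scheduler Evolution and Best Practices
- 9.1 Evolution of the Scheduler
- 9.2 Best Practices
10. Summary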

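1. Scheduler Architecture Overview

The Kubernetes scheduler is a core control-plane component responsible for assigning newly created Pods to suitable nodes. Its design combines a producer-consumer model with a plugin architecture, which makes it both extensible and flexible. The scheduler watches the API Server for state changes, picks up unscheduled Pods, runs each one through a multi-stage decision process, and binds it to the best available node.

1.1 Core Architecture Design

The scheduler's main components are:

Scheduling queue: holds Pods waiting to be scheduled and orders them by priority
Scheduler cache: maintains a snapshot of cluster state to speed up scheduling
Scheduling framework: provides plugin extension points so that scheduling logic stays modular
Scheduling algorithm: the core decision logic, split into a filtering phase and a scoring phase

In the source code, the scheduler's main struct is defined as follows: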

// pkg/scheduler/scheduler.go
type Scheduler struct {
    SchedulerCache internalcache.Cache  // scheduler cache
    Algorithm core.ScheduleAlgorithm     // scheduling algorithm
    Profiles profile.Map                 // scheduling profiles
    client clientset.Interface           // Kubernetes API client
    // ...
}

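1.2 Scheduler Workflow

The scheduler's workflow can be divided into the following key stages:

Watch and enqueue: watch the API Server for Pod creation events and add unscheduled Pods to the queue
Scheduling cycle: take a Pod from the queue and make a scheduling decision
- Filtering phase: select the nodes that satisfy the Pod's requirements
- Scoring phase: score the candidate nodes
Binding cycle: bind the Pod to the selected node
Post-processing: run cleanup work after binding

flowchart TD
    A[Watch API Server] --> B[Scheduling queue]
    B --> C[Scheduling cycle]
    C --> D[Filtering phase]
    D --> E[Scoring phase]
    E --> F[Node selection]
    F --> G[Binding cycle]
    G --> H[Bind via API]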

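2. Scheduling Queue Mechanism

2.1 Priority Queue Implementation

The scheduler manages pending Pods with a priority queue built from several sub-queues, so that higher-priority Pods are scheduled first. The implementation lives in pkg/scheduler/internal/queue/scheduling_queue.go: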

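type PriorityQueue struct {
    activeQ *heap.Heap                // active queue (ordered by priority)
    podBackoffQ *heap.Heap            // backoff queue (Pods that failed scheduling)
    unschedulablePods *UnschedulablePodsMap // Pods currently considered unschedulable
    // ...
}

// Simplified excerpt: locking and metrics are omitted.
func (p *PriorityQueue) Add(pod *v1.Pod) error {
    podInfo := p.newQueuedPodInfo(pod)
    if p.unschedulablePods.get(pod) != nil {
        // Remove the Pod from the unschedulable set
    }
    if p.podBackoffQ.Get(pod) != nil {
        // Remove the Pod from the backoff queue
    }
    if err := p.activeQ.Add(podInfo); err != nil {
        return err
    }
    // ...
    return nil
}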

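Key properties of queue management:

Priority ordering: Pods are ordered by the priority value resolved from their PriorityClass, with ties broken by enqueue time
Backoff: Pods that fail scheduling enter the backoff queue, which prevents rapid retries
Unschedulable handling: Pods that cannot be scheduled for now are stored separately
Event-driven requeueing: unschedulable Pods are re-evaluated when cluster state changes, for example when node resources change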

2.2 Pod Priority and Preemption

Kubernetes implements a priority system based on PriorityClass, which allows a high-priority Pod to preempt resources held by lower-priority Pods:

// pkg/scheduler/framework/plugins/queuesort/priority_sort.go
func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
    p1 := corev1helpers.PodPriority(pInfo1.Pod)
    p2 := corev1helpers.PodPriority(pInfo2.Pod)
    // A larger value means higher priority; ties go to the Pod queued earlier
    return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
}

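Preemption flow:

When a high-priority Pod cannot be scheduled, the scheduler looks for lower-priority Pods that can be sacrificed
The lower-priority Pods on the chosen node are evicted
The high-priority Pod is nominated for that node and scheduled there in a later cycle, once the victims have terminated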
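The first step of this flow, choosing victims on one node, can be sketched as follows. This is a deliberately reduced illustration that considers CPU only; the `runningPod` type and `pickVictims` function are hypothetical, and the upstream preemption logic weighs further factors that are left out here.

```go
package main

import (
	"fmt"
	"sort"
)

// runningPod is a hypothetical, simplified view of a Pod on a node.
type runningPod struct {
	name     string
	priority int32
	milliCPU int64
}

// pickVictims returns the Pods to evict so that `needed` milliCPU fits on a
// node that currently has `free` milliCPU available. Only Pods with a
// priority lower than preemptorPriority are candidates, and the lowest
// priorities are evicted first. ok is false when evicting every candidate
// would still not free enough CPU.
func pickVictims(pods []runningPod, free, needed int64, preemptorPriority int32) (victims []string, ok bool) {
	candidates := make([]runningPod, 0, len(pods))
	for _, p := range pods {
		if p.priority < preemptorPriority {
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].priority < candidates[j].priority
	})
	for _, p := range candidates {
		if free >= needed {
			break
		}
		free += p.milliCPU
		victims = append(victims, p.name)
	}
	return victims, free >= needed
}

func main() {
	pods := []runningPod{
		{"cache", 50, 500},
		{"batch", 0, 400},
		{"db", 1000, 800},
	}
	victims, ok := pickVictims(pods, 100, 600, 100)
	fmt.Println(victims, ok) // [batch cache] true
}
```

The `db` Pod is never a candidate because its priority is higher than the preemptor's.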

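3. Scheduling Framework and Plugin System

The scheduling framework, introduced in Kubernetes 1.15, breaks the scheduling process into multiple extension points and makes it highly modular.

3.1 Framework Extension Points

The scheduling framework defines roughly a dozen extension points (the exact number varies by release) that cover the whole scheduling lifecycle. The main ones are: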

type Framework interface {
    RunPreFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod) *Status
    RunFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
    RunPostFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) *Status
    RunPreScorePlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodes []*v1.Node) *Status
    RunScorePlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodes []*v1.Node) (PluginToNodeScores, *Status)
    RunPreBindPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status
    RunBindPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status
    RunPostBindPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string)
    // ...
}

3.2 Plugin Registration and Execution

Built-in plugins are registered in pkg/scheduler/framework/plugins/registry.go:

func NewRegistry() Registry {
    return Registry{
        // Queue sort plugin (PrioritySort lives in the queuesort package)
        queuesort.Name: queuesort.New,
        // Filter plugins
        noderesources.Name: noderesources.NewFit,
        nodeaffinity.Name: nodeaffinity.New,
        tainttoleration.Name: tainttoleration.New,
        // Score plugins (each map key must be unique)
        noderesources.BalancedAllocationName: noderesources.NewBalancedAllocation,
        imagelocality.Name: imagelocality.New,
        interpodaffinity.Name: interpodaffinity.New,
        podtopologyspread.Name: podtopologyspread.New,
    }
}

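Characteristics of plugin execution:

Ordered execution: at each extension point, plugins run in the order in which they are configured in the profile (the registry itself is a map and has no order)
Short-circuiting: once a filter plugin rejects a node, the remaining filters for that node are skipped
Weighted aggregation: scores from score plugins are combined according to their weights
State passing: plugins exchange data through CycleState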
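Short-circuiting and state passing can be demonstrated without the Kubernetes modules. The sketch below uses hypothetical stand-ins (`cycleState`, `filterPlugin`, `runFilters`) that only mirror the shape of the framework types.

```go
package main

import "fmt"

// cycleState is a hypothetical stand-in for framework.CycleState:
// a per-scheduling-cycle key/value store shared by plugins.
type cycleState map[string]any

// filterPlugin is a simplified filter: it returns "" on success,
// or the reason why the node was rejected.
type filterPlugin struct {
	name   string
	filter func(state cycleState, node string) string
}

// runFilters runs plugins in their configured order and stops at the first
// failure. It returns the names of the plugins that actually ran and the
// rejection reason ("" if every plugin accepted the node).
func runFilters(plugins []filterPlugin, state cycleState, node string) (ran []string, reason string) {
	for _, pl := range plugins {
		ran = append(ran, pl.name)
		if reason = pl.filter(state, node); reason != "" {
			return ran, reason // short-circuit: later plugins are skipped
		}
	}
	return ran, ""
}

func main() {
	plugins := []filterPlugin{
		{"Resources", func(s cycleState, node string) string {
			s["checkedBy"] = "Resources" // data left for later plugins
			if node == "small-node" {
				return "insufficient cpu"
			}
			return ""
		}},
		{"Affinity", func(s cycleState, node string) string {
			if s["checkedBy"] != "Resources" {
				return "missing state"
			}
			return ""
		}},
	}
	fmt.Println(runFilters(plugins, cycleState{}, "small-node"))
	fmt.Println(runFilters(plugins, cycleState{}, "big-node"))
}
```

For `small-node` only the first plugin runs; for `big-node` the second plugin reads the value the first one stored.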

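4. Detailed Analysis of the Scheduling Cycle

4.1 Scheduling Algorithm Entry Point

In older releases, the core scheduling logic lives in pkg/scheduler/core/generic_scheduler.go: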

func (g *genericScheduler) Schedule(ctx context.Context, extenders []framework.Extender, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
    // 1. Work from the node snapshot; give up early if it is empty
    if g.nodeInfoSnapshot.NumNodes() == 0 {
        return result, ErrNoNodesAvailable
    }
    
    // 2. Run PreFilter plugins
    status := g.framework.RunPreFilterPlugins(ctx, state, pod)
    if !status.IsSuccess() {
        return result, status.AsError()
    }
    
    // 3. Filtering phase
    feasibleNodes, err := g.findNodesThatFit(ctx, state, pod)
    if err != nil {
        return result, err
    }
    
    // 4. PostFilter (preemption) when no node fits.
    // Simplified: upstream returns a FitError here and the caller runs the PostFilter plugins.
    if len(feasibleNodes) == 0 {
        feasibleNodes, err = g.runPostFilterPlugins(ctx, state, pod, feasibleNodes)
        if err != nil {
            return result, err
        }
    }
    
    // 5. Run PreScore plugins
    if status := g.framework.RunPreScorePlugins(ctx, state, pod, feasibleNodes); !status.IsSuccess() {
        return result, status.AsError()
    }
    
    // 6. Scoring phase
    priorityList, err := g.prioritizeNodes(ctx, state, pod, feasibleNodes)
    if err != nil {
        return result, err
    }
    
    // 7. Pick the best node
    host, err := g.selectHost(priorityList)
    if err != nil {
        return result, err
    }
    
    return ScheduleResult{SuggestedHost: host}, nil
}

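4.2 The Filtering Phase in Depth

The filtering phase checks nodes in parallel, running the filter plugins in sequence for each node, so that large clusters are handled efficiently: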

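func (g *genericScheduler) findNodesThatFit(ctx context.Context, state *framework.CycleState, pod *v1.Pod) ([]*v1.Node, error) {
    allNodes := g.nodeInfoSnapshot.ListNodes()
    // Preallocate one slot per node: checkNode runs concurrently, so a plain
    // append to a shared slice would be a data race.
    feasibleNodes := make([]*v1.Node, len(allNodes))
    var feasibleNodesLen int32
    
    // Closure that checks a single node
    checkNode := func(i int) {
        nodeInfo := allNodes[i]
        status := g.framework.RunFilterPlugins(ctx, state, pod, nodeInfo)
        if status.IsSuccess() {
            // Atomically reserve the next free slot (requires sync/atomic)
            length := atomic.AddInt32(&feasibleNodesLen, 1)
            feasibleNodes[length-1] = nodeInfo.Node()
        }
    }
    
    // Check nodes in parallel
    parallelize.Until(ctx, len(allNodes), checkNode)
    return feasibleNodes[:feasibleNodesLen], nil
}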

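Key optimizations in parallel processing:

Chunked work distribution: a fixed pool of worker goroutines pulls chunks of node indices, so idle workers keep picking up remaining work
Batching: each chunk covers several nodes, which reduces coordination overhead and context switches
Lock avoidance: results are collected with atomic operations and preallocated slices rather than a shared mutex
Memory reuse: buffers are preallocated to avoid frequent allocation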
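The combination of chunked distribution and lock-free result collection can be sketched with the standard library alone. The function name `filterParallel` and its parameters are hypothetical; the sketch assumes `workers` and `chunk` are both at least 1.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// filterParallel checks every node with `fits` using a fixed number of
// workers. Workers pull chunk start indices from a channel, and feasible
// nodes are written into a preallocated slice at an atomically reserved
// position, so no mutex is needed around the result.
// Assumes workers >= 1 and chunk >= 1.
func filterParallel(nodes []string, workers, chunk int, fits func(string) bool) []string {
	feasible := make([]string, len(nodes))
	var count int32

	chunks := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for start := range chunks {
				end := start + chunk
				if end > len(nodes) {
					end = len(nodes)
				}
				for i := start; i < end; i++ {
					if fits(nodes[i]) {
						pos := atomic.AddInt32(&count, 1) - 1
						feasible[pos] = nodes[i]
					}
				}
			}
		}()
	}
	for start := 0; start < len(nodes); start += chunk {
		chunks <- start
	}
	close(chunks)
	wg.Wait() // all writes to feasible happen before this returns

	result := feasible[:count]
	sort.Strings(result) // completion order is nondeterministic; sort for display
	return result
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	fmt.Println(filterParallel(nodes, 2, 2, func(n string) bool { return n != "n3" }))
	// [n1 n2 n4 n5]
}
```

Each feasible node reserves a distinct slot through the atomic counter, which is why concurrent writes to the slice never collide.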

4.3 Key Filter Plugin Implementations

4.3.1 Node Resource Filtering

The resource check is the scheduler's most basic filter condition:

func (f *Fit) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    if !f.hasEnoughResources(pod, nodeInfo) {
        return framework.NewStatus(framework.Unschedulable, "Insufficient cpu/memory")
    }
    return framework.NewStatus(framework.Success)
}

func (f *Fit) hasEnoughResources(pod *v1.Pod, nodeInfo *framework.NodeInfo) bool {
    allocatable := nodeInfo.Allocatable
    requested := computePodResourceRequest(pod)
    
    // Check CPU
    if allocatable.MilliCPU < requested.MilliCPU+nodeInfo.Requested.MilliCPU {
        return false
    }
    
    // Check memory
    if allocatable.Memory < requested.Memory+nodeInfo.Requested.Memory {
        return false
    }
    
    // Check extended resources (GPUs and so on)
    for rName, rQuant := range requested.ScalarResources {
        if allocatable.ScalarResources[rName] < rQuant+nodeInfo.Requested.ScalarResources[rName] {
            return false
        }
    }
    
    return true
}

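Factors in the resource calculation:

Pod resource request: the sum of the containers' requests, raised to the largest init container request where that is higher, plus any Pod overhead
Resources already allocated on the node: the sum of the requests of Pods already scheduled there
Node allocatable resources: node capacity minus system reservations
Extended resources: special hardware such as GPUs and FPGAs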
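As a worked example of these factors, the sketch below computes an effective Pod request and checks it against a node. The `resources` type and both functions are hypothetical simplifications that cover only CPU and memory.

```go
package main

import "fmt"

// resources is a hypothetical pair of CPU (millicores) and memory (MiB).
type resources struct{ milliCPU, memory int64 }

// podRequest computes the effective request of a Pod: the sum over regular
// containers, raised to the largest init container request where that is
// higher (init containers run one at a time, before the regular containers),
// plus the Pod overhead.
func podRequest(containers, initContainers []resources, overhead resources) resources {
	var req resources
	for _, c := range containers {
		req.milliCPU += c.milliCPU
		req.memory += c.memory
	}
	for _, ic := range initContainers {
		if ic.milliCPU > req.milliCPU {
			req.milliCPU = ic.milliCPU
		}
		if ic.memory > req.memory {
			req.memory = ic.memory
		}
	}
	req.milliCPU += overhead.milliCPU
	req.memory += overhead.memory
	return req
}

// fits reports whether the request fits into what is left on the node.
func fits(req, allocatable, alreadyRequested resources) bool {
	return alreadyRequested.milliCPU+req.milliCPU <= allocatable.milliCPU &&
		alreadyRequested.memory+req.memory <= allocatable.memory
}

func main() {
	req := podRequest(
		[]resources{{200, 128}, {300, 256}}, // regular containers: 500m, 384Mi in total
		[]resources{{1000, 64}},             // one CPU-hungry init container
		resources{},
	)
	fmt.Println(req)                                                    // {1000 384}
	fmt.Println(fits(req, resources{4000, 8192}, resources{3200, 1024})) // false: 4200m > 4000m
}
```

The init container dominates the CPU request here, so the Pod does not fit even though its regular containers alone would.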

4.3.2 Node Affinity Filtering

Node affinity implements more expressive node selection logic:

func (pl *NodeAffinity) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    node := nodeInfo.Node()
    
    if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil {
        nodeAffinity := pod.Spec.Affinity.NodeAffinity
        
        // Check the hard (required) terms
        if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
            terms := nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
            if !v1helper.MatchNodeSelectorTerms(node, terms) {
                return framework.NewStatus(framework.UnschedulableAndUnresolvable, "node(s) didn't match node selector")
            }
        }
        
        // Soft (preferred) terms do not reject nodes; they only influence scoring
        // ...
    }
    return framework.NewStatus(framework.Success)
}

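Types of affinity rule:

requiredDuringSchedulingIgnoredDuringExecution: hard requirements that must be met
preferredDuringSchedulingIgnoredDuringExecution: soft requirements that are satisfied where possible
nodeSelector: simple label matching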

4.3.3 Taints and Tolerations

Taints let a node repel Pods and make dedicated nodes possible:

func (pl *TaintToleration) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    if len(nodeInfo.Node().Spec.Taints) == 0 {
        return framework.NewStatus(framework.Success)
    }
    
    // Check that the Pod's tolerations cover every NoSchedule and NoExecute taint
    if !v1helper.TolerationsTolerateTaintsWithFilter(pod.Spec.Tolerations, nodeInfo.Node().Spec.Taints, func(t *v1.Taint) bool {
        return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute
    }) {
        return framework.NewStatus(framework.UnschedulableAndUnresolvable, "node(s) had taints that the pod didn't tolerate")
    }
    
    return framework.NewStatus(framework.Success)
}

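Taint effects:

NoSchedule: new Pods are not scheduled onto the node (existing Pods are unaffected)
PreferNoSchedule: the scheduler tries to avoid the node but may still use it
NoExecute: new Pods are not scheduled, and existing Pods that do not tolerate the taint are evicted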
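The matching rule between tolerations and taints can be sketched as below. The `taint` and `toleration` types are hypothetical reductions of the API types, and `schedulable` is a name made up for this example.

```go
package main

import "fmt"

// taint is a reduced, hypothetical form of a node taint.
type taint struct{ key, value, effect string }

// toleration keeps only the fields that matter for matching.
// operator is "Equal" (also the default when empty) or "Exists";
// an empty effect matches every effect, an empty key matches every key.
type toleration struct{ key, operator, value, effect string }

// tolerates reports whether a single toleration covers a single taint.
func tolerates(t toleration, ta taint) bool {
	if t.effect != "" && t.effect != ta.effect {
		return false
	}
	if t.key != "" && t.key != ta.key {
		return false
	}
	switch t.operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.value == ta.value
	default:
		return false
	}
}

// schedulable reports whether every NoSchedule/NoExecute taint is tolerated.
// PreferNoSchedule taints are skipped because they only affect scoring.
func schedulable(tolerations []toleration, taints []taint) bool {
	for _, ta := range taints {
		if ta.effect != "NoSchedule" && ta.effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, t := range tolerations {
			if tolerates(t, ta) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []taint{{"gpu", "true", "NoSchedule"}, {"spot", "", "PreferNoSchedule"}}
	fmt.Println(schedulable(nil, taints))                                    // false
	fmt.Println(schedulable([]toleration{{"gpu", "Exists", "", ""}}, taints)) // true
}
```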

4.4 The Scoring Phase in Depth

The scoring phase computes a weighted score for each candidate node:

func (g *genericScheduler) prioritizeNodes(
    ctx context.Context,
    state *framework.CycleState,
    pod *v1.Pod,
    nodes []*v1.Node,
) (framework.NodeScoreList, error) {
    // Run all score plugins. RunScorePlugins calls Score for every node,
    // then the score normalization extensions, and applies each plugin's weight.
    scoresMap, status := g.framework.RunScorePlugins(ctx, state, pod, nodes)
    if !status.IsSuccess() {
        return nil, status.AsError()
    }
    
    // Sum the weighted scores of all plugins for each node
    result := make(framework.NodeScoreList, len(nodes))
    for i := range nodes {
        result[i] = framework.NodeScore{Name: nodes[i].Name, Score: 0}
        for pluginName := range scoresMap {
            result[i].Score += scoresMap[pluginName][i].Score
        }
    }
    
    return result, nil
}

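Optimization strategies in the scoring phase:

Score normalization: scores from different plugins are mapped onto a common range (0-100)
Weighting: plugins carry different weights, which determines their influence on the final decision
Parallel computation: independent scoring work can run in parallel
Result reuse: intermediate results can be computed once and reused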
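Normalization and weighting can be worked through on a small example. The sketch below assumes non-negative raw scores and uses a simple "scale by the maximum" normalization; real plugins may normalize differently. All function names and the plugin scores are made up for illustration.

```go
package main

import "fmt"

const maxNodeScore = 100 // same upper bound as the 0-100 range described above

// normalize rescales non-negative raw scores so the best node gets maxNodeScore.
func normalize(raw []int64) []int64 {
	var top int64
	for _, s := range raw {
		if s > top {
			top = s
		}
	}
	out := make([]int64, len(raw))
	if top == 0 {
		return out // nothing to rescale
	}
	for i, s := range raw {
		out[i] = s * maxNodeScore / top
	}
	return out
}

// combine sums the normalized scores of every plugin per node,
// each multiplied by that plugin's weight.
func combine(raw map[string][]int64, weights map[string]int64, numNodes int) []int64 {
	total := make([]int64, numNodes)
	for plugin, scores := range raw {
		for i, s := range normalize(scores) {
			total[i] += s * weights[plugin]
		}
	}
	return total
}

func main() {
	raw := map[string][]int64{
		"ImageLocality":      {0, 400, 800}, // e.g. MiB of images already on the node
		"BalancedAllocation": {90, 60, 30},
	}
	weights := map[string]int64{"ImageLocality": 1, "BalancedAllocation": 2}
	fmt.Println(combine(raw, weights, 3)) // [200 182 166]
}
```

With these weights the first node wins even though it has none of the images cached, because the balance score counts double.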

4.5 Key Score Plugin Implementations

4.5.1 Balanced Resource Allocation

The balanced allocation plugin improves cluster resource utilization:

func (r *BalancedAllocation) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    nodeInfo, err := r.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, err.Error())
    }
    
    allocatable := nodeInfo.Allocatable
    requested := nodeInfo.Requested
    podRequest := computePodResourceRequest(pod)
    
    // Fraction of each resource in use if the Pod were placed on this node
    cpuFraction := fractionOfResource(requested.MilliCPU+podRequest.MilliCPU, allocatable.MilliCPU)
    memoryFraction := fractionOfResource(requested.Memory+podRequest.Memory, allocatable.Memory)
    
    // Imbalance: sum of squared differences between the CPU fraction and every other resource fraction
    variance := math.Pow(cpuFraction-memoryFraction, 2)
    for rName := range allocatable.ScalarResources {
        rFraction := fractionOfResource(requested.ScalarResources[rName]+podRequest.ScalarResources[rName], allocatable.ScalarResources[rName])
        variance += math.Pow(cpuFraction-rFraction, 2)
    }
    
    // Lower imbalance means a higher score; clamp so the score stays within [0, MaxNodeScore]
    if variance > 1 {
        variance = 1
    }
    score := int64((1 - variance) * float64(framework.MaxNodeScore))
    return score, nil
}

func fractionOfResource(requested, capacity int64) float64 {
    if capacity == 0 {
        return 0
    }
    return float64(requested) / float64(capacity)
}

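Goals of balanced allocation:

Avoid hot spots: keep individual nodes from being overloaded
Balance resources: keep the usage ratios of CPU, memory, and other resources even
Leave headroom: reserve capacity for new Pods and traffic bursts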

4.5.2 Image Locality

Image locality shortens Pod startup time:

func (pl *ImageLocality) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Make sure the node exists in the snapshot
    if _, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName); err != nil {
        return 0, framework.NewStatus(framework.Error, err.Error())
    }
    
    // Sum the sizes of the Pod's images that are already present on this node
    totalSize := int64(0)
    for _, container := range pod.Spec.Containers {
        if imageState, found := pl.imageStates[container.Image]; found {
            if _, exists := imageState.nodes[nodeName]; exists {
                totalSize += imageState.size
            }
        }
    }
    
    // Map the total image size to a score
    score := int64(0)
    switch {
    case totalSize >= pl.maxImageSize:
        score = framework.MaxNodeScore
    case totalSize > 0:
        // Convert to float64 first: MaxNodeScore is an int64 constant
        score = int64(float64(framework.MaxNodeScore) * float64(totalSize) / float64(pl.maxImageSize))
    }
    
    return score, nil
}

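Image caching considerations:

Node image state: the scheduler tracks which images are already present on each node
Image popularity: how widely an image is used can be taken into account when deciding which images matter most
Garbage collection: unused images are cleaned up periodically on the node (this is done by the kubelet, not by the scheduler)
Prefetching: images likely to be needed can be pulled ahead of time by external tooling; the default scheduler does not do this itself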

5. Binding Mechanism and Concurrency Control

5.1 The Binding Process in Detail

Binding is the step that persists the scheduling decision:

func (sched *Scheduler) bind(ctx context.Context, state *framework.CycleState, assumed *v1.Pod, targetNode string) error {
    binding := &v1.Binding{
        ObjectMeta: metav1.ObjectMeta{Namespace: assumed.Namespace, Name: assumed.Name, UID: assumed.UID},
        Target: v1.ObjectReference{Kind: "Node", Name: targetNode},
    }
    
    // Run PreBind plugins
    if status := sched.Framework.RunPreBindPlugins(ctx, state, assumed, targetNode); !status.IsSuccess() {
        return status.AsError()
    }
    
    // Perform the bind through the API client
    err := sched.client.CoreV1().Pods(assumed.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
    if err != nil {
        // Bind failed: drop the assumed Pod from the cache so its resources are released
        sched.SchedulerCache.ForgetPod(assumed)
        return err
    }
    
    // Run PostBind plugins
    sched.Framework.RunPostBindPlugins(ctx, state, assumed, targetNode)
    
    return nil
}

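Points to watch in the binding phase:

Atomicity: the bind operation must be atomic
State synchronization: the scheduler cache is updated after binding
Error handling: network failures and conflicts must be handled
Eventual consistency: the result relies on persistence in the API Server

5.2 Cache Consistency Design

The scheduler cache maintains a local view of cluster state: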

type Cache interface {
    AssumePod(pod *v1.Pod) error
    ForgetPod(pod *v1.Pod) error
    AddPod(pod *v1.Pod) error
    UpdatePod(oldPod, newPod *v1.Pod) error
    RemovePod(pod *v1.Pod) error
    GetPod(pod *v1.Pod) *v1.Pod
    IsAssumedPod(pod *v1.Pod) bool
    // ...
}

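Mechanisms that keep the cache consistent:

Assume mechanism: during scheduling, a Pod is provisionally assumed to be bound to its node
Event watching: events from the API Server update the cache
Periodic cleanup: on a timer, assumed Pods that were never confirmed are expired so the cache converges on the actual cluster state
Versioning: version numbers are used to detect conflicts and stale data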
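The assume, forget, and confirm transitions can be sketched with a small in-memory structure. `podCache` is a hypothetical, single-goroutine simplification: it identifies Pods and nodes by name and leaves out locking and expiry.

```go
package main

import (
	"errors"
	"fmt"
)

// podCache is a hypothetical sketch of the assume/confirm protocol.
type podCache struct {
	nodeOf  map[string]string // pod -> node, as seen by the scheduler
	assumed map[string]bool   // pods placed optimistically, not yet confirmed
}

func newPodCache() *podCache {
	return &podCache{nodeOf: map[string]string{}, assumed: map[string]bool{}}
}

// AssumePod records the decision before the bind call returns, so the next
// scheduling cycle already sees the node's resources as taken.
func (c *podCache) AssumePod(pod, node string) error {
	if _, exists := c.nodeOf[pod]; exists {
		return errors.New("pod already in cache")
	}
	c.nodeOf[pod] = node
	c.assumed[pod] = true
	return nil
}

// ForgetPod rolls back an assumption, for example after a failed bind.
func (c *podCache) ForgetPod(pod string) error {
	if !c.assumed[pod] {
		return errors.New("pod is not assumed")
	}
	delete(c.nodeOf, pod)
	delete(c.assumed, pod)
	return nil
}

// AddPod confirms the placement once the API Server reports the bound Pod.
func (c *podCache) AddPod(pod, node string) {
	c.nodeOf[pod] = node
	delete(c.assumed, pod)
}

func main() {
	c := newPodCache()
	_ = c.AssumePod("web-1", "node-a")
	fmt.Println(c.assumed["web-1"], c.nodeOf["web-1"]) // true node-a

	c.AddPod("web-1", "node-a") // confirmation arrives via the watch
	fmt.Println(c.assumed["web-1"], c.ForgetPod("web-1"))
	// false pod is not assumed
}
```

Assuming a Pod lets the scheduling cycle move on to the next Pod while the bind request is still in flight.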

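6. Source Code of Advanced Scheduling Features

6.1 Inter-Pod Affinity and Anti-Affinity

Inter-Pod affinity implements complex topology constraints: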

func (pl *InterPodAffinity) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    if nodeInfo.Node() == nil {
        return framework.NewStatus(framework.Error, "node not found")
    }
    
    // Check anti-affinity
    if !satisfyPodAntiAffinity(pod, nodeInfo, pl) {
        return framework.NewStatus(framework.Unschedulable, "node(s) didn't satisfy pod anti-affinity")
    }
    
    // Check affinity
    if !satisfyPodAffinity(pod, nodeInfo, pl) {
        return framework.NewStatus(framework.Unschedulable, "node(s) didn't satisfy pod affinity")
    }
    
    return framework.NewStatus(framework.Success)
}

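Use cases:

High-availability deployment: spread the Pods of one service across different nodes
Co-location: place tightly cooperating services on the same node
Dedicated nodes: ensure certain Pods have a node's resources to themselves

6.2 Topology Spread Constraints

Topology spread gives fine-grained control over failure domains: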

func (pl *PodTopologySpread) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, err.Error())
    }
    
    // Look up the topology key and this node's value for it
    topologyKey := pl.defaultConstraints[0].TopologyKey
    topologyValue := nodeInfo.Node().Labels[topologyKey]
    
    // Number of matching Pods in this node's topology domain
    podCount := pl.topologyPairToPodCounts[topologyPair{key: topologyKey, value: topologyValue}]
    
    // Smallest Pod count across all topology domains
    minCount := math.MaxInt32
    for _, count := range pl.topologyPairToPodCounts {
        if count < minCount {
            minCount = count
        }
    }
    
    // Skew: how far this domain is above the least-loaded one
    skew := podCount - minCount
    if skew < 0 {
        skew = 0
    }
    
    // Smaller skew means a higher score; clamp at zero so the score never goes negative
    score := framework.MaxNodeScore - int64(skew)*framework.MaxNodeScore/int64(pl.defaultConstraints[0].MaxSkew)
    if score < 0 {
        score = 0
    }
    
    return score, nil
}

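Topology spread strategies:

Hard constraints: distribution requirements that must be met
Soft preferences: distribution goals that are met where possible
Multiple topology domains: constraints can span several topology levels (zone > rack > node)
Weighting: different topology keys can contribute with different weights when scoring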
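For a hard constraint, the check boils down to the skew that a placement would produce: the matching Pod count in the target domain plus the incoming Pod, minus the smallest count across all domains. The sketch below works through that arithmetic; the function names and zone names are made up for illustration.

```go
package main

import "fmt"

// skewAfterPlacement returns, for each topology domain, the skew that would
// result from adding one matching Pod there: (count in that domain + 1)
// minus the minimum count across all domains.
func skewAfterPlacement(counts map[string]int) map[string]int {
	first := true
	lowest := 0
	for _, c := range counts {
		if first || c < lowest {
			lowest, first = c, false
		}
	}
	skew := make(map[string]int, len(counts))
	for domain, c := range counts {
		skew[domain] = c + 1 - lowest
	}
	return skew
}

// allowedDomains marks the domains where the resulting skew stays within maxSkew.
func allowedDomains(counts map[string]int, maxSkew int) map[string]bool {
	allowed := make(map[string]bool, len(counts))
	for domain, s := range skewAfterPlacement(counts) {
		allowed[domain] = s <= maxSkew
	}
	return allowed
}

func main() {
	counts := map[string]int{"zone-a": 3, "zone-b": 1, "zone-c": 2}
	fmt.Println(allowedDomains(counts, 1))
	// map[zone-a:false zone-b:true zone-c:false]
}
```

With `maxSkew: 1`, only the least-loaded zone can take the next Pod; raising `maxSkew` to 2 would also admit `zone-c`.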

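7. Performance Optimization Mechanisms

7.1 Scheduler Performance Analysis

Strategies for optimizing scheduling performance in large clusters:

Node information snapshot: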

type nodeInfoSnapshot struct {
    nodeInfoMap map[string]*framework.NodeInfo
    generation  int64
    mu          sync.RWMutex
}

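A snapshot is created at the start of each scheduling cycle
This avoids inconsistent decisions caused by state changing in the middle of a cycle

Parallel processing:
```
parallelize.Until(ctx, len(nodes), checkNode)
```
- Configurable parallelism (the `parallelism` field of the scheduler configuration, 16 by default)
- Chunked work distribution balances load across workers
- Batching reduces lock contention
Caching:
- Node information cache
- Pod state cache
- Image state cache
Incremental processing:
- Only changed Pods and nodes are processed
- Updates are event-driven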
7.2 Extender Support

For complex scenarios that need decisions from an external system:

func (g *genericScheduler) runExtenders(
    ctx context.Context,
    pod *v1.Pod,
    feasibleNodes []*v1.Node,
) ([]*v1.Node, error) {
    var err error
    for _, extender := range g.extenders {
        // Skip extenders that do not manage any resource this Pod requests
        if !extender.IsInterested(pod) {
            continue
        }
        // Let the extender narrow down the feasible nodes (simplified call)
        feasibleNodes, err = extender.Filter(pod, feasibleNodes)
        if err != nil {
            return nil, err
        }
    }
    return feasibleNodes, nil
}

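Scenarios where extenders fit:

Custom resource scheduling: management of special hardware resources
Cross-cluster scheduling: federated cluster scenarios
Policy engine integration: complex business rules
Resource reservation systems: integration with external resource managers

8. Scheduler Configuration and Extension

8.1 Scheduler Configuration in Detail

Scheduling policy is configured through KubeSchedulerConfiguration: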

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      preFilter:
        enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
      filter:
        enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
          - name: VolumeBinding
      postFilter:
        enabled:
          - name: DefaultPreemption
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 1
          - name: ImageLocality
            weight: 1
          - name: InterPodAffinity
            weight: 2
    pluginConfig:
      - name: InterPodAffinity
        args:
          hardPodAffinityWeight: 5
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated

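Key configuration items:

Enabling and disabling plugins: controls which plugins run at each extension point
Plugin weights: adjust the relative importance of score plugins
Plugin arguments: customize plugin behavior
Multi-scheduler setups: several profiles, or several scheduler instances, can run side by side

8.2 Custom Plugin Development

The steps for developing a custom scheduling plugin are:

Implement the plugin interfaces: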

type Plugin interface {
    Name() string
}

type FilterPlugin interface {
    Plugin
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
    Plugin
    Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
}

Register the plugin:

func NewCustomPlugin(_ runtime.Object, handle framework.Handle) (framework.Plugin, error) {
    return &CustomPlugin{handle: handle}, nil
}

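Package and deploy:
- Compile the scheduler together with the plugin into a standalone binary
- Configure the scheduler to use the custom plugin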
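The overall shape of a custom filter plugin, from interface to registry entry, can be sketched without importing the Kubernetes modules. Everything below is a local stand-in: `nodeInfo`, `status`, `filterPlugin`, the `TierMatch` plugin, and the `tier` label are hypothetical and only mirror the structure of the real interfaces shown above.

```go
package main

import "fmt"

// Local stand-ins so the sketch runs without the Kubernetes modules.
type nodeInfo struct {
	name   string
	labels map[string]string
}

type status struct {
	ok     bool
	reason string
}

type filterPlugin interface {
	Name() string
	Filter(podLabels map[string]string, node nodeInfo) status
}

// tierMatch is a hypothetical plugin: a Pod labelled tier=X may only run on
// nodes labelled tier=X; Pods without the label may run anywhere.
type tierMatch struct{}

func (tierMatch) Name() string { return "TierMatch" }

func (tierMatch) Filter(podLabels map[string]string, node nodeInfo) status {
	want, ok := podLabels["tier"]
	if !ok || node.labels["tier"] == want {
		return status{ok: true}
	}
	return status{reason: fmt.Sprintf("node %s is not in tier %q", node.name, want)}
}

// registry maps plugin names to factories, in the spirit of the scheduler's Registry.
var registry = map[string]func() filterPlugin{
	"TierMatch": func() filterPlugin { return tierMatch{} },
}

func main() {
	pl := registry["TierMatch"]()
	pod := map[string]string{"tier": "gold"}
	fmt.Println(pl.Name(), pl.Filter(pod, nodeInfo{"n1", map[string]string{"tier": "gold"}}).ok)
	fmt.Println(pl.Filter(pod, nodeInfo{"n2", nil}).reason)
}
```

In a real plugin the factory would receive the framework handle, and the plugin name used in the registry is the one referenced from the scheduler configuration.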

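9. Scheduler Evolution and Best Practices

9.1 Evolution of the Scheduler

Initial version: simple scheduling based on predicates and priority functions
Multiple scheduler support: a cluster can run more than one scheduler
Scheduling framework: the plugin architecture was introduced in version 1.15
Scheduler configuration API: a standardized way to configure the scheduler
Performance work: continued improvements for large clusters

9.2 Best Practices

Setting resource requests: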

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"

Set requests sensibly so the scheduler can place Pods well
Set limits to prevent resource exhaustion

Affinity configuration:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-west-2a
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - store
      topologyKey: "kubernetes.io/hostname"

Topology spread constraints:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app

Using priorities:
```
priorityClassName: high-priority
```

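10. Summary

The Kubernetes scheduler is a complex and carefully engineered system. Its design reflects the following core principles:

Extensibility: the plugin architecture lets new scheduling behavior be added without changing the core
Efficiency: parallel processing and caching keep performance high
Flexibility: many scheduling policies and constraints are supported
Reliability: thorough error handling and state management
Fairness: priority and preemption protect important workloads

Understanding the scheduler's internals matters most in the following situations:

Optimizing cluster resource utilization
Troubleshooting scheduling performance problems
Designing highly available application deployments
Developing custom scheduling policies
Integrating complex business requirements

As Kubernetes continues to evolve, the scheduler is likely to gain further capabilities, such as:

Scheduling decisions driven by machine learning
Dynamic resource adjustment in real time
Federated scheduling across clusters
Deeper integration with edge computing scenarios

A solid understanding of the scheduler's principles and implementation makes it easier to get the most out of Kubernetes and to build efficient, reliable, and flexible cloud-native infrastructure.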

posted @ 2025-10-03 15:20 NeoLshu