Leader election logic of Controller-Manager and Scheduler in k8s

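The K8s control plane consists of apiserver, controller-manager, scheduler, and etcd. In a highly available cluster, some of these components have to elect a leader. etcd stores all of the cluster's state; it handles reads and writes and replicates data between etcd members, so it has strict consistency requirements and uses the relatively complex Raft algorithm to choose the leader that commits data. apiserver is the cluster entry point and is itself a stateless web server, so requests can simply be load-balanced across multiple apiserver instances with no election. Controller-manager and scheduler are task-style components: the controllers built into controller-manager watch apiserver for the latest change events on each kind of resource object and reconcile actual state toward desired state, and the scheduler watches for pods not yet bound to a node and picks a node for them. Running several copies of these tasks at the same time is unnecessary, so controller-manager and scheduler also need leader election. Their election logic differs from etcd's, though: it only has to ensure that exactly one controller-manager instance and exactly one scheduler instance is active at a time, with no concern for data consistency or synchronization between the instances.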

 

Leader election flags as described in the kube-scheduler help output

/ # kube-scheduler -h 2>&1 | grep -i leader
      --leader-elect                                                      Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. (default true)
      --leader-elect-lease-duration duration                              The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
      --leader-elect-renew-deadline duration                              The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. (default 10s)
      --leader-elect-resource-lock endpoints                              The type of resource object that is used for locking during leader election. Supported options are endpoints (default) and `configmaps`. (default "endpoints")
      --leader-elect-retry-period duration                                The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)
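
The three durations above constrain each other. As a minimal sketch, the check below assumes the rules that, to my understanding, client-go applies when an elector is built: the lease duration must exceed the renew deadline, and the renew deadline must exceed the retry period times a jitter factor of 1.2. The helper name validateTimings is hypothetical and is not part of kube-scheduler.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor is assumed to match the 1.2 multiplier client-go applies
// to the retry period; treat the exact value as an assumption.
const jitterFactor = 1.2

// validateTimings is a hypothetical helper that checks the relationship
// between the three leader election durations.
func validateTimings(leaseDuration, renewDeadline, retryPeriod time.Duration) error {
	if leaseDuration <= renewDeadline {
		return errors.New("leaseDuration must be greater than renewDeadline")
	}
	if renewDeadline <= time.Duration(jitterFactor*float64(retryPeriod)) {
		return errors.New("renewDeadline must be greater than retryPeriod*jitterFactor")
	}
	return nil
}

func main() {
	// The defaults from the help text: 15s lease, 10s renew deadline, 2s retry period.
	fmt.Println(validateTimings(15*time.Second, 10*time.Second, 2*time.Second))
	// A renew deadline equal to the lease duration fails the first check.
	fmt.Println(validateTimings(10*time.Second, 10*time.Second, 2*time.Second))
}
```

With the defaults, a leader gets several renewal attempts (one every 2s) inside its 10s renew deadline, and a follower waits the full 15s lease before it tries to take over.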

 

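The analysis below is based on the k8s 1.11 source, with Endpoints as the lock resource

1. On startup the scheduler first runs leader election; once it wins, the callback invokes the scheduler's run function and the scheduling logic starts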

// https://sourcegraph.com/github.com/kubernetes/kubernetes@release-1.11/-/blob/cmd/kube-scheduler/app/server.go

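func Run(c schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
    ......
    // Prepare a reusable run function.
    run := func(stopCh <-chan struct{}) {
        sched.Run()
        <-stopCh
    }

    // If leader election is enabled, run via LeaderElector until done and exit.
    if c.LeaderElection != nil {
        c.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
            // Called once this instance wins the election: start scheduling.
            OnStartedLeading: run,
            OnStoppedLeading: func() {
                utilruntime.HandleError(fmt.Errorf("lost master"))
            },
        }
        leaderElector, err := leaderelection.NewLeaderElector(*c.LeaderElection)
        if err != nil {
            return fmt.Errorf("couldn't create leader elector: %v", err)
        }

        // Blocks while this instance campaigns for, then holds, the lease.
        leaderElector.Run()

        return fmt.Errorf("lost lease")
    }

    // Leader election is disabled, so run inline until done.
    run(stopCh)
    return fmt.Errorf("finished without leader elect")
}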

 

2. Run calls acquire directly to try to become leader

// Run starts the leader election loop
func (le *LeaderElector) Run() {
    defer func() {
        runtime.HandleCrash()
        le.config.Callbacks.OnStoppedLeading()
    }()
    le.acquire()
    stop := make(chan struct{})
    go le.config.Callbacks.OnStartedLeading(stop)
    le.renew()
    close(stop)
}
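
The control flow of Run can be tried out in isolation. The sketch below is a toy stand-in, not client-go: toyElector, callbacks, and runOnce are hypothetical names, and acquire and renew are replaced by stubs that return immediately so the order of events is visible.

```go
package main

import (
	"fmt"
	"sync"
)

// callbacks mirrors the shape of the two callbacks used in the excerpt above.
type callbacks struct {
	OnStartedLeading func(stop <-chan struct{})
	OnStoppedLeading func()
}

// toyElector is a hypothetical stand-in for LeaderElector.
type toyElector struct {
	cb      callbacks
	acquire func() // blocks until leadership is gained
	renew   func() // blocks until leadership is lost
}

// Run follows the same sequence as the excerpt: acquire, start the
// workload in a goroutine, renew until failure, then stop the workload.
func (e *toyElector) Run() {
	defer e.cb.OnStoppedLeading()
	e.acquire()
	stop := make(chan struct{})
	go e.cb.OnStartedLeading(stop)
	e.renew()
	close(stop)
}

// runOnce wires stub functions into the elector and records what happens.
func runOnce() []string {
	var events []string
	var wg sync.WaitGroup
	started := make(chan struct{})

	wg.Add(1)
	e := &toyElector{
		cb: callbacks{
			OnStartedLeading: func(stop <-chan struct{}) {
				defer wg.Done()
				events = append(events, "leading")
				close(started)
				<-stop // the workload runs until leadership is lost
			},
			OnStoppedLeading: func() { events = append(events, "stopped") },
		},
		acquire: func() { events = append(events, "acquire") },
		renew: func() {
			<-started // wait for the workload so the trace is deterministic
			events = append(events, "renew failed")
		},
	}
	e.Run()
	wg.Wait()
	return events
}

func main() {
	fmt.Println(runOnce())
}
```

The point to notice is that the workload only starts after acquire returns, and the stop channel is closed as soon as renew gives up, which is how a former leader is told to stop working.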

 

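3. acquire calls tryAcquireOrRenew in a loop, at the interval set by leader-elect-retry-period. Here le.config.Lock is an EndpointsLock: EndpointsLock.Identity() returns this instance's identity (its hostname plus a UUID suffix), and EndpointsLock.Get asks apiserver for the election record stored in etcd.

If fetching the Endpoints election record from apiserver fails because the record does not exist yet, the candidate tries to make itself leader.

Judged by its own observedTime, if the lease (15s) has not yet expired and the candidate is not the leader, it may not take over leadership, so there is nothing more to do in this round.

If the candidate is already the leader, it tries to renew with the current time whether or not the lease has expired, keeping acquireTime and the leader transition count unchanged; otherwise the transition count is incremented by 1.

It then sends apiserver a request to update the Endpoints election record. apiserver guarantees that concurrent updates from multiple clients are atomic by comparing resourceVersion (which corresponds to the modifiedIndex in etcd), so only one client's update succeeds and the others get a 409.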
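
The resourceVersion check is ordinary optimistic concurrency. The sketch below imitates it with an in-memory store; store, object, and race are hypothetical names, and the integer version is a simplification, since a real resourceVersion is an opaque string managed by apiserver.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict plays the role of the HTTP 409 returned by apiserver.
var errConflict = errors.New("409 conflict: resourceVersion mismatch")

// object is a toy stand-in for the Endpoints object that holds the lock.
type object struct {
	ResourceVersion int // simplified: the real field is an opaque string
	Holder          string
}

// store is a hypothetical in-memory replacement for apiserver plus etcd.
type store struct {
	mu  sync.Mutex
	obj object
}

func (s *store) Get() object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// Update succeeds only if the caller's copy still carries the current version.
func (s *store) Update(o object) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if o.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// race lets n candidates read the same version and then update concurrently.
// It returns how many updates succeeded and the final version.
func race(n int) (winners int, finalVersion int) {
	s := &store{obj: object{ResourceVersion: 1}}
	snapshots := make([]object, n)
	for i := range snapshots {
		snapshots[i] = s.Get()
	}
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			o := snapshots[i]
			o.Holder = fmt.Sprintf("candidate-%d", i)
			if err := s.Update(o); err == nil {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return winners, s.Get().ResourceVersion
}

func main() {
	winners, version := race(3)
	fmt.Println("winners:", winners, "final version:", version)
}
```

Which candidate wins varies from run to run, but every loser holds a stale version once the first write lands, so only one update can ever go through.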

The Lock is initialized as an EndpointsLock
type EndpointsLock struct {
    // EndpointsMeta should contain a Name and a Namespace of an
    // Endpoints object that the LeaderElector will attempt to lead.
    EndpointsMeta metav1.ObjectMeta
    Client        corev1client.EndpointsGetter
    LockConfig    ResourceLockConfig
    e             *v1.Endpoints
}

// Get returns the election record from a Endpoints Annotation
func (el *EndpointsLock) Get() (*LeaderElectionRecord, error) {
    var record LeaderElectionRecord
    var err error
    el.e, err = el.Client.Endpoints(el.EndpointsMeta.Namespace).Get(el.EndpointsMeta.Name, metav1.GetOptions{})
    if err != nil {
        // Includes NotFound when the Endpoints object has not been created yet.
        return nil, err
    }
    // The record is stored as JSON in an annotation on the Endpoints object.
    if recordBytes, found := el.e.Annotations[LeaderElectionRecordAnnotationKey]; found {
        if err := json.Unmarshal([]byte(recordBytes), &record); err != nil {
            return nil, err
        }
    }
    return &record, nil
}
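
To make the annotation format concrete, here is a minimal sketch that decodes such a record outside of Kubernetes. The annotation key and JSON field names are what I understand client-go to use in this release; the struct is simplified (times are kept as plain strings rather than metav1.Time), and the sample values and the helper name recordFromAnnotations are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leaderAnnotationKey is assumed to match LeaderElectionRecordAnnotationKey.
const leaderAnnotationKey = "control-plane.alpha.kubernetes.io/leader"

// electionRecord is a simplified copy of LeaderElectionRecord.
type electionRecord struct {
	HolderIdentity       string `json:"holderIdentity"`
	LeaseDurationSeconds int    `json:"leaseDurationSeconds"`
	AcquireTime          string `json:"acquireTime"`
	RenewTime            string `json:"renewTime"`
	LeaderTransitions    int    `json:"leaderTransitions"`
}

// recordFromAnnotations is a hypothetical helper that decodes the record.
// A missing annotation yields an empty record, as in Get above.
func recordFromAnnotations(annotations map[string]string) (*electionRecord, error) {
	var record electionRecord
	raw, found := annotations[leaderAnnotationKey]
	if !found {
		return &record, nil
	}
	if err := json.Unmarshal([]byte(raw), &record); err != nil {
		return nil, err
	}
	return &record, nil
}

func main() {
	// Sample annotation with invented values.
	annotations := map[string]string{
		leaderAnnotationKey: `{"holderIdentity":"node-1_example-uuid","leaseDurationSeconds":15,"acquireTime":"2020-09-11T06:00:00Z","renewTime":"2020-09-11T06:34:00Z","leaderTransitions":3}`,
	}
	record, err := recordFromAnnotations(annotations)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("holder:", record.HolderIdentity, "transitions:", record.LeaderTransitions)
}
```

On a live cluster of this vintage the same JSON should be visible in the annotations of the kube-scheduler Endpoints object in kube-system.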

// If we are not the leader, try to become leader; if we already are, try to renew the lease.
// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
func (le *LeaderElector) tryAcquireOrRenew() bool {
    now := metav1.Now()
    // Prepare a record that names ourselves as leader, to be written if we acquire or renew.
    // Identity() returns our own hostname + "_" + string(uuid.NewUUID()).
    leaderElectionRecord := rl.LeaderElectionRecord{
        HolderIdentity:       le.config.Lock.Identity(),
        LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
        RenewTime:            now,
        AcquireTime:          now,
    }

    // 1. obtain or create the ElectionRecord
    oldLeaderElectionRecord, err := le.config.Lock.Get()
    if err != nil {
        // Any error other than NotFound: give up for this round.
        if !errors.IsNotFound(err) {
            glog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
            return false
        }
        // The record does not exist yet: create it with ourselves as leader.
        if err = le.config.Lock.Create(leaderElectionRecord); err != nil {
            glog.Errorf("error initially creating leader election record: %v", err)
            return false
        }
        le.observedRecord = leaderElectionRecord
        le.observedTime = le.clock.Now()
        return true
    }

    // 2. Record obtained, check the Identity & Time
    // The record in apiserver differs from what we last saw: update our local copy.
    if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
        le.observedRecord = *oldLeaderElectionRecord
        le.observedTime = le.clock.Now()
    }
    // Judged by our own observedTime, if the lease (15s) has not expired and we are
    // not the leader, there is nothing more we can do.
    if le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
        oldLeaderElectionRecord.HolderIdentity != le.config.Lock.Identity() {
        return false
    }

    // 3. We're going to try to update. The leaderElectionRecord is set to it's default
    // here. Let's correct it before updating.
    // Reaching this point means one of: (1) we are not the leader but the lease has expired,
    // (2) we are the leader and the lease has not expired, (3) we are the leader and the lease has expired.
    // In cases 2 and 3 we renew with the current time whether or not the lease has expired,
    // keeping acquireTime and the transition count; otherwise the transition count goes up by 1.
    if oldLeaderElectionRecord.HolderIdentity == le.config.Lock.Identity() {
        leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
        leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
    } else {
        leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
    }

    // update the lock itself
    // Send the update to apiserver, which makes concurrent updates atomic: its
    // resourceVersion check ensures that only one client's update succeeds.
    if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
        glog.Errorf("Failed to update lock: %v", err)
        return false
    }
    le.observedRecord = leaderElectionRecord
    le.observedTime = le.clock.Now()
    return true
}
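
The decision logic is easier to follow with concrete timestamps. Below is a small simulation, a sketch under simplifying assumptions: time is an integer number of seconds, the shared lock is an in-memory struct with no update conflicts, and all names (candidate, lock, record, simulate) are hypothetical. Candidate a takes the lock, renews once, then stops; candidate b can only take over once 15 seconds have passed since it last saw the record change.

```go
package main

import "fmt"

// record is a reduced election record; times are plain seconds.
type record struct {
	Holder      string
	RenewTime   int
	Transitions int
}

// lock is the shared state; rec is nil until the first candidate creates it.
type lock struct{ rec *record }

// candidate keeps what it last observed and when it observed it.
type candidate struct {
	id           string
	leaseSeconds int
	observed     record
	observedTime int
}

// tryAcquireOrRenew follows the same three steps as the excerpt above.
func (c *candidate) tryAcquireOrRenew(l *lock, now int) bool {
	next := record{Holder: c.id, RenewTime: now}

	// 1. No record yet: create it with ourselves as leader.
	if l.rec == nil {
		l.rec = &next
		c.observed, c.observedTime = next, now
		return true
	}

	// 2. The record changed since we last looked: restart our local lease timer.
	old := *l.rec
	if old != c.observed {
		c.observed, c.observedTime = old, now
	}
	if c.observedTime+c.leaseSeconds > now && old.Holder != c.id {
		return false // lease still valid and held by someone else
	}

	// 3. Renew (same holder) or take over (transition count goes up by one).
	if old.Holder == c.id {
		next.Transitions = old.Transitions
	} else {
		next.Transitions = old.Transitions + 1
	}
	l.rec = &next
	c.observed, c.observedTime = next, now
	return true
}

// simulate runs a fixed schedule: a leads, renews at t=4, then goes silent.
func simulate() ([]bool, record) {
	l := &lock{}
	a := &candidate{id: "a", leaseSeconds: 15}
	b := &candidate{id: "b", leaseSeconds: 15}
	results := []bool{
		a.tryAcquireOrRenew(l, 0),  // a creates the record
		b.tryAcquireOrRenew(l, 2),  // b observes a's record
		a.tryAcquireOrRenew(l, 4),  // a renews
		b.tryAcquireOrRenew(l, 6),  // b sees the renewal, timer restarts at 6
		b.tryAcquireOrRenew(l, 20), // 6+15 > 20, lease still valid
		b.tryAcquireOrRenew(l, 22), // lease expired from b's point of view
	}
	return results, *l.rec
}

func main() {
	results, final := simulate()
	fmt.Println(results, final.Holder, final.Transitions)
}
```

Note that b measures the lease from its own observedTime rather than from the RenewTime written by a, so the candidates do not need synchronized clocks.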

 

posted @ 2020-09-11 14:34  JL_Zhou