An adventure with Kubernetes eviction

1. Overview

  Kubernetes eviction's DiskPressure detection targets the kubelet's root-dir: the nodefs signals measure the filesystem that directory lives on. The default root-dir is /var/lib/kubelet, and it can be changed with the --root-dir flag. From the source:

    kubernetes/cmd/kubelet/app/options/options.go

   

const defaultRootDir = "/var/lib/kubelet"

fs.StringVar(&f.RootDirectory, "root-dir", f.RootDirectory, "Directory path for managing kubelet files (volume mounts,etc).")
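
The nodefs signals are therefore tied to whatever filesystem backs that directory. As a quick way to see the raw numbers the signal is derived from, here is a minimal Go sketch (Linux-only; the kubelet itself gathers these stats through its stats provider / cAdvisor rather than a direct statfs call):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    const rootDir = "/var/lib/kubelet" // the default root-dir

    // Stat the filesystem backing root-dir, the same filesystem the
    // nodefs eviction signal is computed from.
    var st syscall.Statfs_t
    if err := syscall.Statfs(rootDir, &st); err != nil {
        panic(err)
    }
    capacity := st.Blocks * uint64(st.Bsize)
    available := st.Bavail * uint64(st.Bsize)
    fmt.Printf("capacity=%d available=%d (%.1f%% free)\n",
        capacity, available, float64(available)/float64(capacity)*100)
}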

 kubernetes/pkg/kubelet/eviction/helpers.go

  

// diskUsage converts used bytes into a resource quantity.
func diskUsage(fsStats *statsapi.FsStats) *resource.Quantity {
    if fsStats == nil || fsStats.UsedBytes == nil {
        return &resource.Quantity{Format: resource.BinarySI}
    }
    usage := int64(*fsStats.UsedBytes)
    return resource.NewQuantity(usage, resource.BinarySI)
}

// rankDiskPressureFunc returns a rankFunc that measures the specified fs stats.
func rankDiskPressureFunc(fsStatsToMeasure []fsStatsType, diskResource v1.ResourceName) rankFunc {
    return func(pods []*v1.Pod, stats statsFunc) {
        orderedBy(exceedDiskRequests(stats, fsStatsToMeasure, diskResource), priority, disk(stats, fsStatsToMeasure, diskResource)).Sort(pods)
    }
}
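
rankDiskPressureFunc decides which pods are evicted first once the node is under disk pressure: pods that exceed their disk requests come first, then lower-priority pods, then heavier disk consumers. A toy stand-in for that three-key ordering (hypothetical podInfo struct and data, not the real API):

package main

import (
    "fmt"
    "sort"
)

// podInfo is a made-up stand-in for the inputs the real comparators
// (exceedDiskRequests, priority, disk) look at.
type podInfo struct {
    name           string
    exceedsRequest bool
    priority       int32
    diskUsedBytes  int64
}

func main() {
    pods := []podInfo{
        {"metrics-agent", false, 1000, 2 << 30},
        {"fluentd", true, 0, 7 << 30},
        {"web", false, 0, 1 << 30},
    }
    sort.SliceStable(pods, func(i, j int) bool {
        a, b := pods[i], pods[j]
        if a.exceedsRequest != b.exceedsRequest {
            return a.exceedsRequest // pods over their disk request go first
        }
        if a.priority != b.priority {
            return a.priority < b.priority // lower priority is evicted earlier
        }
        return a.diskUsedBytes > b.diskUsedBytes // then heavier disk users
    })
    for _, p := range pods {
        fmt.Println(p.name) // prints: fluentd, web, metrics-agent
    }
}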

kubernetes/pkg/kubelet/eviction/helpers.go (makeSignalObservations):

// build the nodefs signal observation from the summary stats
if nodeFs := summary.Node.Fs; nodeFs != nil {
    if nodeFs.AvailableBytes != nil && nodeFs.CapacityBytes != nil {
        result[evictionapi.SignalNodeFsAvailable] = signalObservation{
            available: resource.NewQuantity(int64(*nodeFs.AvailableBytes), resource.BinarySI),
            capacity:  resource.NewQuantity(int64(*nodeFs.CapacityBytes), resource.BinarySI),
            time:      nodeFs.Time,
        }
    }
}

These observations come from the kubelet summary API; the nodefs numbers are the Fs field of NodeStats:

kubernetes/pkg/kubelet/apis/stats/v1alpha1/types.go

type NodeStats struct {
    // Reference to the measured Node.
    NodeName string `json:"nodeName"`
    // Stats of system daemons tracked as raw containers.
    // The system containers are named according to the SystemContainer* constants.
    // +optional
    // +patchMergeKey=name
    // +patchStrategy=merge
    SystemContainers []ContainerStats `json:"systemContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
    // The time at which data collection for the node-scoped (i.e. aggregate) stats was (re)started.
    StartTime metav1.Time `json:"startTime"`
    // Stats pertaining to CPU resources.
    // +optional
    CPU *CPUStats `json:"cpu,omitempty"`
    // Stats pertaining to memory (RAM) resources.
    // +optional
    Memory *MemoryStats `json:"memory,omitempty"`
    // Stats pertaining to network resources.
    // +optional
    Network *NetworkStats `json:"network,omitempty"`
    // Stats pertaining to total usage of filesystem resources on the rootfs used by node k8s components.
    // NodeFs.Used is the total bytes used on the filesystem.
    // +optional
    Fs *FsStats `json:"fs,omitempty"`
    // Stats about the underlying container runtime.
    // +optional
    Runtime *RuntimeStats `json:"runtime,omitempty"`
    // Stats about the rlimit of system.
    // +optional
    Rlimit *RlimitStats `json:"rlimit,omitempty"`
}
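
The kubelet serves these stats from its summary endpoint (/stats/summary). A minimal sketch of decoding just the nodefs portion, using a hypothetical trimmed-down struct and made-up numbers (a real client would import the stats v1alpha1 types instead):

package main

import (
    "encoding/json"
    "fmt"
)

// summary keeps only the fields this sketch needs from the wire format above.
type summary struct {
    Node struct {
        NodeName string `json:"nodeName"`
        Fs       *struct {
            AvailableBytes *uint64 `json:"availableBytes"`
            CapacityBytes  *uint64 `json:"capacityBytes"`
        } `json:"fs"`
    } `json:"node"`
}

func main() {
    // Example payload shaped like a /stats/summary response (hypothetical numbers).
    raw := []byte(`{"node":{"nodeName":"node-1","fs":{"availableBytes":9663676416,"capacityBytes":107374182400}}}`)

    var s summary
    if err := json.Unmarshal(raw, &s); err != nil {
        panic(err)
    }
    fs := s.Node.Fs
    if fs == nil || fs.AvailableBytes == nil || fs.CapacityBytes == nil {
        fmt.Println("no nodefs stats")
        return
    }
    pct := float64(*fs.AvailableBytes) / float64(*fs.CapacityBytes) * 100
    fmt.Printf("%s nodefs available: %.1f%%\n", s.Node.NodeName, pct) // 9.0%
}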

 

2. The incident

   This happened a few months ago. Someone changed fluentd's parsing pattern. Our fluentd runs as a DaemonSet and mounts the hostPath /var/log, and its logs end up in syslog. After the change, every line that failed to match the pattern was dumped into /var/log/syslog, more than 7 GB written in a single hour. Disk usage shot straight up to 90%, while the eviction policy we had configured in the kubelet was:

  

evictionHard:
  imagefs.available: 15%
  memory.available: 100Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%

Eviction kicked in as soon as usage of the disk holding the kubelet's root-dir reached 90%, i.e. nodefs.available dropped below 10%. fluentd itself never reported an error: the pattern simply stopped matching and every unmatched line went to syslog. So whenever you run fluentd, set the log output path and log level deliberately. The sketch below makes the threshold arithmetic concrete.
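
A minimal Go sketch of the nodefs.available: 10% comparison; it mirrors, rather than reproduces, the eviction manager's check, and the 100Gi disk size is an assumption:

package main

import "fmt"

// nodefsPressure reports whether the nodefs.available: 10% hard threshold
// is crossed: the signal fires when available bytes drop below 10% of capacity.
func nodefsPressure(availableBytes, capacityBytes uint64) bool {
    threshold := capacityBytes / 10 // 10% of capacity
    return availableBytes < threshold
}

func main() {
    capacity := uint64(100 << 30) // assume a 100Gi system disk
    available := uint64(9 << 30)  // 91% used after the syslog flood
    fmt.Println(nodefsPressure(available, capacity)) // true: eviction begins
}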

 

3. Aftermath

Working through the source code gave us the conclusion above, and we restored service urgently. The monitoring gap: the system-disk alert threshold had not been reduced by the eviction threshold configured in the kubelet. With nodefs.available: 10%, eviction already starts at 90% usage, so the disk alert has to fire earlier than that (at 80%, say) to leave time to react. We re-planned the monitoring thresholds and gave the production nodes distinguishing labels so that different workloads are deployed onto different nodes.

posted @ 2019-06-05 16:13 诗码者