Initialization of Linux scheduling domains and scheduling groups
On systems with multiple CPUs, scheduling domains and scheduling groups are the foundation of process load balancing. This article describes how scheduling domains and scheduling groups are initialized on a multi-CPU x86 system.
What is a scheduling domain?
A large machine may have hundreds of CPUs, and these CPUs are not fully symmetric: SMT siblings share an L1 cache and some execution units, some CPUs share an L3 cache, the CPUs within one NUMA node share a memory bus, and above that there are even larger units such as sockets, of which a system may have several. The whole system therefore forms a hierarchical topology. Because of these physical levels, migrating load without any regard for them may not give the best result. To exploit this hierarchy information, a scheduling structure that mirrors the system topology was created: the scheduling domain, sched_domain.
What is a scheduling group?
Load balancing generally happens within a scheduling domain. Deciding whether the load within a domain is balanced requires comparing its CPU subsets against each other; those subsets are the scheduling groups. The subsets are not chosen arbitrarily: normally the scheduling groups of a domain cover the same CPU ranges as its child domains. This makes sense, because asking whether a domain is balanced is really asking whether its child domains are balanced against each other.
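To make the two concepts concrete, here is a purely hypothetical example (CPU numbering invented for illustration): a machine with 2 packages, 2 cores per package and 2 SMT threads per core, drawn in the style of the kernel's own topology comments. Note how each domain's groups correspond to its child domains:

/*
 * Hypothetical 2-package x 2-core x 2-thread machine, CPUs 0-7:
 *
 * PKG domain of cpu0:  span 0-7,  groups {0-3},{4-7}   - each group = one MC child domain
 * MC  domain of cpu0:  span 0-3,  groups {0-1},{2-3}   - each group = one SMT child domain
 * SMT domain of cpu0:  span 0-1,  groups {0},{1}       - lowest level: one CPU per group
 */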
Data structures for scheduling domains and scheduling groups
sched_domain_topology_level
struct sched_domain_topology_level {
	sched_domain_mask_f mask;
	sched_domain_flags_f sd_flags;
	int		    flags;
	int		    numa_level;
	struct sd_data      data;
#ifdef CONFIG_SCHED_DEBUG
	char                *name;
#endif
};
sd_data
struct sd_data {
	struct sched_domain *__percpu *sd;
	struct sched_domain_shared *__percpu *sds;
	struct sched_group *__percpu *sg;
	struct sched_group_capacity *__percpu *sgc;
};
sched_domain
struct sched_domain {
	/* These fields must be setup */
	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
	struct sched_domain __rcu *child;	/* bottom domain must be null terminated */
	struct sched_group *groups;		/* the balancing groups of the domain */
	unsigned long min_interval;		/* Minimum balance interval ms */
	unsigned long max_interval;		/* Maximum balance interval ms */
	unsigned int busy_factor;		/* less balancing by factor if busy */
	unsigned int imbalance_pct;		/* No balance until over watermark */
	unsigned int cache_nice_tries;		/* Leave cache hot tasks for # tries */
	unsigned int imb_numa_nr;		/* Nr running tasks that allows a NUMA imbalance */

	int nohz_idle;				/* NOHZ IDLE status */
	int flags;				/* See SD_* */
	int level;

	/* Runtime fields. */
	unsigned long last_balance;		/* init to jiffies. units in jiffies */
	unsigned int balance_interval;		/* initialise to 1. units in ms. */
	unsigned int nr_balance_failed;		/* initialise to 0 */

	/* idle_balance() stats */
	u64 max_newidle_lb_cost;
	unsigned long last_decay_max_lb_cost;

	...

#ifdef CONFIG_SCHED_DEBUG
	char *name;
#endif
	union {
		void *private;			/* used during construction */
		struct rcu_head rcu;		/* used during destruction */
	};
	struct sched_domain_shared *shared;

	unsigned int span_weight;
	/*
	 * Span of all CPUs in this domain.
	 *
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long span[];
};
sched_group
struct sched_group {
	struct sched_group	*next;			/* Must be a circular list */
	atomic_t		ref;

	unsigned int		group_weight;
	unsigned int		cores;
	struct sched_group_capacity *sgc;
	int			asym_prefer_cpu;	/* CPU of highest priority in group */
	int			flags;

	/*
	 * The CPUs this group covers.
	 *
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long		cpumask[];
};
sched_group_capacity
struct sched_group_capacity {
	atomic_t		ref;
	/*
	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
	 * for a single CPU.
	 */
	unsigned long		capacity;
	unsigned long		min_capacity;	/* Min per-CPU capacity in group */
	unsigned long		max_capacity;	/* Max per-CPU capacity in group */
	unsigned long		next_update;
	int			imbalance;	/* XXX unrelated to capacity but shared group state */

#ifdef CONFIG_SCHED_DEBUG
	int			id;
#endif

	unsigned long		cpumask[];	/* Balance mask */
};
Initialization of scheduling domains and scheduling groups
The initialization naturally happens during kernel boot. Along start_kernel->rest_init->kernel_init->kernel_init_freeable there are two relevant functions: smp_init and sched_init_smp.
On x86, smp_init brings up the secondary CPUs and at the end (via native_smp_cpus_done) calls build_sched_topology; sched_init_smp later performs the NUMA and domain setup described below.
build_sched_topology initializes sched_domain_topology, a global array of struct sched_domain_topology_level that is consumed later when the sched_domains themselves are built.
static void __init build_sched_topology(void)
{
	int i = 0;

#ifdef CONFIG_SCHED_SMT
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
	};
#endif
#ifdef CONFIG_SCHED_CLUSTER
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
	};
#endif
#ifdef CONFIG_SCHED_MC
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC)
	};
#endif
	/*
	 * When there is NUMA topology inside the package skip the PKG domain
	 * since the NUMA domains will auto-magically create the right spanning
	 * domains based on the SLIT.
	 */
	if (!x86_has_numa_in_package) {
		x86_topology[i++] = (struct sched_domain_topology_level){
			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
		};
	}

	/*
	 * There must be one trailing NULL entry left.
	 */
	BUG_ON(i >= ARRAY_SIZE(x86_topology)-1);

	set_sched_topology(x86_topology);
}
Let's look at the levels mentioned here: SMT, cluster, and MC. SMT covers the hyper-threads within one core. Cluster covers the CPUs that share an L2 cache; on x86, SMT siblings typically already share L1 and L2, so this domain coincides with the SMT domain and on most x86 machines you will not see a separate CLS level. MC covers the CPUs that share an L3 cache. On Intel machines a NUMA node often has a single LLC, so this domain may span a whole node; on AMD machines one node may contain several LLCs, so this domain sits below the NUMA level.
If there is no NUMA topology inside the package, all of the package's CPUs are put into one final domain, PKG.
The initialization here is clearly not complete: it only covers the SMT, cluster and MC domains, plus possibly PKG; the NUMA domains are initialized elsewhere. x86_topology is a global array.
static struct sched_domain_topology_level x86_topology[6];
It holds at most 6 levels, and the last entry must stay NULL as a terminator. x86_topology is then installed as sched_domain_topology.
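The trailing NULL entry is what terminates iteration over the levels: the iterator macro in kernel/sched/topology.c simply stops at the first level whose mask is NULL.

#define for_each_sd_topology(tl)			\
	for (tl = sched_domain_topology; tl->mask; tl++)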
The NUMA domains are initialized in sched_init_numa. The function is worth reading in full; it is long, so only parts are shown here. It first determines how many distinct distances exist between NUMA nodes; that count becomes the number of NUMA domain levels.
void sched_init_numa(int offline_node)
{
	struct sched_domain_topology_level *tl;
	unsigned long *distance_map;
	int nr_levels = 0;
	int i, j;
	int *distances;
	struct cpumask ***masks;

	/*
	 * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
	 * unique distances in the node_distance() table.
	 */
	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
	if (!distance_map)
		return;

	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
	for_each_cpu_node_but(i, offline_node) {
		for_each_cpu_node_but(j, offline_node) {
			int distance = node_distance(i, j);

			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
				sched_numa_warn("Invalid distance value range");
				bitmap_free(distance_map);
				return;
			}
			/* Record the distance in distance_map; equal distances share one bit. */
			bitmap_set(distance_map, distance, 1);
		}
	}
	/*
	 * We can now figure out how many unique distance values there are and
	 * allocate memory accordingly.
	 */
	/* The number of distinct distances becomes the number of NUMA levels. */
	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
Then it builds the NUMA topology levels based on that count.
	rcu_assign_pointer(sched_domains_numa_masks, masks);
	/* ... the code that computes a cpumask for each distance level is omitted ... */

	/* Compute default topology size */
	/* sched_domain_topology was initialized earlier: the SMT/CLS/MC levels above. */
	for (i = 0; sched_domain_topology[i].mask; i++);

	/* The NUMA levels are appended, so a new array of the right size is allocated. */
	tl = kzalloc((i + nr_levels + 1) *
			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
	if (!tl)
		return;

	/*
	 * Copy the default topology bits..
	 */
	for (i = 0; sched_domain_topology[i].mask; i++)
		tl[i] = sched_domain_topology[i];

	/*
	 * Add the NUMA identity distance, aka single NODE.
	 */
	/* The lowest NUMA level is the NODE domain: the local node itself. */
	tl[i++] = (struct sched_domain_topology_level){
		.mask = sd_numa_mask,
		.numa_level = 0,
		SD_INIT_NAME(NODE)
	};

	/*
	 * .. and append 'j' levels of NUMA goodness.
	 */
	/* Every level that spans multiple nodes is named NUMA. */
	for (j = 1; j < nr_levels; i++, j++) {
		tl[i] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.sd_flags = cpu_numa_flags,
			.flags = SDTL_OVERLAP,
			.numa_level = j,
			SD_INIT_NAME(NUMA)
		};
	}

	sched_domain_topology_saved = sched_domain_topology;
	/* sched_domain_topology no longer points to the original static array. */
	sched_domain_topology = tl;
As you can see, a single node forms the NODE level, while every level that spans multiple nodes is named NUMA. On a two-socket machine, for example, the kernel only distinguishes distances between nodes, so the socket level simply shows up as another NUMA level. Once the topology table is built it is assigned to the global variable sched_domain_topology, which from then on no longer points to the original static array.
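To make the distance de-duplication concrete, here is a small hypothetical user-space sketch (not kernel code; all names invented) that applies the same counting idea to the diameter-3 distance table used later in this article:

#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4

/* Example SLIT-style distance table (same as the diameter-3 example below). */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

int main(void)
{
	bool seen[256] = { false };	/* stands in for the kernel's bitmap */
	int nr_levels = 0;

	/* Mark every distance that occurs; duplicates land on the same slot. */
	for (int i = 0; i < NR_NODES; i++)
		for (int j = 0; j < NR_NODES; j++)
			seen[node_distance[i][j]] = true;

	/* Count the distinct distances; the kernel uses bitmap_weight(). */
	for (int d = 0; d < 256; d++)
		if (seen[d])
			nr_levels++;

	printf("nr_levels = %d\n", nr_levels);	/* prints 4: distances 10, 20, 30, 40 */
	return 0;
}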
Initializing the sched domains
sched_init_domains calls build_sched_domains to initialize the sched_domains.
int __init sched_init_domains(const struct cpumask *cpu_map)
{
	int err;

	ndoms_cur = 1;
	doms_cur = alloc_sched_domains(ndoms_cur);
	if (!doms_cur)
		doms_cur = &fallback_doms;
	cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
	err = build_sched_domains(doms_cur[0], NULL);

	return err;
}
doms_cur excludes isolated CPUs, so isolated CPUs never take part in load balancing.
build_sched_domains is the key function for initializing sched domains. Let's walk through it.
static int build_sched_domains(const struct cpumask *cpu_map,
			       struct sched_domain_attr *attr)
{
	...
	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
	...
}

static enum s_alloc
__visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
{
	memset(d, 0, sizeof(*d));

	if (__sdt_alloc(cpu_map))
		return sa_sd_storage;
	d->sd = alloc_percpu(struct sched_domain *);
	if (!d->sd)
		return sa_sd_storage;
	d->rd = alloc_rootdomain();
	if (!d->rd)
		return sa_sd;

	return sa_rootdomain;
}
It first calls __visit_domain_allocation_hell, which uses __sdt_alloc to allocate, for every CPU in cpu_map, the per-CPU sched_domain, sched_group, sched_group_capacity and sched_domain_shared structures.
Next, the domain topology is built for each CPU.
	/* Set up domains for CPUs specified by the cpu_map: */
	for_each_cpu(i, cpu_map) {
		struct sched_domain_topology_level *tl;

		sd = NULL;
		for_each_sd_topology(tl) {
			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
				goto error;

			sd = build_sched_domain(tl, cpu_map, attr, sd, i);

			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;

			if (tl == sched_domain_topology)
				*per_cpu_ptr(d.sd, i) = sd;
			if (tl->flags & SDTL_OVERLAP)
				sd->flags |= SD_OVERLAP;
			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
				break;
		}
	}
This is a two-level loop: the outer loop iterates over CPUs, the inner one over the topology levels. So sched_domain is a per-CPU variable, one instance per CPU. Each inner iteration calls build_sched_domain to create the domain for that level. Note that only the lowest-level domain is stored in the CPU's sched_domain per-CPU pointer; the domains form a hierarchy, and walking upward from the bottom via ->parent yields all domains of that CPU.
Next, the scheduling groups are built.
	/* Build the groups for the domains */
	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			sd->span_weight = cpumask_weight(sched_domain_span(sd));
			if (sd->flags & SD_OVERLAP) {
				if (build_overlap_sched_groups(sd, i))
					goto error;
			} else {
				if (build_sched_groups(sd, i))
					goto error;
			}
		}
	}
Every domain of every CPU has its own set of scheduling groups, hence another two-level loop. Depending on whether the domain overlaps, either build_overlap_sched_groups or build_sched_groups is called.
Then comes an optimization for NUMA nodes with multiple LLCs, which we skip, followed by the initialization of sched_group_capacity.
	/* Calculate CPU capacity for physical packages and nodes */
	for (i = nr_cpumask_bits-1; i >= 0; i--) {
		if (!cpumask_test_cpu(i, cpu_map))
			continue;

		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			claim_allocations(i, sd);
			init_sched_groups_capacity(i, sd);
		}
	}
Again there is one such structure per domain, and again two nested loops: CPUs on the outside, the domain hierarchy on the inside. claim_allocations sets the per-CPU pointers in sd_data back to NULL for every structure that is actually referenced, so they will not be freed by the cleanup path afterwards.
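For reference, claim_allocations is roughly the following in current kernels: it clears the sd_data slots of every structure that holds a reference, so that __sdt_free() will skip them.

static void claim_allocations(int cpu, struct sched_domain *sd)
{
	struct sd_data *sdd = sd->private;

	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
	*per_cpu_ptr(sdd->sd, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
		*per_cpu_ptr(sdd->sds, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
		*per_cpu_ptr(sdd->sg, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
}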
init_sched_groups_capacity computes the cores and capacity of the scheduling groups seen from the current CPU.
Where does capacity come from?
How a CPU's capacity is obtained is architecture specific. On systems without heterogeneous CPUs it is generally 1024. init_sched_groups_capacity ends up calling update_cpu_capacity to initialize the capacity of each sched_group_capacity.
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
	unsigned long capacity = scale_rt_capacity(cpu);
	struct sched_group *sdg = sd->groups;

	if (!capacity)
		capacity = 1;

	cpu_rq(cpu)->cpu_capacity = capacity;
	trace_sched_cpu_capacity_tp(cpu_rq(cpu));

	sdg->sgc->capacity = capacity;
	sdg->sgc->min_capacity = capacity;
	sdg->sgc->max_capacity = capacity;
}
scale_rt_capacity computes the CPU's capacity.
static unsigned long scale_rt_capacity(int cpu)
{
	unsigned long max = get_actual_cpu_capacity(cpu);
	struct rq *rq = cpu_rq(cpu);
	unsigned long used, free;
	unsigned long irq;

	irq = cpu_util_irq(rq);

	if (unlikely(irq >= max))
		return 1;

	/*
	 * avg_rt.util_avg and avg_dl.util_avg track binary signals
	 * (running and not running) with weights 0 and 1024 respectively.
	 */
	used = cpu_util_rt(rq);
	used += cpu_util_dl(rq);

	if (unlikely(used >= max))
		return 1;

	free = max - used;

	return scale_irq_capacity(free, irq, max);
}
It obtains the CPU's maximum capacity from get_actual_cpu_capacity, subtracts the utilization of the rt and dl scheduling classes, and scales the remainder by the fraction of time not consumed by IRQ handling to get the final capacity.
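The final proportional step is done by scale_irq_capacity; in current kernels it is essentially:

static inline
unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
				 unsigned long max)
{
	util *= (max - irq);	/* scale by the fraction of time not spent in IRQ */
	util /= max;

	return util;
}

As a worked example, assume max = 1024, irq = 128 and used = 96 (rt + dl): free = 1024 - 96 = 928, and the result is 928 * (1024 - 128) / 1024 = 812.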
get_actual_cpu_capacity in turn reads the value from arch_scale_cpu_capacity.
unsigned long arch_scale_cpu_capacity(int cpu)
{
	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);

	return SCHED_CAPACITY_SCALE;
}
So, unless hybrid capacity scaling is enabled, arch_scale_cpu_capacity returns SCHED_CAPACITY_SCALE, which is 1024 in the kernel.
Next, the domains are attached.
	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		rq = cpu_rq(i);
		sd = *per_cpu_ptr(d.sd, i);

		cpu_attach_domain(sd, d.rd, i);

		if (lowest_flag_domain(i, SD_CLUSTER))
			has_cluster = true;
	}
cpu_attach_domain is called for every CPU. It sanity-checks the CPU's sched_domain hierarchy and removes degenerate domains; for example, a domain containing a single CPU has no reason to exist. It then adds the CPU to the root_domain and stores that in the rq's rd field. Finally, update_top_cache_domain refreshes the topology-derived per-CPU shortcuts, including:
per_cpu(sd_llc, cpu)              // the LLC (last-level cache) domain
per_cpu(sd_llc_size, cpu)         // number of CPUs in the LLC domain
per_cpu(sd_llc_id, cpu)           // representative CPU id of the LLC domain
per_cpu(sd_llc_shared, cpu)       // shared state of the LLC domain
per_cpu(sd_share_id, cpu)         // representative id of the cluster/LLC domain
per_cpu(sd_numa, cpu)             // the NUMA domain
per_cpu(sd_asym_packing, cpu)     // asymmetric packing domain
per_cpu(sd_asym_cpucapacity, cpu) // asymmetric CPU capacity domain
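These per-CPU shortcuts exist for hot paths. A typical consumer is the wakeup fast path in kernel/sched/fair.c, which looks up the LLC domain of the target CPU without walking the ->parent chain; roughly (a sketch, details vary across kernel versions; p, target and has_idle_core come from the caller):

	struct sched_domain *sd;

	rcu_read_lock();
	/* Fetch the LLC domain of the target CPU directly. */
	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (sd)
		/* Search the LLC span for an idle CPU to place the task on. */
		i = select_idle_cpu(p, sd, has_idle_core, target);
	rcu_read_unlock();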
That completes the overview of build_sched_domains; now let's take a closer look at build_sched_domain and the group-building code.
Building a scheduling domain: build_sched_domain
static struct sched_domain *
build_sched_domain(struct sched_domain_topology_level *tl,
		   const struct cpumask *cpu_map,
		   struct sched_domain_attr *attr,
		   struct sched_domain *child, int cpu)
{
	struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);

	if (child) {
		sd->level = child->level + 1;
		sched_domain_level_max = max(sched_domain_level_max, sd->level);
		child->parent = sd;
	}
	set_domain_attribute(sd, attr);

	return sd;
}
It has three parts: sd_init initializes the sched_domain, then the parent/child hierarchy is linked, and finally the attributes are set.
sd_init
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
	const struct cpumask *cpu_map,
	struct sched_domain *child, int cpu)
{
	/* ... local variable declarations elided ... */

#ifdef CONFIG_NUMA
	/*
	 * Ugly hack to pass state to sd_numa_mask()...
	 */
	sched_domains_curr_level = tl->numa_level;
#endif

	sd_weight = cpumask_weight(tl->mask(cpu));

	if (tl->sd_flags)
		sd_flags = (*tl->sd_flags)();
	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
			"wrong sd_flags in topology description\n"))
		sd_flags &= TOPOLOGY_SD_FLAGS;

	*sd = (struct sched_domain){
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		.busy_factor		= 16,
		.imbalance_pct		= 117,

		.cache_nice_tries	= 0,

		.flags			= 1*SD_BALANCE_NEWIDLE
					| 1*SD_BALANCE_EXEC
					| 1*SD_BALANCE_FORK
					| 0*SD_BALANCE_WAKE
					| 1*SD_WAKE_AFFINE
					| 0*SD_SHARE_CPUCAPACITY
					| 0*SD_SHARE_LLC
					| 0*SD_SERIALIZE
					| 1*SD_PREFER_SIBLING
					| 0*SD_NUMA
					| sd_flags
					,

		.last_balance		= jiffies,
		.balance_interval	= sd_weight,
		.max_newidle_lb_cost	= 0,
		.last_decay_max_lb_cost	= jiffies,
		.child			= child,
		.name			= tl->name,
	};
As you can see, sd_init initializes part of the sched_domain fields. min_interval depends on the number of CPUs in the domain; busy_factor, imbalance_pct and balance_interval are all given their initial values.
	if (sd->flags & SD_SHARE_CPUCAPACITY) {
		sd->imbalance_pct = 110;

	} else if (sd->flags & SD_SHARE_LLC) {
		sd->imbalance_pct = 117;
		sd->cache_nice_tries = 1;

#ifdef CONFIG_NUMA
	} else if (sd->flags & SD_NUMA) {
		sd->cache_nice_tries = 2;

		sd->flags &= ~SD_PREFER_SIBLING;
		sd->flags |= SD_SERIALIZE;
		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}

#endif
	} else {
		sd->cache_nice_tries = 1;
	}
Here cache_nice_tries is tuned per level: the higher the level, the larger the value; SD_NUMA levels get 2, the MC level gets 1. Domains that share an LLC get a larger imbalance_pct, which reduces task migration within the LLC. SD_NUMA domains lose SD_PREFER_SIBLING and gain SD_SERIALIZE; the latter serializes load balancing across these domains. There is one more special tweak: when the NUMA distance at this level exceeds node_reclaim_distance, SD_BALANCE_EXEC, SD_BALANCE_FORK and SD_WAKE_AFFINE are cleared. In other words, the nodes in such a domain are too far apart, so it is not a suitable place to look for a CPU on wakeup or when creating a new process.
	if (sd->flags & SD_SHARE_LLC) {
		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
		atomic_inc(&sd->shared->ref);
		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
	}
Domains that share an LLC also get a sched_domain_shared structure, with nr_busy_cpus initialized to the domain's weight (why start at the maximum?).
Next, let's see how scheduling groups are built.
Two functions are involved, build_overlap_sched_groups and build_sched_groups, and which one runs depends on the SD_OVERLAP flag. What does overlap mean? We know groups are subsets of a domain and a domain has several groups. The natural question is whether these subsets can intersect. They can; we may cover this in detail another time, and here only state the conclusion: a domain whose groups intersect is tagged SD_OVERLAP. The tag is applied right after the domain is built, whenever the current topology level carries the SDTL_OVERLAP flag.
build_sched_groups is the simpler one, so we start with it.
static int
build_sched_groups(struct sched_domain *sd, int cpu)
{
	struct sched_group *first = NULL, *last = NULL;
	struct sd_data *sdd = sd->private;
	const struct cpumask *span = sched_domain_span(sd);
	struct cpumask *covered;
	int i;

	lockdep_assert_held(&sched_domains_mutex);
	covered = sched_domains_tmpmask;

	cpumask_clear(covered);

	for_each_cpu_wrap(i, span, cpu) {
		struct sched_group *sg;

		if (cpumask_test_cpu(i, covered))
			continue;

		sg = get_group(i, sdd);

		cpumask_or(covered, covered, sched_group_span(sg));

		if (!first)
			first = sg;
		if (last)
			last->next = sg;
		last = sg;
	}
	last->next = first;
	sd->groups = first;

	return 0;
}
The core logic is in the code above. It iterates over every CPU of the domain, fetches that CPU's group via get_group, and links the groups into a circular list. The domain's groups member points to its first group. One key point: the first group, the one first points to, corresponds to the domain's child domain. Don't confuse the two: a group and a domain are different structures; they are "equal" only in the sense that they span the same CPU set. The covered mask records the CPUs already placed into some group; a CPU that is already covered is skipped, which keeps duplicate groups out of the list and guarantees the groups do not overlap. What remains is the get_group function.
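The circular list is what the load balancer later walks; the canonical iteration pattern (used, for example, in update_sd_lb_stats) looks like this sketch:

	struct sched_group *sg = sd->groups;

	do {
		/* inspect this group's load here, e.g. via update_sg_lb_stats() */
		sg = sg->next;
	} while (sg != sd->groups);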
get_group is an important function for understanding where scheduling groups come from. It carries a very detailed comment that is well worth reading; briefly summarized:
1. There are three key objects: the scheduling domain, the scheduling group and sched_group_capacity. Domains are traversed vertically, groups horizontally, and all CPU members of a group share one sched_group_capacity.
2. A domain's first scheduling group corresponds to its child domain.
3. When the CPU topology has no overlap, a group only needs to be created for the first CPU of each group. That is exactly what the covered mask in the code above achieves.
Now the code.
static struct sched_group *get_group(int cpu, struct sd_data *sdd)
{
	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
	struct sched_domain *child = sd->child;
	struct sched_group *sg;
	bool already_visited;

	if (child)
		cpu = cpumask_first(sched_domain_span(child));

	sg = *per_cpu_ptr(sdd->sg, cpu);
	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

	/* Increase refcounts for claim_allocations: */
	already_visited = atomic_inc_return(&sg->ref) > 1;
	/* sgc visits should follow a similar trend as sg */
	WARN_ON(already_visited != (atomic_inc_return(&sg->sgc->ref) > 1));

	/* If we have already visited that group, it's already initialized. */
	if (already_visited)
		return sg;

	if (child) {
		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
		sg->flags = child->flags;
	} else {
		cpumask_set_cpu(cpu, sched_group_span(sg));
		cpumask_set_cpu(cpu, group_balance_mask(sg));
	}

	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;

	return sg;
}
The code is short and fairly easy to follow. Its parameters are a CPU and the sd_data. As mentioned earlier, each group is a per-CPU variable, one per CPU. The function first takes the domain's child domain and, if it exists, replaces cpu with the first CPU of the child's span, so the group is created only for that first CPU. sgc, the sched_group_capacity, is also per-CPU and hangs off the group. When a child exists, the child domain's CPU set is copied into the group's span; that is where "the group is the child domain" comes from. Without a child, the domain is at the lowest level and its group contains only the current CPU. There is also group_balance_mask, stored in sched_group_capacity, which here matches the group's cpumask. Note that during load balancing not every CPU of a group participates; a single CPU is chosen, generally the first idle one. Groups also carry flags with the same meaning as domain flags; a group's flags are taken from the domain's child domain.
Finally, the capacity-related members of the sgc are initialized. CPU capacity was described earlier; since every CPU here has the same capacity, the group's total capacity is simply SCHED_CAPACITY_SCALE times the number of CPUs in the group, e.g. 4096 for a 4-CPU group.
With the non-overlapping case covered, you should now have a clear picture of the domain/group topology. The easiest thing to mix up is the child domain versus the group: they share a cpumask, but they remain distinct concepts with different jobs.
Non-overlapping domains build their groups from child domains. For overlapping domains, relying on the child domains goes wrong. An example is given in the comment of build_overlap_sched_groups, reproduced below.
/*
 * Usually we build sched_group by sibling's child sched_domain
 * But for machines whose NUMA diameter are 3 or above, we move
 * to build sched_group by sibling's proper descendant's child
 * domain because sibling's child sched_domain will span out of
 * the sched_domain being built as below.
 *
 * Smallest diameter=3 topology is:
 *
 *   node   0   1   2   3
 *     0:  10  20  30  40
 *     1:  20  10  20  30
 *     2:  30  20  10  20
 *     3:  40  30  20  10
 *
 *   0 --- 1 --- 2 --- 3
 *
 * NUMA-3       0-3             N/A             N/A             0-3
 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
 *
 * NUMA-2       0-2             0-3             0-3             1-3
 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
 *
 * NUMA-1       0-1             0-2             1-3             2-3
 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
 *
 * NUMA-0       0               1               2               3
 *
 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
 * group span isn't a subset of the domain span.
 */
The example has 4 NUMA nodes whose pairwise distances are not fully symmetric, so the per-node domains end up asymmetric as well. The group members shown in the figure are the correct ones. Suppose we built groups from sibling child domains: for node 0's NUMA-2 domain, which spans nodes 0-2, node 0 contributes group 0-1, which is fine; but node 2's child domain spans 1-3, which is wrong, because node 0's NUMA-2 domain only spans 0-2 and does not contain node 3. Such a group would be invalid. In other words, the recipe used for non-overlapping domains cannot be reused here.
Let's see how the kernel solves this, namely the implementation of find_descended_sibling.
static struct sched_domain *
find_descended_sibling(struct sched_domain *sd, struct sched_domain *sibling)
{
	/*
	 * The proper descendant would be the one whose child won't span out
	 * of sd
	 */
	while (sibling->child &&
	       !cpumask_subset(sched_domain_span(sibling->child),
			       sched_domain_span(sd)))
		sibling = sibling->child;

	/*
	 * As we are referencing sgc across different topology level, we need
	 * to go down to skip those sched_domains which don't contribute to
	 * scheduling because they will be degenerated in cpu_attach_domain
	 */
	while (sibling->child &&
	       cpumask_equal(sched_domain_span(sibling->child),
			     sched_domain_span(sibling)))
		sibling = sibling->child;

	return sibling;
}
It is easy to follow: walk downward until a descendant domain is found whose span fits inside the current domain. Note that the current domain and this descendant domain belong to different CPUs; they are siblings.
Once the right descendant is found, build_group_from_child_sched_domain creates the group.
static struct sched_group *
build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
{
	struct sched_group *sg;
	struct cpumask *sg_span;

	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
			GFP_KERNEL, cpu_to_node(cpu));

	if (!sg)
		return NULL;

	sg_span = sched_group_span(sg);
	if (sd->child) {
		cpumask_copy(sg_span, sched_domain_span(sd->child));
		sg->flags = sd->child->flags;
	} else {
		cpumask_copy(sg_span, sched_domain_span(sd));
	}

	atomic_inc(&sg->ref);
	return sg;
}
One optimization here: the sg is allocated on the node of the given CPU, so accessing it never goes remote. The main logic is still copying the child domain's cpumask into the sg's cpumask. A question arises: the argument is already the descendant found above, so why use its child here? Because the descendant we searched for was compared against the current domain. In the non-overlapping case the found descendant would be the current domain itself (in which case no search was needed at all). For a lowest-level domain without a child, the domain's own cpumask is copied into the sg (can that go wrong?).
Afterwards, init_overlap_sched_group initializes the balance cpumask and the capacity.
static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg)
{
	struct cpumask *mask = sched_domains_tmpmask2;
	struct sd_data *sdd = sd->private;
	struct cpumask *sg_span;
	int cpu;

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first(mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));

	/*
	 * Initialize sgc->capacity such that even if we mess up the
	 * domains and no possible iteration will get us here, we won't
	 * die on a /0 trap.
	 */
	sg_span = sched_group_span(sg);
	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
build_balance_mask filters out the CPUs whose own child domain does not match the group's cpumask.
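For reference, build_balance_mask is roughly the following in current kernels: a CPU is kept in the balance mask only if its own sibling domain has a child whose span equals the group span.

static void
build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
{
	const struct cpumask *sg_span = sched_group_span(sg);
	struct sd_data *sdd = sd->private;
	struct sched_domain *sibling;
	int i;

	cpumask_clear(mask);

	for_each_cpu(i, sg_span) {
		sibling = *per_cpu_ptr(sdd->sd, i);

		/*
		 * Can happen in the asymmetric case, where these siblings are
		 * unused. The mask will not be empty because those CPUs that
		 * do have the top domain _should_ span the domain.
		 */
		if (!sibling->child)
			continue;

		/* If we would not end up here, we can't continue from here */
		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
			continue;

		cpumask_set_cpu(i, mask);
	}

	/* We must not have empty masks here */
	WARN_ON_ONCE(cpumask_empty(mask));
}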
Once built, the groups are chained into a circular list just as before, and group construction is complete.
That concludes the analysis of the scheduling domain and scheduling group initialization code.