Translated from: https://lwn.net/Articles/807882/
The Linux control-group mechanism was designed to make it easy to assign processes to groups or move them around; it is a simple matter of writing a process ID to the appropriate cgroup.procs file in the control-group filesystem hierarchy. That only works for processes that actually exist, though. Adding the ability to place a new process into a control group at birth is the subject of this patch set from Christian Brauner.
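In Linux, adding a process to a cgroup or moving it out of one is very simple: just write the process's PID to the cgroup.procs file at the corresponding level of the cgroup filesystem. This only works for processes that have already been created, though, because we need to know the PID. Is it possible to control which control group a process lands in without knowing its PID?
The recent patch series from Christian Brauner takes up exactly this topic: creating a new task directly in a specified control group.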
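The existing "write a PID to cgroup.procs" step can be sketched in C as follows. This is a minimal illustration, not code from the patch set; the helper name move_pid_to_cgroup() and the convention of passing the cgroup directory as a path are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: move an already-existing process into the control
 * group whose directory is cgroup_dir by writing its PID to the
 * cgroup.procs file inside that directory.
 * Returns 0 on success, -1 on failure.
 */
int move_pid_to_cgroup(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    int n;
    FILE *f;
    int ok;

    assert(cgroup_dir != NULL);

    n = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */

    f = fopen(path, "w");
    if (f == NULL)
        return -1;              /* no such group, or no permission */

    ok = fprintf(f, "%d\n", (int)pid) > 0;

    /* Buffered writes may only report an error when the stream is closed. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Note that the helper needs a PID, so it can only be called after the process has been created.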
Before getting into the details, one might naturally ask why this capability is needed, given that Linux has done without it since control groups were first created. The answer is that current kernels present a bit of a race condition for process managers using control groups. A manager can create a process, and it can assign it to a group, but it has little control over what happens between those two actions. That means, for example, that a process might run briefly before being placed into a group where its resource usage can be accounted for properly. The amount of error introduced is likely to be small, but people who are into accounting tend to be intensely irritated by such things.
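Before digging into the details of the patch, you may wonder why this feature is needed at all; after all, cgroups had been running in Linux without trouble for many years before it.
The answer is that real-world needs drive it. With the current kernel implementation, a process manager that uses cgroups faces a race: it creates a process and then immediately adds that task to the target cgroup, but the two operations are not atomic, so there is a window during which the target cgroup cannot account for the new process.
This also means that a process may run for a short while before it joins the target cgroup and its resources can be accounted for. The resources (such as CPU time) that the target cgroup fails to account for during that window are small, but this is still very annoying for anyone doing the accounting.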
Placing a process into its intended control group at birth avoids that kind of messiness. It also enables actions like creating a process in a frozen group, ensuring that it will start in a frozen state and not run at all until the process manager allows it to. Finally, as Brauner notes, this feature "simplifies container creation and exec logic quite a bit".
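Imagine being able to place the new task into the target cgroup as part of creating it: that would solve this kind of problem cleanly. For example, the new task could be created in a frozen control group, so that it stays frozen until the process manager allows it to run. In addition, as Brauner points out, this feature "simplifies container creation and exec logic quite a bit".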
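The article does not say how a group becomes frozen. The sketch below assumes the cgroup-v2 freezer interface, in which writing "1" to a group's cgroup.freeze file freezes it and writing "0" thaws it; the helper name set_cgroup_frozen() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: freeze (frozen != 0) or thaw (frozen == 0) the
 * control group at cgroup_dir by writing to its cgroup.freeze file.
 * Assumes a cgroup-v2 hierarchy with freezer support.
 * Returns 0 on success, -1 on failure.
 */
int set_cgroup_frozen(const char *cgroup_dir, int frozen)
{
    char path[4096];
    int n;
    FILE *f;
    int ok;

    assert(cgroup_dir != NULL);

    n = snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fputs(frozen ? "1\n" : "0\n", f) >= 0;

    /* Buffered writes may only report an error when the stream is closed. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A process manager could freeze the group first, create the child inside it, finish its setup, and only then thaw the group.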
Cloning a new process directly into a control group is only available with version-2 control groups, and it can only be invoked with the brand-new clone3() system call, which has this interface:
Cloning a new task directly into a target cgroup is supported only on cgroup v2, and it requires the newly introduced clone3() system call, whose prototype is as follows:
int clone3(struct clone_args *args, size_t size);
The decision to put all of the arguments to clone3() into a structure simplifies the prototype, but it also adds extensibility. That is demonstrated in this patch set, which adds a new field to struct clone_args:
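Passing the arguments to clone3() in a clone_args structure not only simplifies the function prototype but also adds extensibility. This patch takes advantage of exactly that by adding a new member to struct clone_args: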
u64 cgroup;
This new field naturally increases the size of the clone_args structure. Any call to clone3() must pass the size of the structure it is using as the size parameter, so the kernel is able to recognize whether the caller is using a version of the structure containing the cgroup field or not. This new field, in other words, can be added without breaking code that was built before its addition.
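Adding the cgroup member necessarily makes the clone_args structure larger. However, clone3() also takes a "size" parameter that specifies the size of the struct clone_args being passed in, so the kernel can use it to tell whether the caller's struct clone_args contains the "cgroup" member. As a result, the new member does not break compatibility with existing code.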
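The size-based versioning idea can be modeled in user space. The sketch below is only an illustration of the principle, not the kernel's code: the two struct layouts and the copy_versioned() helper are hypothetical, and the policy shown (zero-fill fields the caller did not supply, reject unknown trailing fields that are non-zero) is an assumption about how a receiver might reasonably behave.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical older and newer layouts of the same argument structure. */
struct args_v1 {
    uint64_t flags;
    uint64_t exit_signal;
};

struct args_v2 {
    uint64_t flags;
    uint64_t exit_signal;
    uint64_t cgroup;            /* field added in the newer version */
};

/*
 * Copy a caller-supplied structure of usize bytes into a receiver-side
 * structure of ksize bytes.
 *  - Caller is older (usize < ksize): missing fields read as zero.
 *  - Caller is newer (usize > ksize): accepted only if the fields the
 *    receiver does not know about are all zero, otherwise -E2BIG.
 */
int copy_versioned(void *dst, size_t ksize, const void *src, size_t usize)
{
    size_t common = usize < ksize ? usize : ksize;
    size_t i;

    assert(dst != NULL && src != NULL);

    if (usize > ksize) {
        const unsigned char *extra = (const unsigned char *)src + ksize;

        for (i = 0; i < usize - ksize; i++) {
            if (extra[i] != 0)
                return -E2BIG;  /* caller uses a feature we lack */
        }
    }

    memset(dst, 0, ksize);      /* unknown-to-caller fields default to 0 */
    memcpy(dst, src, common);
    return 0;
}
```

With this scheme, a program built against the older layout keeps working against a newer receiver, which simply sees cgroup as zero.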
To create a process into a different control group than the parent, one places a file descriptor for the appropriate control-group directory into that cgroup field and adds CLONE_INTO_CGROUP to the flags field in that same structure. If all goes well, the child will be created in the indicated group; otherwise the call will fail.
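To create a new task in a cgroup different from its parent's, first obtain a file descriptor for the cgroup directory and store it in the "cgroup" field of struct clone_args, then add CLONE_INTO_CGROUP to the "flags" field of the same structure. If all goes well, the child task is created in the target cgroup; otherwise the call fails.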
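The two steps just described (open the cgroup directory, then pass its descriptor with CLONE_INTO_CGROUP) might look like the following in C. This is a sketch under stated assumptions, not code from the patch set: it declares its own mirror of the argument structure (struct clone3_args_sketch) so that it builds with older headers, supplies fallback values for CLONE_INTO_CGROUP and SYS_clone3 when the headers lack them, and the helper names are hypothetical. It calls the raw system call, so C-library fork handlers are not run in the child.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_INTO_CGROUP
#define CLONE_INTO_CGROUP 0x200000000ULL   /* assumed fallback value */
#endif
#ifndef SYS_clone3
#define SYS_clone3 435                     /* assumed: most architectures */
#endif

/* Local mirror of struct clone_args up to and including the cgroup field. */
struct clone3_args_sketch {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
    uint64_t set_tid;
    uint64_t set_tid_size;
    uint64_t cgroup;
};

/* Fill in arguments for a fork-like child placed in the group cgroup_fd. */
void prepare_cgroup_clone(struct clone3_args_sketch *args, int cgroup_fd)
{
    memset(args, 0, sizeof(*args));
    args->flags = CLONE_INTO_CGROUP;
    args->exit_signal = SIGCHLD;           /* behave like fork() on exit */
    args->cgroup = (uint64_t)cgroup_fd;
}

/*
 * Create a child directly inside the cgroup-v2 directory cgroup_dir.
 * Returns the child's PID in the parent, 0 in the child, and -1 with
 * errno set on failure (for example on a kernel without this feature).
 */
pid_t clone_into_cgroup(const char *cgroup_dir)
{
    struct clone3_args_sketch args;
    pid_t pid;
    int saved_errno;
    int fd;

    fd = open(cgroup_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    prepare_cgroup_clone(&args, fd);
    pid = (pid_t)syscall(SYS_clone3, &args, sizeof(args));

    /* Both parent and child reach this point; neither needs the fd now. */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return pid;
}
```

A caller would treat the return value as it would fork()'s: run the child's work when it is 0, and wait for the child with waitpid() when it is positive.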
Placing a process into a control group in this way must adhere to all of the usual rules. For example, processes are only allowed to be placed in leaf nodes in the hierarchy, not in the internal nodes; an attempt to create a process in an internal node will fail. Similarly, all threads of a process must normally be in the same control group (though thread mode can relax that restriction in some cases); an attempt to separate threads improperly will fail.
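Adding a new task to a cgroup this way must still follow the usual cgroup rules. For example, tasks can only be added to leaf nodes; trying to add a new task to an internal node results in an error. Likewise, the threads of a process must normally be placed in the same control group (although thread mode can relax this restriction), and an attempt to split a process's threads across different control groups improperly will also fail.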
This patch set is quite new (this is its first posting), so there has been little time for the development community to review it. The feature seems useful enough that it is hard to imagine a great deal of opposition to the goal. Whether the implementation stands up to review remains to be seen, but it seems likely that this enhancement will find its way upstream in the relatively near future.
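(As of the publication of this post: linux-5.7.) Since this is the first posting of the patch, the community needs some time to review it. The patch is quite useful, however, so significant resistance is not expected. Whether this version passes review remains to be seen, but the feature is likely to make its way upstream before long.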
LWN articles are licensed under CC BY-SA 4.0. Feel free to share, repost, and create derivative works under the existing license.