NUMA Memory Policy (translated by ChatGPT)
Original: https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html
What is NUMA Memory Policy?
In the Linux kernel, "memory policy" determines from which node the kernel will allocate memory in a NUMA system or in an emulated NUMA system. Linux has supported platforms with Non-Uniform Memory Access architectures since 2.4.?. The current memory policy support was added to Linux 2.6 around May 2004. This document attempts to describe the concepts and APIs of the 2.6 memory policy support.
Memory policies should not be confused with cpusets (Documentation/admin-guide/cgroup-v1/cpusets.rst) which is an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes. Memory policies are a programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset take priority. See Memory Policies and cpusets below for more details.
Memory Policy Concepts
Scope of Memory Policies
The Linux kernel supports scopes of memory policy, described here from most general to most specific:
System Default Policy
this policy is "hard coded" into the kernel. It is the policy that governs all page allocations that aren't controlled by one of the more specific policy scopes discussed below. When the system is "up and running", the system default policy will use "local allocation" described below. However, during boot up, the system default policy will be set to interleave allocations across all nodes with "sufficient" memory, so as not to overload the initial boot node with boot-time allocations.
Task/Process Policy
this is an optional, per-task policy. When defined for a specific task, this policy controls all page allocations made by or on behalf of the task that aren't controlled by a more specific scope. If a task does not define a task policy, then all page allocations that would have been controlled by the task policy "fall back" to the System Default Policy.
The task policy applies to the entire address space of a task. Thus, it is inheritable, and indeed is inherited, across both fork() [clone() w/o the CLONE_VM flag] and exec(). This allows a parent task to establish the task policy for a child task exec()'d from an executable image that has no awareness of memory policy. See the Memory Policy APIs section, below, for an overview of the system call that a task may use to set/change its task/process policy.
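The following minimal launcher is an illustrative sketch, not part of the original document: it installs a task policy and then exec()s a policy-unaware program, so the policy carries across the exec() (and would equally be inherited by children across fork()). It assumes the <numaif.h> header from the numactl/libnuma package (link with -lnuma) and that nodes 0 and 1 exist.

    /* launcher.c -- hedged sketch: set a task policy, then exec a policy-unaware program. */
    #include <numaif.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        /* Bit N of the mask selects node N; nodes 0 and 1 are assumed to exist. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        if (argc < 2) {
            fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
            return 1;
        }

        /* Interleave all future allocations of this task across nodes 0 and 1.
         * maxnode is the number of bits the kernel should read from nodemask. */
        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8) != 0) {
            perror("set_mempolicy");
            return 1;
        }

        /* The task policy survives exec(), so the new image allocates under it. */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
    }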
In a multi-threaded task, task policies apply only to the thread [Linux kernel task] that installs the policy and any threads subsequently created by that thread. Any sibling threads existing at the time a new task policy is installed retain their current policy.
A task policy applies only to pages allocated after the policy is installed. Any pages already faulted in by the task when the task changes its task policy remain where they were allocated based on the policy at the time they were allocated.
VMA Policy
A "VMA" or "Virtual Memory Area" refers to a range of a task's virtual address space. A task may define a specific policy for a range of its virtual address space. See the Memory Policy APIs section, below, for an overview of the mbind() system call used to set a VMA policy.
A VMA policy will govern the allocation of pages that back this region of the address space. Any regions of the task's address space that don't have an explicit VMA policy will fall back to the task policy, which may itself fall back to the System Default Policy.
VMA policies have a few complicating details:
- VMA policy applies ONLY to anonymous pages. These include pages allocated for anonymous segments, such as the task stack and heap, and any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is applied to a file mapping, it will be ignored if the mapping used the MAP_SHARED flag. If the file mapping used the MAP_PRIVATE flag, the VMA policy will only be applied when an anonymous page is allocated on an attempt to write to the mapping--i.e., at Copy-On-Write.
- VMA policies are shared between all tasks that share a virtual address space--a.k.a. threads--independent of when the policy is installed; and they are inherited across fork(). However, because VMA policies refer to a specific region of a task's address space, and because the address space is discarded and recreated on exec(), VMA policies are NOT inheritable across exec(). Thus, only NUMA-aware applications may use VMA policies.
- A task may install a new VMA policy on a sub-range of a previously mmap()ed region. When this happens, Linux splits the existing virtual memory area into 2 or 3 VMAs, each with its own policy (see the sketch after this list).
- By default, VMA policy applies only to pages allocated after the policy is installed. Any pages already faulted into the VMA range remain where they were allocated based on the policy at the time they were allocated. However, since 2.6.16, Linux supports page migration via the mbind() system call, so that page contents can be moved to match a newly installed policy.
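The following hedged sketch, not from the original document, makes the split visible; it assumes <numaif.h> from the numactl/libnuma package (link with -lnuma) and that node 0 exists.

    #include <numaif.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t len = 16 * page;
        unsigned long nodemask = 1UL << 0;          /* node 0, assumed present */

        /* One anonymous VMA of 16 pages. */
        char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        /* Installing a policy on pages 4-7 only; per the text above the kernel
         * splits the VMA into before / policy range / after, visible in
         * /proc/self/numa_maps (and as separate VMAs in /proc/self/maps). */
        if (mbind(region + 4 * page, 4 * page, MPOL_BIND,
                  &nodemask, sizeof(nodemask) * 8, 0) != 0)
            perror("mbind");

        pause();    /* keep the process alive so the maps can be inspected */
        return 0;
    }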
Shared Policy
Conceptually, shared policies apply to "memory objects" mapped shared into one or more tasks' distinct address spaces. An application installs shared policies the same way as VMA policies--using the mbind() system call specifying a range of virtual addresses that map the shared object. However, unlike VMA policies, which can be considered to be an attribute of a range of a task's address space, shared policies apply directly to the shared object. Thus, all tasks that attach to the object share the policy, and all pages allocated for the shared object, by any task, will obey the shared policy.
As of 2.6.22, only shared memory segments, created by shmget() or mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared policy support was added to Linux, the associated data structures were added to hugetlbfs shmem segments. At the time, hugetlbfs did not support allocation at fault time--a.k.a lazy allocation--so hugetlbfs shmem segments were never "hooked up" to the shared policy support. Although hugetlbfs segments now support lazy allocation, their support for shared policy has not been completed.
As mentioned above in VMA policies section, allocations of page cache pages for regular files mmap()ed with MAP_SHARED ignore any VMA policy installed on the virtual address range backed by the shared file mapping. Rather, shared page cache pages, including pages backing private mappings that have not yet been written by the task, follow task policy, if any, else System Default Policy.
The shared policy infrastructure supports different policies on subset ranges of the shared object. However, Linux still splits the VMA of the task that installs the policy for each range of distinct policy. Thus, different tasks that attach to a shared memory segment can have different VMA configurations mapping that one shared object. This can be seen by examining the /proc/<pid>/numa_maps of tasks sharing a shared memory region, when one task has installed shared policy on one or more ranges of the region.
Components of Memory Policies
A NUMA memory policy consists of a "mode", optional mode flags, and an optional set of nodes. The mode determines the behavior of the policy, the optional mode flags determine the behavior of the mode, and the optional set of nodes can be viewed as the arguments to the policy behavior.
Internally, memory policies are implemented by a reference counted structure, struct mempolicy. Details of this structure will be discussed in context, below, as required to explain the behavior.
NUMA memory policy supports the following 4 behavioral modes:
- Default Mode--MPOL_DEFAULT
  This mode is only used in the memory policy APIs. Internally, MPOL_DEFAULT is converted to the NULL memory policy in all policy scopes. Any existing non-default policy will simply be removed when MPOL_DEFAULT is specified. As a result, MPOL_DEFAULT means "fall back to the next most specific policy scope."
  For example, a NULL or default task policy will fall back to the system default policy. A NULL or default vma policy will fall back to the task policy.
  When specified in one of the memory policy APIs, the Default mode does not use the optional set of nodes.
  It is an error for the set of nodes specified for this policy to be non-empty.
- MPOL_BIND
  This mode specifies that memory must come from the set of nodes specified by the policy. Memory will be allocated from the node in the set with sufficient free memory that is closest to the node where the allocation takes place.
- MPOL_PREFERRED
  This mode specifies that the allocation should be attempted from the single node specified in the policy. If that allocation fails, the kernel will search other nodes, in order of increasing distance from the preferred node based on information provided by the platform firmware.
  Internally, the Preferred policy uses a single node--the preferred_node member of struct mempolicy. When the internal mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and the policy is interpreted as local allocation. "Local" allocation policy can be viewed as a Preferred policy that starts at the node containing the cpu where the allocation takes place.
  It is possible for the user to specify that local allocation is always preferred by passing an empty nodemask with this mode. If an empty nodemask is passed, the policy cannot use the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described below.
- MPOL_INTERLEAVED
  This mode specifies that page allocations be interleaved, on a page granularity, across the nodes specified in the policy. This mode also behaves slightly differently, based on the context where it is used (see the sketch after this list):
  For allocation of anonymous pages and shared memory pages, Interleave mode indexes the set of nodes specified by the policy using the page offset of the faulting address into the segment [VMA] containing the address modulo the number of nodes specified by the policy. It then attempts to allocate a page, starting at the selected node, as if the node had been specified by a Preferred policy or had been selected by a local allocation. That is, allocation will follow the per node zonelist.
  For allocation of page cache pages, Interleave mode indexes the set of nodes specified by the policy using a node counter maintained per task. This counter wraps around to the lowest specified node after it reaches the highest specified node. This will tend to spread the pages out over the nodes specified by the policy based on the order in which they are allocated, rather than based on any page offset into an address range or file. During system boot up, the temporary interleaved system default policy works in this mode.
- MPOL_PREFERRED_MANY
  This mode specifies that the allocation should be preferably satisfied from the nodemask specified in the policy. If there is memory pressure on all nodes in the nodemask, the allocation can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node.
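To make the anonymous-page case concrete, the following user-space restatement of the index computation is illustrative only; the function name and its parameters are assumptions of this sketch, not a kernel interface.

    #include <stdio.h>
    #include <stddef.h>

    /* 'nodes' holds the node ids set in the interleave policy, in ascending order. */
    static int interleave_node_for_fault(unsigned long fault_addr,
                                         unsigned long vma_start,
                                         unsigned long page_size,
                                         const int *nodes, size_t nr_nodes)
    {
        /* Page offset of the faulting address within the containing VMA ... */
        unsigned long page_offset = (fault_addr - vma_start) / page_size;

        /* ... taken modulo the number of nodes in the policy selects the node. */
        return nodes[page_offset % nr_nodes];
    }

    int main(void)
    {
        const int nodes[] = { 0, 2, 4 };   /* an interleave policy over three nodes */

        /* The 5th page of the VMA (offset 4) maps to index 4 % 3 == 1, i.e. node 2. */
        printf("node %d\n",
               interleave_node_for_fault(0x7f0000004000UL, 0x7f0000000000UL,
                                         4096, nodes, 3));
        return 0;
    }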
NUMA memory policy supports the following optional mode flags:
- MPOL_F_STATIC_NODES
  This flag specifies that the nodemask passed by the user should not be remapped if the task or VMA's set of allowed nodes changes after the memory policy has been defined.
  Without this flag, any time a mempolicy is rebound because of a change in the set of allowed nodes, the preferred nodemask (Preferred Many), preferred node (Preferred) or nodemask (Bind, Interleave) is remapped to the new set of allowed nodes. This may result in nodes being used that were previously undesired.
  With this flag, if the user-specified nodes overlap with the nodes allowed by the task's cpuset, then the memory policy is applied to their intersection. If the two sets of nodes do not overlap, the Default policy is used.
  For example, consider a task that is attached to a cpuset with mems 1-3 that sets an Interleave policy over the same set. If the cpuset's mems change to 3-5, the Interleave will now occur over nodes 3, 4, and 5. With this flag, however, since only node 3 is allowed from the user's nodemask, the "interleave" only occurs over that node. If no nodes from the user's nodemask are now allowed, the Default behavior is used.
  MPOL_F_STATIC_NODES cannot be combined with the MPOL_F_RELATIVE_NODES flag. It also cannot be used for MPOL_PREFERRED policies that were created with an empty nodemask (local allocation).
- MPOL_F_RELATIVE_NODES
  This flag specifies that the nodemask passed by the user will be mapped relative to the task or VMA's set of allowed nodes. The kernel stores the user-passed nodemask, and if the allowed nodes change, then that original nodemask will be remapped relative to the new set of allowed nodes.
  Without this flag (and without MPOL_F_STATIC_NODES), anytime a mempolicy is rebound because of a change in the set of allowed nodes, the node (Preferred) or nodemask (Bind, Interleave) is remapped to the new set of allowed nodes. That remap may not preserve the relative nature of the user's passed nodemask to its set of allowed nodes upon successive rebinds: a nodemask of 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of allowed nodes is restored to its original state.
  With this flag, the remap is done so that the node numbers from the user's passed nodemask are relative to the set of allowed nodes. In other words, if nodes 0, 2, and 4 are set in the user's nodemask, the policy will be effected over the first (and in the Bind or Interleave case, the third and fifth) nodes in the set of allowed nodes. The nodemask passed by the user represents nodes relative to task or VMA's set of allowed nodes.
  If the user's nodemask includes nodes that are outside the range of the new set of allowed nodes (for example, node 5 is set in the user's nodemask when the set of allowed nodes is only 0-3), then the remap wraps around to the beginning of the nodemask and, if not already set, sets the node in the mempolicy nodemask.
  For example, consider a task that is attached to a cpuset with mems 2-5 that sets an Interleave policy over the same set with MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the interleave now occurs over nodes 3,5-7. If the cpuset's mems then change to 0,2-3,5, then the interleave occurs over nodes 0,2-3,5. (A small sketch of this fold-and-map arithmetic follows this list.)
  Thanks to the consistent remapping, applications preparing nodemasks to specify memory policies using this flag should disregard their current, actual cpuset imposed memory placement and prepare the nodemask as if they were always located on memory nodes 0 to N-1, where N is the number of memory nodes the policy is intended to manage. Let the kernel then remap to the set of memory nodes allowed by the task's cpuset, as that may change over time.
  MPOL_F_RELATIVE_NODES cannot be combined with the MPOL_F_STATIC_NODES flag. It also cannot be used for MPOL_PREFERRED policies that were created with an empty nodemask (local allocation).
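The following user-space sketch restates the fold-and-map arithmetic for the example above; the helper names and bit-mask representation are assumptions of the sketch, not kernel code, and the allowed set is assumed non-empty.

    #include <stdio.h>

    /* Node id of the idx-th set bit (lowest first) in 'mask'; bit N means node N. */
    static int nth_node(unsigned long mask, int idx)
    {
        for (int node = 0; node < 64; node++) {
            if (!(mask & (1UL << node)))
                continue;
            if (idx-- == 0)
                return node;
        }
        return -1;
    }

    /* Fold the user's relative nodemask onto the currently allowed set, as
     * described above: bit b selects the (b mod W)-th allowed node, where W is
     * the number of allowed nodes. */
    static unsigned long remap_relative(unsigned long user, unsigned long allowed)
    {
        int width = 0;
        unsigned long result = 0;

        for (int b = 0; b < 64; b++)
            if (allowed & (1UL << b))
                width++;

        for (int b = 0; b < 64; b++)
            if (user & (1UL << b))
                result |= 1UL << nth_node(allowed, b % width);
        return result;
    }

    int main(void)
    {
        unsigned long user = 0x3cUL;     /* user nodemask: nodes 2-5, as in the example */
        unsigned long allowed = 0xf8UL;  /* cpuset mems 3-7 */

        /* Prints 0xe8, i.e. nodes 3,5-7 -- matching the interleave example above. */
        printf("effective nodemask: 0x%lx\n", remap_relative(user, allowed));
        return 0;
    }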
Memory Policy Reference Counting
To resolve use/free races, struct mempolicy contains an atomic reference count field. Internal interfaces, mpol_get()/mpol_put() increment and decrement this reference count, respectively. mpol_put() will only free the structure back to the mempolicy kmem cache when the reference count goes to zero.
When a new memory policy is allocated, its reference count is initialized to '1', representing the reference held by the task that is installing the new policy. When a pointer to a memory policy structure is stored in another structure, another reference is added, as the task's reference will be dropped on completion of the policy installation.
During run-time "usage" of the policy, we attempt to minimize atomic operations on the reference count, as this can lead to cache lines bouncing between cpus and NUMA nodes. "Usage" here means one of the following:
- querying of the policy, either by the task itself [using the get_mempolicy() API discussed below] or by another task using the /proc/<pid>/numa_maps interface.
- examination of the policy to determine the policy mode and associated node or node lists, if any, for page allocation. This is considered a "hot path". Note that for MPOL_BIND, the "usage" extends across the entire allocation process, which may sleep during page reclamation, because the BIND policy nodemask is used, by reference, to filter ineligible nodes.
We can avoid taking an extra reference during the usages listed above as follows:
- we never need to get/free the system default policy as this is never changed nor freed, once the system is up and running.
- for querying the policy, we do not need to take an extra reference on the target task's task policy nor vma policies because we always acquire the task's mm's mmap_lock for read during the query. The set_mempolicy() and mbind() APIs [see below] always acquire the mmap_lock for write when installing or replacing task or vma policies. Thus, there is no possibility of a task or thread freeing a policy while another task or thread is querying it.
- Page allocation usage of task or vma policy occurs in the fault path where we hold the mmap_lock for read. Again, because replacing the task or vma policy requires that the mmap_lock be held for write, the policy can't be freed out from under us while we're using it for page allocation.
- Shared policies require special consideration. One task can replace a shared memory policy while another task, with a distinct mmap_lock, is querying or allocating a page based on the policy. To resolve this potential race, the shared policy infrastructure adds an extra reference to the shared policy during lookup while holding a spin lock on the shared policy management structure. This requires that we drop this extra reference when we're finished "using" the policy. We must drop the extra reference on shared policies in the same query/allocation paths used for non-shared policies. For this reason, shared policies are marked as such, and the extra reference is dropped "conditionally"--i.e., only for shared policies.
  Because of this extra reference counting, and because we must lookup shared policies in a tree structure under spinlock, shared policies are more expensive to use in the page allocation path. This is especially true for shared policies on shared memory regions shared by tasks running on different NUMA nodes. This extra overhead can be avoided by always falling back to task or system default policy for shared memory regions, or by prefaulting the entire shared memory region into memory and locking it down (a minimal sketch of the latter follows this list). However, this might not be appropriate for all applications.
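A minimal sketch of the prefault-and-lock idea, assuming a shared mapping addr/len whose shared policy has already been installed with mbind():

    #include <sys/mman.h>

    /* mlock() faults every page of the region in (allocating it once under the
     * shared policy) and pins it, so later accesses by any attached task do not
     * reach the shared-policy lookup in the page fault path. */
    static int prefault_and_lock(void *addr, size_t len)
    {
        return mlock(addr, len);   /* 0 on success, -1 with errno set on failure */
    }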
Memory Policy APIs
Linux supports 4 system calls for controlling memory policy. These APIs always affect only the calling task, the calling task's address space, or some shared object mapped into the calling task's address space.
Note
the headers that define these APIs and the parameter data types for user space applications reside in a package that is not part of the Linux kernel. The kernel system call interfaces, with the 'sys_' prefix, are defined in <linux/syscalls.h>; the mode and flag definitions are defined in <linux/mempolicy.h>.
Set [Task] Memory Policy:
long set_mempolicy(int mode, const unsigned long *nmask,
unsigned long maxnode);
Sets the calling task's "task/process memory policy" to the mode specified by the 'mode' argument and the set of nodes defined by 'nmask'. 'nmask' points to a bit mask of node ids containing at least 'maxnode' ids. Optional mode flags may be passed by combining the 'mode' argument with the flag (for example: MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
See the set_mempolicy(2) man page for more details
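A minimal, hedged example of such a call, assuming <numaif.h> from the numactl/libnuma package (link with -lnuma) and that nodes 0 and 2 exist:

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long nodemask = (1UL << 0) | (1UL << 2);   /* nodes 0 and 2, if present */

        /* Interleave future allocations across nodes 0 and 2, keeping the user
         * nodemask static across cpuset rebinds as described above. */
        if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
                          &nodemask, sizeof(nodemask) * 8) != 0)
            perror("set_mempolicy");

        /* Passing MPOL_DEFAULT with no nodemask falls back to the next most
         * specific policy scope. */
        if (set_mempolicy(MPOL_DEFAULT, NULL, 0) != 0)
            perror("set_mempolicy(MPOL_DEFAULT)");

        return 0;
    }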
Get [Task] Memory Policy or Related Information:
long get_mempolicy(int *mode,
const unsigned long *nmask, unsigned long maxnode,
void *addr, int flags);
Queries the "task/process memory policy" of the calling task, or the policy or location of a specified virtual address, depending on the 'flags' argument.
See the get_mempolicy(2) man page for more details
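A brief illustrative sketch of both query forms (the task policy, and the node currently backing a given address), under the same <numaif.h>/-lnuma assumption:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int mode;
        unsigned long nodemask = 0;

        /* With addr == NULL and flags == 0, this returns the calling task's policy. */
        if (get_mempolicy(&mode, &nodemask, sizeof(nodemask) * 8, NULL, 0) != 0)
            perror("get_mempolicy(task)");
        else
            printf("task policy mode %d, nodemask 0x%lx\n", mode, nodemask);

        /* MPOL_F_NODE | MPOL_F_ADDR asks which node backs 'p'; the page is
         * touched first so that it has actually been allocated. */
        int node;
        char *p = malloc(4096);
        p[0] = 1;
        if (get_mempolicy(&node, NULL, 0, p, MPOL_F_NODE | MPOL_F_ADDR) != 0)
            perror("get_mempolicy(addr)");
        else
            printf("page at %p is on node %d\n", (void *)p, node);

        free(p);
        return 0;
    }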
Install VMA/Shared Policy for a Range of Task's Address Space:
long mbind(void *start, unsigned long len, int mode,
const unsigned long *nmask, unsigned long maxnode,
unsigned flags);
mbind() installs the policy specified by (mode, nmask, maxnodes) as a VMA policy for the range of the calling task's address space specified by the 'start' and 'len' arguments. Additional actions may be requested via the 'flags' argument.
See the mbind(2) man page for more details.
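A hedged sketch of installing a shared policy on a shared anonymous segment, as described in the Shared Policy section above (assumes <numaif.h>/-lnuma and that node 1 exists):

    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 8 * 1024 * 1024;
        unsigned long nodemask = 1UL << 1;          /* node 1, assumed present */

        /* A shared anonymous segment: mbind() on it installs a shared policy,
         * so any task that maps this object allocates its pages on node 1. */
        void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* MPOL_MF_MOVE additionally migrates any pages already faulted in. */
        if (mbind(seg, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE) != 0)
            perror("mbind");

        return 0;
    }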
Set home node for a Range of Task's Address Space:
long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
sys_set_mempolicy_home_node sets the home node for a VMA policy present in the task's address range. The system call updates the home node only for the existing mempolicy range. Other address ranges are ignored. A home node is the NUMA node closest to which page allocation will come from. Specifying the home node overrides the default allocation policy to allocate memory close to the local node for an executing CPU.
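Since there may be no libc wrapper for this call, a sketch would go through syscall(2). The example below assumes kernel and libc headers recent enough (Linux 5.17 or later) to define __NR_set_mempolicy_home_node, <numaif.h>/-lnuma for the mbind() step, and that nodes 0 and 1 exist:

    #include <sys/syscall.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <numaif.h>

    int main(void)
    {
        size_t len = 4 * 1024 * 1024;
        unsigned long nodemask = (1UL << 0) | (1UL << 1);   /* nodes 0-1, if present */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* A home node only applies to a range that already has a mempolicy. */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
            perror("mbind");

        /* Raw syscall: prefer allocations for this range to start at node 1. */
        if (syscall(__NR_set_mempolicy_home_node,
                    (unsigned long)p, (unsigned long)len, 1UL, 0UL) != 0)
            perror("set_mempolicy_home_node");

        return 0;
    }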
Memory Policy Command Line Interface
Although not strictly part of the Linux implementation of memory policy, a command line tool, numactl(8), exists that allows one to:
- set the task policy for a specified program via set_mempolicy(2), fork(2) and exec(2)
- set the shared policy for a shared memory segment via mbind(2)
The numactl(8) tool is packaged with the run-time version of the library containing the memory policy system call wrappers. Some distributions package the headers and compile-time libraries in a separate development package.
Memory Policies and cpusets
Memory policies work within cpusets as described above. For memory policies that require a node or set of nodes, the nodes are restricted to the set of nodes whose memories are allowed by the cpuset constraints. If the nodemask specified for the policy contains nodes that are not allowed by the cpuset and MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes specified for the policy and the set of nodes with memory is used. If the result is the empty set, the policy is considered invalid and cannot be installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped onto and folded into the task's set of allowed nodes as previously described.
The interaction of memory policies and cpusets can be problematic when tasks in two cpusets share access to a memory region, such as shared memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If any of the tasks installs shared policy on the region, only nodes whose memories are allowed in both cpusets may be used in the policies. Obtaining this information requires "stepping outside" the memory policy APIs to use the cpuset information and requires that one know in what cpusets other tasks might be attaching to the shared region. Furthermore, if the cpusets' allowed memory sets are disjoint, "local" allocation is the only valid policy.