How The Kernel Manages Your Memory

After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:

Linux kernel mm_struct

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program's memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two workhorses for managing program memory: the set of virtual memory areas and the page tables. Gonzo's memory areas are shown below:

Kernel memory descriptor and memory areas

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.

A program's VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read the file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.

In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It's the little differences.

The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here's 3GB of user space in 4KB pages:

4KB Pages Virtual User Space

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process's page tables in the pgd field of the memory descriptor. Each virtual page has one corresponding page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:

x86 Page Table Entry (PTE) for 4KB page

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.

Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them; they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4GB. The other PTE fields are for another day, as is Physical Address Extension.

A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it's still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.

Virtual memory doesn't store anything; it simply maps a program's address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn't know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:

Physical Address Space

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it's available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.

Let's put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:

Physical Address Space

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.

A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says "sure", and it creates or updates the appropriate VMA. But it does not actually honor the request right away; it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program's memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let's take the simple case of memory allocation:

Example of demand paging and memory allocation

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there's no suitable VMA, no contract covers the attempted memory access and the process is punished with a segmentation fault.

When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.

Things could have been different. The PTE for a swapped-out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.

This concludes the first half of our tour through the kernel's user memory management. In the next post, we'll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.

