Linux Zero Copy
Purpose
This document highlights aspects of several Linux subsystems (virtual memory, the virtual file system) and how they relate to the storage subsystems (file systems and block device drivers). Specifically, the goal is to provide enough information to design inter-process communication and logging software intelligently.
Virtual Memory
Much of this section is summarized from the very good introduction published by The Linux Documentation Project.
The virtual memory subsystem presents a view of system memory that is larger than the physical memory available. In addition, it provides a number of other features:
- Large address spaces
- Process isolation and protection
- Memory mapping of files or devices
- Allocation / division of physical memory
- Shared memory
Abstract Model
The address space of a process is the set of all addresses representable in a pointer (i.e. a 32- or 64-bit word). In userspace, all memory accesses are made against virtual addresses. A virtual address within the process address space is translated to a physical address by lookup tables managed by the kernel. To keep these lookup tables efficient, they operate on fixed-size regions of memory, one unit of which is called a page. The lookup tables are called page tables. On x86 and ARM the page size is typically 4 KiB (4096 bytes).
Virtual addresses are composed of two fields: a frame number and an offset. The frame number selects the page table entry used to look up the physical page, and the offset is the position within that physical page that the address refers to.
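The split into frame number and offset can be sketched in a few lines of C. This is a minimal illustration, not how the hardware is programmed: the struct and function names are made up for this example, and the page size is passed in by the caller (at run time it could come from sysconf(_SC_PAGESIZE)).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two fields of a virtual address (names are illustrative). */
struct vaddr_parts {
    uintptr_t frame_number; /* index of the virtual page */
    size_t offset;          /* byte position inside that page */
};

/* Split addr for a given page size, which must be a power of two. */
struct vaddr_parts split_vaddr(uintptr_t addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    struct vaddr_parts parts;
    parts.frame_number = addr / page_size;
    parts.offset = (size_t)(addr % page_size);
    return parts;
}
```

With a 4096-byte page, the address 0x12345 splits into frame number 0x12 and offset 0x345.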

Abstract model of virtual to physical address mapping
The page table keeps track of some metadata associated with each mapping:
- Whether this page table entry is valid
- The physical page number this entry points to
- Access control: read, write, execute
The processor uses the page table that the kernel manages. When a program attempts to access memory, the processor looks up the physical page in the table and (if the entry is valid) completes the access. If the table entry is not valid, the processor notifies the kernel by raising a page fault.
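The lookup described above can be modelled with a toy, single-level page table. This is only a sketch under simplifying assumptions: real page tables are multi-level and hardware-specific, and the names toy_pte and toy_translate, the fixed 4096-byte page, and the single writable flag are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Toy page table entry; real hardware formats differ. */
struct toy_pte {
    int valid;              /* is this entry usable? */
    uint64_t frame_number;  /* physical page it points to */
    int writable;           /* access control (read is implied here) */
};

/* Translate vaddr; returns 0 on success, -1 to model a page fault. */
int toy_translate(const struct toy_pte *table, size_t n_entries,
                  uint64_t vaddr, int is_write, uint64_t *paddr)
{
    assert(table != NULL && paddr != NULL);
    uint64_t vpn = vaddr / TOY_PAGE_SIZE;
    uint64_t offset = vaddr % TOY_PAGE_SIZE;

    if (vpn >= n_entries || !table[vpn].valid)
        return -1;                       /* no valid mapping */
    if (is_write && !table[vpn].writable)
        return -1;                       /* access not permitted */

    *paddr = table[vpn].frame_number * TOY_PAGE_SIZE + offset;
    return 0;
}
```

In this model the -1 return stands in for the page fault that hands control to the kernel.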
Demand Paging
Given that each process has its own address space, the amount of addressable virtual memory in a multiprocess system can be much greater than the physical memory. Linux implements a number of strategies to use this limited physical memory efficiently. One of those strategies is demand paging of process images.
When a program (in ELF format) is first started, the kernel's ELF loader (together with the ELF interpreter, for dynamically linked programs) maps the program file into memory, at which point we refer to it as the process image. This mapping is initially unresolved in the page table (except for the initial portion), and physical memory is not yet dedicated to the process image. Instead, pages are filled on demand in response to page faults encountered as the program image is accessed.
When the processor encounters a memory access in the program flow (it may need to fetch the next instruction, jump to a different instruction, or read or write memory), it resolves the virtual address to a physical address through the process's page table. As mentioned above, if there is no entry in the process page table for that virtual address, or if the entry is invalid, the processor raises a page fault and control of the processor moves to the kernel.
For example, in the figure above there is no entry in process 1's page table for virtual page frame number 2, so if process 1 attempts to read from an address within virtual page frame number 2 the processor cannot translate the address into a physical one.
If the virtual address of the access that induced the page fault is invalid, then the program is trying to read or write an address that it has not mapped. This is known as a segmentation fault. The kernel signals the program with SIGSEGV and the program counter jumps to the signal handler (usually resulting in process termination).
If the virtual address of the access is valid but there is no physical page backing it, the kernel must assign a physical page to that virtual page and then fill that page with the program contents read from disk. In general this is a time-consuming process, so it is an opportunity for the scheduler to run some other process on the processor that raised the fault. Once the fetched page is copied to physical memory and an entry is added to the page table, the process is restarted at the faulting instruction. This time the virtual memory address is successfully translated to a physical address by the processor and the program continues.
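On-demand filling of pages can be observed from userspace. The sketch below is an illustration rather than a description of the ELF case above: it uses an anonymous mapping as a stand-in for a program image and asks mincore(2) which pages are resident before and after touching some of them. The function names are hypothetical, and the exact counts reported depend on the kernel and its configuration (transparent huge pages, for instance, can make more pages resident than were touched).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages of [addr, addr+len) mincore(2) reports resident. */
static long resident_pages(void *addr, size_t len, size_t page_size)
{
    size_t n = (len + page_size - 1) / page_size;
    unsigned char *vec = malloc(n);
    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += vec[i] & 1;
    free(vec);
    return count;
}

/* Map n_pages anonymously, touch the first n_touch, report residency. */
int demand_paging_demo(size_t n_pages, size_t n_touch,
                       long *before, long *after)
{
    assert(n_pages > 0 && n_touch <= n_pages);
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = n_pages * page_size;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *before = resident_pages(p, len, page_size);
    for (size_t i = 0; i < n_touch; i++)
        p[i * page_size] = 1;            /* first write faults the page in */
    *after = resident_pages(p, len, page_size);
    munmap(p, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system the "before" count is expected to be small or zero and the "after" count at least the number of touched pages.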
Linux Page Cache
In general, all reads and writes to real files in Linux go through the Linux page cache. This is a fundamental aspect of Linux performance and has far-reaching implications, including some that we may exploit for optimization.
The Linux page cache is an in-memory cache for file data that generally behaves as a write-back, read-through cache. When data is read from a regular file, it is first brought into the page cache and then made available through the filesystem driver. When data is written to a file, it is first copied to the page cache and then flushed out to storage at some later point.
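Because writes land in the page cache first, a program that needs the data on storage has to ask for the flush explicitly. A minimal sketch, assuming a hypothetical helper named write_durably and using only write(2) and fsync(2):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write len bytes to path and wait until they reach storage.
 * Returns 0 on success, -1 on error. */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);   /* copies into the page cache */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    int rc = fsync(fd);                   /* forces dirty pages out */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Without the fsync() call the function would usually return sooner, since the kernel would be free to write the dirty pages back later.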
The purpose of the page cache is to speed up access to files on storage media. Files are generally read in one page at a time, and these pages are stored in the page cache.

The Linux Page Cache
Each file in Linux is identified by a data structure called an inode, and in Linux pretty much everything is a file (and so has an inode associated with it). When a page from a memory-mapped file is read, it is processed through the page cache. If the cache is hot, the page is served out of the cache. Otherwise a physical page is allocated and the filesystem or storage driver is asked to fill the page.
Pages filled in the page cache generally stay resident until some other demand pushes them out. This is of particular note because, in general, most of the physical memory on a Linux system is in use (by the page cache, if nothing else).
Recovering physical memory
The Linux kernel attempts to keep a pool of physical memory available for future use. The configurable behavior of this pool has two relevant parameters:
- high water mark
- low water mark
If the amount of physical memory available is greater than the high water mark, the kernel does nothing at all. Anything currently paged into the page cache is left there indefinitely.
Between the high water mark and the low water mark the kernel begins to take action: it starts to evict pages from physical memory. Below the low water mark the kernel gets more aggressive. The difference between the two regimes is the number of pages the kernel tries to free during each attempt.
kswapd decision points. (a): with lots of physical memory available, the swap daemon doesn't do anything. (b): when available memory drops below the high water mark, the swap daemon attempts to free a couple of pages per timer tick. (c): when available memory drops below the low water mark, the daemon attempts to free more pages per tick.
The whole process is carried out by a kernel thread called the kernel swap daemon (kswapd). It is serviced on a timer, and at each service it looks at the number of free pages and takes action.
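The three decision points can be written down as a small pure function. This is a toy model of the description above, not kernel code: the enum, the function name, and the choice to treat a free count exactly equal to a mark as being above it are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum reclaim_action {
    RECLAIM_NONE,       /* (a) plenty of free memory */
    RECLAIM_GENTLE,     /* (b) below the high water mark */
    RECLAIM_AGGRESSIVE  /* (c) below the low water mark */
};

/* Decide what a kswapd-like daemon would do on one timer tick. */
enum reclaim_action reclaim_decision(size_t free_pages,
                                     size_t low_mark, size_t high_mark)
{
    assert(low_mark <= high_mark);
    if (free_pages >= high_mark)
        return RECLAIM_NONE;
    if (free_pages >= low_mark)
        return RECLAIM_GENTLE;
    return RECLAIM_AGGRESSIVE;
}
```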
When the swap daemon decides to try to free memory, it first looks for page cache entries that can be discarded. It does this by walking around the page cache and inspecting some fixed number of pages at each iteration (the clock algorithm), looking for any pages that can be discarded. A page is discardable if it is not mapped into any process address space.
If the swap daemon doesn't recover enough pages by discarding page cache entries, it then attempts to swap out or discard mapped pages. It looks only at processes that are swappable (some are not) and that have swappable pages. A page can be locked, which removes it from the candidate pool of swappable pages. If disk swap is enabled, the swap daemon considers writing a page out to the swap file only if the page cannot be recovered in another way. Demand-paged program storage, for instance, can be discarded without swapping because the data can be read back from disk if it's needed again later. The swap daemon preferentially pages out old pages rather than those that were used recently.
Virtual File System
Again, much of this section is summarized from the very good introduction published by The Linux Documentation Project.
The Linux Virtual File System (VFS) allows the operating system to interact with heterogeneous storage media using a wide array of different filesystems. Filesystem drivers in Linux translate VFS interactions into filesystem-specific interactions with the underlying storage media.
Schematic of the Linux virtual filesystem and caching.
The basic building block of the Linux VFS is the inode. Every file in the filesystem is described by one and only one inode. Inodes describe both the contents and the topology of the VFS. Inodes and directory contents are cached in the page cache like file contents, though these cache entries are not 1-1 mappings with data on the block device of the storage medium. Rather, they are translated by the filesystem driver when they are read in. Nevertheless, they are still generally discardable cache entries, as the data can always be restored by reading back the relevant blocks of the underlying storage medium (through the translation layer of the filesystem driver).
A filesystem is basically two things:
- a specification for how file contents, metadata, and directory information are laid out on a continuous storage medium
- driver software that interprets this specification and provides a consistent API for the kernel to interact with
Storage media are block devices, which are represented in Linux as files. Like other files they get an inode representation in the VFS, and reads and writes to these files are cached in the page cache. When the filesystem driver reads data from the block device to, for instance, enumerate inodes or directory entries, the entire block is pulled into the page cache and then the exact data needed is read, interpreted, and used to fill inode and directory structures. These structures are themselves stored within pages of memory taken from the page cache pool and are subject to cache rules.
Normally read() and write() act similarly. When a userspace program read()s data from a file, the data is copied from the page cache (filled from the block device by the filesystem driver if necessary) into the userspace buffer supplied to the call.
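The copy performed by read() is visible in the shape of the API: the caller owns the destination buffer and the kernel fills it. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Read up to cap bytes of path into buf; returns bytes read or -1. */
ssize_t read_into_buffer(const char *path, void *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        /* Each call copies from the kernel's cached pages into buf. */
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                        /* end of file */
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

Every byte returned this way exists twice in memory, once in the page cache and once in the caller's buffer, which is the copy that memory mapping can avoid.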
Memory-mapped files are (potentially) dealt with a little differently. If the file contents are stored page-aligned and byte-for-byte on the block device (as they are for a sane filesystem), then the filesystem driver can implement an optimization that tells the kernel to map the existing cache pages directly into the process page table.
Process Memory
The fundamental API through which a userspace program interacts with the kernel virtual memory subsystem is the mmap(2) system call (and its glibc wrapper function):
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);
From the Linux manual:
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping (which must be greater than 0).
mmap() can be used to map the contents of a file into the process address space. The process can then read from or write to the file by simply dereferencing a pointer. In particular:
The contents of a file mapping (as opposed to an anonymous mapping; see MAP_ANONYMOUS below), are initialized using length bytes starting at offset in the file (or other object) referred to by the file descriptor fd. offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).
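Reading a file through a mapping might look like the following sketch. The helper name is hypothetical, the whole file is mapped from offset 0 (which trivially satisfies the page-size rule), and empty files are rejected because the length must be greater than 0.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum the bytes of a file through a read-only mapping; -1 on error. */
long sum_file_bytes_mapped(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;                        /* length must be greater than 0 */
    }
    size_t len = (size_t)st.st_size;
    /* offset 0 is trivially a multiple of the page size */
    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                            /* the mapping stays valid */
    if (p == MAP_FAILED)
        return -1;

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];                      /* plain pointer dereference */
    munmap((void *)p, len);
    return sum;
}
```

No read() call and no user-side buffer appear here: the loop reads the file's bytes directly through the pointer returned by mmap().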
Of particular note is how program instructions are accessed, which was introduced earlier in the discussion of the virtual memory subsystem. Consider the execution of a program in the Executable and Linkable Format (ELF). When a program is executed with the exec(3) family of functions (and the underlying system call), Linux replaces the current process image. For a dynamically linked program, the kernel maps the system interpreter (the dynamic linker) into the process address space alongside the program's loadable segments and moves the program counter to the interpreter's entry point. The interpreter parses some of the metadata, maps any shared libraries the program needs, and then jumps to the start of the program in the mapped file. We often refer to the ELF file (as mapped into memory) as the program image. Its code is mapped as an executable, read-only mapping.
Calling mmap() with flags |= MAP_ANONYMOUS is how a process maps general-purpose memory into its address space. Specifically:
The mapping is not backed by any file; its contents are initialized to zero. The fd argument is ignored; however, some implementations require fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this. The offset argument should be zero. The use of MAP_ANONYMOUS in conjunction with MAP_SHARED is supported on Linux only since kernel 2.4.
This is the underlying mechanism that memory allocators (e.g. malloc()) build on. They call mmap() (or brk()) to obtain pages in the process address space, then add metadata and various global data structures to provide a higher-level interface on top. Note that the glibc implementation of malloc() does not hand most free()d memory straight back to the kernel: it is generally kept mapped for reuse in a later malloc() call. The exception is large allocations above the mmap threshold, which glibc maps individually and releases with munmap() on free().
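To make the layering concrete, here is a toy arena allocator built on a single anonymous mapping. It is a sketch of the general idea only and bears no resemblance to glibc's actual malloc(): the struct and function names are invented, there is no free list, and memory is only returned to the kernel when the whole arena is destroyed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* A toy arena: one anonymous mapping handed out in aligned chunks. */
struct arena {
    unsigned char *base;
    size_t size;
    size_t used;
};

int arena_init(struct arena *a, size_t size)
{
    assert(a != NULL && size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->size = size;
    a->used = 0;
    return 0;
}

/* Returns zero-initialized memory, or NULL when the arena is full. */
void *arena_alloc(struct arena *a, size_t n)
{
    size_t aligned = (n + 15u) & ~(size_t)15u;   /* 16-byte granularity */
    if (aligned < n || aligned > a->size - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

void arena_destroy(struct arena *a)
{
    munmap(a->base, a->size);
    a->base = NULL;
    a->size = a->used = 0;
}
```

The bookkeeping fields in struct arena play the role of the allocator metadata mentioned above; the pages themselves come from the kernel through mmap().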
Calling mmap() with the file descriptor of an open file initializes the mapping with the contents of the file, starting at offset. Calling with flags |= MAP_SHARED means that updates to the mapping are visible to other processes that have the same region mapped into their address space and (in the case of a file mapping) that writes to the mapping are carried through to the underlying file. Specifically:
MAP_SHARED: Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).)
A shared mapping (whether anonymous or file-backed) allows the same physical page to be referenced from the page tables of more than one process.
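A shared anonymous mapping inherited across fork() is perhaps the smallest demonstration of two page tables pointing at one physical page. The sketch below uses a hypothetical function name and relies on waitpid() for ordering; real inter-process communication would need proper synchronization.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child stores value into a shared page; parent returns what it reads.
 * Returns -1 on any error, so value must be non-negative. */
int share_value_with_child(int value)
{
    assert(value >= 0);                   /* -1 is reserved for errors */
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        *shared = value;                  /* write lands in the shared page */
        _exit(0);
    }
    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = *shared;                 /* parent sees the child's write */
    munmap(shared, sizeof *shared);
    return result;
}
```

With MAP_PRIVATE in place of MAP_SHARED the child's write would go to its own copy of the page and the parent would still read 0.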


































