Linux Kernel Study Notes: Memory Management


This article attempts to summarize several questions about memory management in Linux:

  1. What memory does the Linux system actually manage?
  2. How is RAM mapped into the memory (physical address) space?
  3. How is memory paged?

Kernel source version: 6.6.69 LTS

Terminology

  • Physical Page / Page Frame: physical memory is managed in units of pages; a page frame is one such physical page
  • Virtual Page: a page in a process's virtual address space
  • Page Frame Number (PFN): the number the operating system assigns to each page frame, in physical-address order, in order to manage the frames
  • Memory Management Unit (MMU): the hardware unit inside the CPU dedicated to translating virtual pages to physical pages
  • Translation Lookaside Buffer (TLB): a high-speed cache that stores recent page-table translation results

Physical Memory

In the narrow sense, "memory" means internal storage (RAM, main memory).
From the CPU's point of view, however, memory is any storage effectively connected to the bus (memory in the broad sense), and Linux memory management manages memory in this broad sense.

The Linux operating system has to manage two memory spaces:

  1. The physical memory space: memory as seen from the CPU
  2. The virtual memory space: memory as seen from a program

The physical memory space contains several kinds of storage, of which RAM is only one part; it also contains the address ranges of external I/O devices, which the system maps into the memory space via MMIO (Memory-Mapped I/O). The system's memory-space layout can be inspected through /proc/iomem:

$ sudo cat /proc/iomem 
00000000-00000fff : Reserved
00001000-0002ffff : System RAM
00030000-0004ffff : Reserved
00050000-0009efff : System RAM
0009f000-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-7d143017 : System RAM
7d143018-7d178c57 : System RAM
7d178c58-7d179017 : System RAM
7d179018-7d182a57 : System RAM
7d182a58-7d1a7fff : System RAM
7d1a8000-7d1b0fff : Reserved
7d1b1000-7e8ecfff : System RAM
7e8ed000-7eb6cfff : Reserved
7eb6d000-7eb7efff : ACPI Tables
7eb7f000-7ebfefff : ACPI Non-volatile Storage
7ebff000-7efb1fff : System RAM
7efb2000-7efb5fff : Reserved
7efb6000-7efb7fff : ACPI Non-volatile Storage
7efb8000-7effffff : System RAM
7f000000-7fffffff : Reserved
80000000-dfffffff : PCI Bus 0000:00
  80000000-83ffffff : 0000:00:01.0
  84000000-87ffffff : 0000:00:01.0
  88000000-881fffff : PCI Bus 0000:0e
  88200000-883fffff : PCI Bus 0000:0d
  88400000-885fffff : PCI Bus 0000:0c
  88600000-887fffff : PCI Bus 0000:0b
  88800000-889fffff : PCI Bus 0000:0a
  88a00000-88bfffff : PCI Bus 0000:09
  88c00000-88dfffff : PCI Bus 0000:08
  88e00000-88ffffff : PCI Bus 0000:07
  89000000-891fffff : PCI Bus 0000:06
    89000000-89000fff : 0000:06:00.0
  89200000-893fffff : PCI Bus 0000:05
  89400000-895fffff : PCI Bus 0000:04
    89400000-89400fff : 0000:04:00.0
  89600000-897fffff : PCI Bus 0000:03
    89600000-89600fff : 0000:03:00.0
  89800000-899fffff : PCI Bus 0000:02
    89800000-89803fff : 0000:02:00.0
      89800000-89803fff : xhci-hcd
  89a00000-89bfffff : PCI Bus 0000:01
    89a00000-89a00fff : 0000:01:00.0
    89a80000-89afffff : 0000:01:00.0
  89c80000-89c83fff : 0000:00:1b.0
    89c80000-89c83fff : ICH HD audio
  89c84000-89c85fff : 0000:00:01.0
  89c86000-89c86fff : 0000:00:1f.2
    89c86000-89c86fff : ahci
  89c87000-89c87fff : 0000:00:03.5
  89c88000-89c88fff : 0000:00:03.4
  89c89000-89c89fff : 0000:00:03.3
  89c8a000-89c8afff : 0000:00:03.2
  89c8b000-89c8bfff : 0000:00:03.1
  89c8c000-89c8cfff : 0000:00:03.0
  89c8d000-89c8dfff : 0000:00:02.7
  89c8e000-89c8efff : 0000:00:02.6
  89c8f000-89c8ffff : 0000:00:02.5
  89c90000-89c90fff : 0000:00:02.4
  89c91000-89c91fff : 0000:00:02.3
  89c92000-89c92fff : 0000:00:02.2
  89c93000-89c93fff : 0000:00:02.1
  89c94000-89c94fff : 0000:00:02.0
e0000000-efffffff : PCI ECAM 0000 [bus 00-ff]
  e0000000-efffffff : Reserved
    e0000000-efffffff : pnp 00:04
f0000000-febfffff : PCI Bus 0000:00
fec00000-fec003ff : IOAPIC 0
fed1f410-fed1f414 : iTCO_wdt.1.auto
feffc000-feffffff : Reserved
100000000-17fffffff : System RAM
  117800000-118dfffff : Kernel code
  118e00000-119bfefff : Kernel rodata
  119c00000-11a056aff : Kernel data
  11a561000-11a9fffff : Kernel bss
7000000000-77ffffffff : PCI Bus 0000:00
  7000000000-700fffffff : PCI Bus 0000:01
    7000000000-7000003fff : 0000:01:00.0
      7000000000-7000003fff : virtio-pci-modern
  7010000000-701fffffff : PCI Bus 0000:02
  7020000000-702fffffff : PCI Bus 0000:03
    7020000000-7020003fff : 0000:03:00.0
      7020000000-7020003fff : virtio-pci-modern
  7030000000-703fffffff : PCI Bus 0000:04
    7030000000-7030003fff : 0000:04:00.0
      7030000000-7030003fff : virtio-pci-modern
  7040000000-704fffffff : PCI Bus 0000:05
    7040000000-7040003fff : 0000:05:00.0
      7040000000-7040003fff : virtio-pci-modern
  7050000000-705fffffff : PCI Bus 0000:06
    7050000000-7050003fff : 0000:06:00.0
      7050000000-7050003fff : virtio-pci-modern
  7060000000-706fffffff : PCI Bus 0000:07
  7070000000-707fffffff : PCI Bus 0000:08
  7080000000-708fffffff : PCI Bus 0000:09
  7090000000-709fffffff : PCI Bus 0000:0a
  70a0000000-70afffffff : PCI Bus 0000:0b
  70b0000000-70bfffffff : PCI Bus 0000:0c
  70c0000000-70cfffffff : PCI Bus 0000:0d
  70d0000000-70dfffffff : PCI Bus 0000:0e

The memory space as a whole is generally not contiguous; there are unused holes in it.

RAM to Memory Space

The mapping from RAM into the memory space is scattered rather than linear: a 4 GB memory module, for example, is not simply mapped to the 0-4 GB range of the memory space. On x86, the mapping policy is decided by the BIOS.
Memory management is, in essence, maintaining the relationships among three things: the memory media (RAM + MMIO), the physical memory space, and the virtual memory space.

Memory Mapping

MMU

  • What the MMU is
  • How the MMU works
  • The MMU's workflow
  • Simulating an MMU in software

Physical Memory Management

The smallest unit in which the processor manages physical memory is the page.

Data Structures

The Linux kernel uses the struct page data structure to describe a physical page.

struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	/*
	 * Five words (20/40 bytes) are available in this union.
	 * WARNING: bit 0 of the first word is used for PageTail(). That
	 * means the other users of this union MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
	union {
		struct {	/* Page cache and anonymous pages */
			/**
			 * @lru: Pageout list, eg. active_list protected by
			 * lruvec->lru_lock.  Sometimes used as a generic list
			 * by the page owner.
			 */
			union {
				struct list_head lru;

				/* Or, for the Unevictable "LRU list" slot */
				struct {
					/* Always even, to negate PageTail */
					void *__filler;
					/* Count page's or folio's mlocks */
					unsigned int mlock_count;
				};

				/* Or, free page */
				struct list_head buddy_list;
				struct list_head pcp_list;
			};
			/* See page-flags.h for PAGE_MAPPING_FLAGS */
			struct address_space *mapping;
			union {
				pgoff_t index;		/* Our offset within mapping. */
				unsigned long share;	/* share count for fsdax */
			};
			/**
			 * @private: Mapping-private opaque data.
			 * Usually used for buffer_heads if PagePrivate.
			 * Used for swp_entry_t if PageSwapCache.
			 * Indicates order in the buddy system if PageBuddy.
			 */
			unsigned long private;
		};
		struct {	/* page_pool used by netstack */
			/**
			 * @pp_magic: magic value to avoid recycling non
			 * page_pool allocated pages.
			 */
			unsigned long pp_magic;
			struct page_pool *pp;
			unsigned long _pp_mapping_pad;
			unsigned long dma_addr;
			union {
				/**
				 * dma_addr_upper: might require a 64-bit
				 * value on 32-bit architectures.
				 */
				unsigned long dma_addr_upper;
				/**
				 * For frag page support, not supported in
				 * 32-bit architectures with 64-bit DMA.
				 */
				atomic_long_t pp_frag_count;
			};
		};
		struct {	/* Tail pages of compound page */
			unsigned long compound_head;	/* Bit zero is set */
		};
		struct {	/* ZONE_DEVICE pages */
			/** @pgmap: Points to the hosting device page map. */
			struct dev_pagemap *pgmap;
			void *zone_device_data;
			/*
			 * ZONE_DEVICE private pages are counted as being
			 * mapped so the next 3 words hold the mapping, index,
			 * and private fields from the source anonymous or
			 * page cache page while the page is migrated to device
			 * private memory.
			 * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
			 * use the mapping, index, and private fields when
			 * pmem backed DAX files are mapped.
			 */
		};

		/** @rcu_head: You can use this to free a page by RCU. */
		struct rcu_head rcu_head;
	};

	union {		/* This union is 4 bytes in size. */
		/*
		 * If the page can be mapped to userspace, encodes the number
		 * of times this page is referenced by a page table.
		 */
		atomic_t _mapcount;

		/*
		 * If the page is neither PageSlab nor mappable to userspace,
		 * the value stored here may help determine what this page
		 * is used for.  See page-flags.h for a list of page types
		 * which are currently stored here.
		 */
		unsigned int page_type;
	};

	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
	atomic_t _refcount;

#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif

	/*
	 * On machines where all RAM is mapped into kernel address space,
	 * we can simply calculate the virtual address. On machines with
	 * highmem some memory is mapped into kernel virtual memory
	 * dynamically, so we need a place to store that address.
	 * Note that this field could be 16 bits on x86 ... ;)
	 *
	 * Architectures with slow multiplication can define
	 * WANT_PAGE_VIRTUAL in asm/page.h
	 */
#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef CONFIG_KMSAN
	/*
	 * KMSAN metadata for this page:
	 *  - shadow page: every bit indicates whether the corresponding
	 *    bit of the original page is initialized (0) or not (1);
	 *  - origin page: every 4 bytes contain an id of the stack trace
	 *    where the uninitialized value was created.
	 */
	struct page *kmsan_shadow;
	struct page *kmsan_origin;
#endif

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
} _struct_page_alignment;
  • flags is the page's set of flag bits, split into two parts: the low bits hold the flags themselves, defined in include/linux/page-flags.h, while the high bits are reserved to encode the zone, node (for NUMA), and SPARSEMEM section the page belongs to

Buddy System

The buddy system is one of the most commonly used dynamic storage management schemes in operating systems. When a request arrives, the buddy system hands out a block of a suitable size; when the user frees a block, the system reclaims it, merging it with its free "buddy" whenever possible.

The slab Mechanism

The core idea of the slab allocator is to build pools of cached objects during idle time, consisting of per-CPU local object pools and a shared object pool.
Local object pool:
When each slab descriptor is created, a local object cache pool is created for every CPU, so that a request for a free object from that slab descriptor is served first from the current CPU's local pool. Because a local pool is only accessed by its own CPU, this reduces lock contention between the cores of a multi-core system.
Shared object pool:
Shared by all CPUs. When a local pool runs out of free objects, a batch of free objects is moved from the shared pool into the local pool.

The slob Mechanism

The slub Mechanism

Virtual Memory

Virtual Memory Management

Data Structures

The Linux kernel manages all of a process's memory regions, and their corresponding page-table mappings, through the mm_struct data structure.
mm_struct is defined in include/linux/mm_types.h:

struct mm_struct {
	struct {
		/*
		 * Fields which are often written to are placed in a separate
		 * cache line.
		 */
		struct {
			/**
			 * @mm_count: The number of references to &struct
			 * mm_struct (@mm_users count as 1).
			 *
			 * Use mmgrab()/mmdrop() to modify. When this drops to
			 * 0, the &struct mm_struct is freed.
			 */
			atomic_t mm_count;
		} ____cacheline_aligned_in_smp;

		struct maple_tree mm_mt;
#ifdef CONFIG_MMU
		unsigned long (*get_unmapped_area) (struct file *filp,
				unsigned long addr, unsigned long len,
				unsigned long pgoff, unsigned long flags);
#endif
		unsigned long mmap_base;	/* base of mmap area */
		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
		/* Base addresses for compatible mmap() */
		unsigned long mmap_compat_base;
		unsigned long mmap_compat_legacy_base;
#endif
		unsigned long task_size;	/* size of task vm space */
		pgd_t * pgd;

#ifdef CONFIG_MEMBARRIER
		/**
		 * @membarrier_state: Flags controlling membarrier behavior.
		 *
		 * This field is close to @pgd to hopefully fit in the same
		 * cache-line, which needs to be touched by switch_mm().
		 */
		atomic_t membarrier_state;
#endif

		/**
		 * @mm_users: The number of users including userspace.
		 *
		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
		 * drops to 0 (i.e. when the task exits and there are no other
		 * temporary reference holders), we also release a reference on
		 * @mm_count (which may then free the &struct mm_struct if
		 * @mm_count also drops to 0).
		 */
		atomic_t mm_users;

#ifdef CONFIG_SCHED_MM_CID
		/**
		 * @pcpu_cid: Per-cpu current cid.
		 *
		 * Keep track of the currently allocated mm_cid for each cpu.
		 * The per-cpu mm_cid values are serialized by their respective
		 * runqueue locks.
		 */
		struct mm_cid __percpu *pcpu_cid;
		/*
		 * @mm_cid_next_scan: Next mm_cid scan (in jiffies).
		 *
		 * When the next mm_cid scan is due (in jiffies).
		 */
		unsigned long mm_cid_next_scan;
#endif
#ifdef CONFIG_MMU
		atomic_long_t pgtables_bytes;	/* size of all page tables */
#endif
		int map_count;			/* number of VMAs */

		spinlock_t page_table_lock; /* Protects page tables and some
					     * counters
					     */
		/*
		 * With some kernel config, the current mmap_lock's offset
		 * inside 'mm_struct' is at 0x120, which is very optimal, as
		 * its two hot fields 'count' and 'owner' sit in 2 different
		 * cachelines,  and when mmap_lock is highly contended, both
		 * of the 2 fields will be accessed frequently, current layout
		 * will help to reduce cache bouncing.
		 *
		 * So please be careful with adding new fields before
		 * mmap_lock, which can easily push the 2 fields into one
		 * cacheline.
		 */
		struct rw_semaphore mmap_lock;

		struct list_head mmlist; /* List of maybe swapped mm's.	These
					  * are globally strung together off
					  * init_mm.mmlist, and are protected
					  * by mmlist_lock
					  */
#ifdef CONFIG_PER_VMA_LOCK
		/*
		 * This field has lock-like semantics, meaning it is sometimes
		 * accessed with ACQUIRE/RELEASE semantics.
		 * Roughly speaking, incrementing the sequence number is
		 * equivalent to releasing locks on VMAs; reading the sequence
		 * number can be part of taking a read lock on a VMA.
		 *
		 * Can be modified under write mmap_lock using RELEASE
		 * semantics.
		 * Can be read with no other protection when holding write
		 * mmap_lock.
		 * Can be read with ACQUIRE semantics if not holding write
		 * mmap_lock.
		 */
		int mm_lock_seq;
#endif


		unsigned long hiwater_rss; /* High-watermark of RSS usage */
		unsigned long hiwater_vm;  /* High-water virtual memory usage */

		unsigned long total_vm;	   /* Total pages mapped */
		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
		atomic64_t    pinned_vm;   /* Refcount permanently increased */
		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
		unsigned long stack_vm;	   /* VM_STACK */
		unsigned long def_flags;

		/**
		 * @write_protect_seq: Locked when any thread is write
		 * protecting pages mapped by this mm to enforce a later COW,
		 * for instance during page table copying for fork().
		 */
		seqcount_t write_protect_seq;

		spinlock_t arg_lock; /* protect the below fields */

		unsigned long start_code, end_code, start_data, end_data;
		unsigned long start_brk, brk, start_stack;
		unsigned long arg_start, arg_end, env_start, env_end;

		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

		struct percpu_counter rss_stat[NR_MM_COUNTERS];

		struct linux_binfmt *binfmt;

		/* Architecture-specific MM context */
		mm_context_t context;

		unsigned long flags; /* Must use atomic bitops to access */

#ifdef CONFIG_AIO
		spinlock_t			ioctx_lock;
		struct kioctx_table __rcu	*ioctx_table;
#endif
#ifdef CONFIG_MEMCG
		/*
		 * "owner" points to a task that is regarded as the canonical
		 * user/owner of this mm. All of the following must be true in
		 * order for it to be changed:
		 *
		 * current == mm->owner
		 * current->mm != mm
		 * new_owner->mm == mm
		 * new_owner->alloc_lock is held
		 */
		struct task_struct __rcu *owner;
#endif
		struct user_namespace *user_ns;

		/* store ref to file /proc/<pid>/exe symlink points to */
		struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
		struct mmu_notifier_subscriptions *notifier_subscriptions;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
		/*
		 * numa_next_scan is the next time that PTEs will be remapped
		 * PROT_NONE to trigger NUMA hinting faults; such faults gather
		 * statistics and migrate pages to new nodes if necessary.
		 */
		unsigned long numa_next_scan;

		/* Restart point for scanning and remapping PTEs. */
		unsigned long numa_scan_offset;

		/* numa_scan_seq prevents two threads remapping PTEs. */
		int numa_scan_seq;
#endif
		/*
		 * An operation with batched TLB flushing is going on. Anything
		 * that can move process memory needs to flush the TLB when
		 * moving a PROT_NONE mapped page.
		 */
		atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
		/* See flush_tlb_batched_pending() */
		atomic_t tlb_flush_batched;
#endif
		struct uprobes_state uprobes_state;
#ifdef CONFIG_PREEMPT_RT
		struct rcu_head delayed_drop;
#endif
#ifdef CONFIG_HUGETLB_PAGE
		atomic_long_t hugetlb_usage;
#endif
		struct work_struct async_put_work;

#ifdef CONFIG_IOMMU_SVA
		u32 pasid;
#endif
#ifdef CONFIG_KSM
		/*
		 * Represent how many pages of this process are involved in KSM
		 * merging (not including ksm_zero_pages).
		 */
		unsigned long ksm_merging_pages;
		/*
		 * Represent how many pages are checked for ksm merging
		 * including merged and not merged.
		 */
		unsigned long ksm_rmap_items;
		/*
		 * Represent how many empty pages are merged with kernel zero
		 * pages when enabling KSM use_zero_pages.
		 */
		atomic_long_t ksm_zero_pages;
#endif /* CONFIG_KSM */
#ifdef CONFIG_LRU_GEN
		struct {
			/* this mm_struct is on lru_gen_mm_list */
			struct list_head list;
			/*
			 * Set when switching to this mm_struct, as a hint of
			 * whether it has been used since the last time per-node
			 * page table walkers cleared the corresponding bits.
			 */
			unsigned long bitmap;
#ifdef CONFIG_MEMCG
			/* points to the memcg of "owner" above */
			struct mem_cgroup *memcg;
#endif
		} lru_gen;
#endif /* CONFIG_LRU_GEN */
	} __randomize_layout;

	/*
	 * The mm_cpumask needs to be at the end of mm_struct, because it
	 * is dynamically sized based on nr_cpu_ids.
	 */
	unsigned long cpu_bitmap[];
};

struct maple_tree, introduced in kernel 6.1, is the new data structure that replaces the linked list and red-black tree for managing VMAs; see Introducing maple trees, Maple Tree, and Kernel Infrastructure Analysis (2): Maple Tree.
vm_area_struct is the data structure the Linux kernel uses to describe a region of a process's virtual address space; it is defined in include/linux/mm_types.h:

/*
 * This struct describes a virtual memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	/* The first cache line has the info for VMA tree walking. */

	union {
		struct {
			/* VMA covers [vm_start; vm_end) addresses within mm */
			unsigned long vm_start;
			unsigned long vm_end;
		};
#ifdef CONFIG_PER_VMA_LOCK
		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
#endif
	};

	struct mm_struct *vm_mm;	/* The address space we belong to. */
	pgprot_t vm_page_prot;          /* Access permissions of this VMA. */

	/*
	 * Flags, see mm.h.
	 * To modify use vm_flags_{init|reset|set|clear|mod} functions.
	 */
	union {
		const vm_flags_t vm_flags;
		vm_flags_t __private __vm_flags;
	};

#ifdef CONFIG_PER_VMA_LOCK
	/*
	 * Can only be written (using WRITE_ONCE()) while holding both:
	 *  - mmap_lock (in write mode)
	 *  - vm_lock->lock (in write mode)
	 * Can be read reliably while holding one of:
	 *  - mmap_lock (in read or write mode)
	 *  - vm_lock->lock (in read or write mode)
	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
	 * while holding nothing (except RCU to keep the VMA struct allocated).
	 *
	 * This sequence counter is explicitly allowed to overflow; sequence
	 * counter reuse can only lead to occasional unnecessary use of the
	 * slowpath.
	 */
	int vm_lock_seq;
	struct vma_lock *vm_lock;

	/* Flag to indicate areas detached from the mm->mm_mt tree */
	bool detached;
#endif

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap interval tree.
	 *
	 */
	struct {
		struct rb_node rb;
		unsigned long rb_subtree_last;
	} shared;

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_chain; /* Serialized by mmap_lock &
					  * page_table_lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
	struct file * vm_file;		/* File we map to (can be NULL). */
	void * vm_private_data;		/* was vm_pte (shared mem) */

#ifdef CONFIG_ANON_VMA_NAME
	/*
	 * For private and shared anonymous mappings, a pointer to a null
	 * terminated string containing the name given to the vma, or NULL if
	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
	 */
	struct anon_vma_name *anon_name;
#endif
#ifdef CONFIG_SWAP
	atomic_long_t swap_readahead_info;
#endif
#ifndef CONFIG_MMU
	struct vm_region *vm_region;	/* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
#ifdef CONFIG_NUMA_BALANCING
	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;

Page Faults

The Linux kernel establishes the mapping between virtual memory and physical memory through page faults.
Page fault handling is among the most complex and important parts of memory management; it has to consider many details, including anonymous pages, KSM pages, page-cache pages, copy-on-write, and private versus shared mappings.
Because page fault handling depends on the processor architecture, the low-level handling path lives in the architecture-specific part of the kernel source.

do_page_fault

Anonymous Page Faults

File-Mapping Page Faults

Copy-on-Write Page Faults

Summary

Page Reclaim Algorithms

The LRU Algorithm

The Second-Chance Algorithm

The OOM Killer Mechanism

Tools

top

PID: process ID
USER: process owner
PR: process priority
NI: process nice value
VIRT: virtual memory used by the process, in KB
RES: resident (non-swapped-out) physical memory used by the process, in KB
SHR: shared memory size, in KB
S: process state (D = uninterruptible sleep, R = running, S = sleeping, T = traced/stopped, Z = zombie)
%CPU: percentage of CPU time used since the last update
%MEM: percentage of physical memory used by the process
TIME+: total CPU time used by the process, in units of 10 ms
COMMAND: process name or command line

vmstat

$ vmstat --help

Usage:
 vmstat [options] [delay [count]]

Options:
 -a, --active           active/inactive memory
 -f, --forks            number of forks since boot
 -m, --slabs            slabinfo
 -n, --one-header       do not redisplay header
 -s, --stats            event counter statistics
 -d, --disk             disk statistics
 -D, --disk-sum         summarize disk statistics
 -p, --partition <dev>  partition specific statistics
 -S, --unit <char>      define display unit
 -w, --wide             wide output
 -t, --timestamp        show timestamp

 -h, --help     display this help and exit
 -V, --version  output version information and exit
$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 725336  94036 1956992    0    0   366   494  261  451  3  1 95  0  0

r: number of processes running or waiting to run in the run queue
b: number of blocked processes
swpd: amount of memory swapped out to the swap area
free: free physical memory
buff: memory used as buffer cache
cache: memory used as page cache
si: amount of memory swapped in from the swap area per second
so: amount of memory swapped out to the swap area per second
bi: blocks read from block devices per second
bo: blocks written to block devices per second
in: interrupts per second, including the clock interrupt
cs: context switches per second
us: percentage of time spent running user code
sy: percentage of time spent running kernel (system) code
wa: percentage of time spent waiting for I/O
id: percentage of idle time

References

Linux
Linux Driver Study Notes: Memory Management
Introducing maple trees
Maple Tree
Kernel Infrastructure Analysis (2): Maple Tree

posted @ 2026-03-01 13:29  main_c