Linux Dynamic Memory Management for User Space

Linux handles kernel-space and user-space memory allocation requests differently.

The kernel trusts itself: whatever amount of memory the kernel asks for is allocated immediately and in full.

The kernel does not trust applications, however. When an application allocates a range of memory it is usually only reserving it and does not access it right away. Since there are many applications, memory that has been allocated but not yet touched would otherwise account for a large share of the total.

 

1. The kernel therefore defers the actual allocation for an application's request and performs it from the Page Fault exception handler.

2. When a user-space application allocates memory, what it gets back is not page structures (struct page) but a new linear address interval that it is allowed to start using.

 

Unlike the kernel, an application does not occupy its address space contiguously; the space is split into separate intervals. If the application tries to access an address that none of these intervals covers, an exception is raised; this is the origin of the familiar segmentation fault (SIGSEGV).

Accessing an address inside an interval can also fault: the space is merely reserved and the corresponding page may not be ready yet, so a Page Fault is triggered. The key question is who handles that fault. If the backing store is a file, the file/inode is responsible for preparing the page; if the page was allocated dynamically (anonymous memory), a zeroed page is returned.
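
This deferred behaviour can be observed from user space. The following minimal sketch (mine, not from the kernel source; assumes Linux and glibc) maps anonymous memory and uses mincore() to show that a physical page only becomes resident after the first write:

/* Minimal sketch: an anonymous mmap() only reserves a linear address
 * interval; the page-fault handler materializes pages on first access.
 * mincore() reports per-page residency. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 16 * page;
    unsigned char vec[16];

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    mincore(p, len, vec);
    printf("first page resident after mmap:  %d\n", vec[0] & 1); /* usually 0 */

    p[0] = 1;          /* first touch: page fault, kernel installs a zeroed page */
    mincore(p, len, vec);
    printf("first page resident after write: %d\n", vec[0] & 1); /* now 1 */

    munmap(p, len);
    return 0;
}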

 

Note that the CPU knows nothing about these intervals; it simply raises a page fault when the page tables do not permit the access. The intervals are defined by the operating system, so it is the kernel's page-fault handler that decides whether the address lies inside a valid interval and, if not, reports the error (SIGSEGV) to the process.

 

Memory Descriptor

The memory descriptor is kept in struct mm_struct and is referenced from the process descriptor through task_struct->mm.

   1: struct mm_struct {
   2:     struct vm_area_struct * mmap;        /* list of VMAs */
   3:     struct rb_root mm_rb;
   4:     struct vm_area_struct * mmap_cache;    /* last find_vma result */
   5: #ifdef CONFIG_MMU
   6:     unsigned long (*get_unmapped_area) (struct file *filp,
   7:                 unsigned long addr, unsigned long len,
   8:                 unsigned long pgoff, unsigned long flags);
   9:     void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
  10: #endif
  11:     unsigned long mmap_base;        /* base of mmap area */
  12:     unsigned long task_size;        /* size of task vm space */
  13:     unsigned long cached_hole_size;     /* if non-zero, the largest hole below free_area_cache */
  14:     unsigned long free_area_cache;        /* first hole of size cached_hole_size or larger */
  15:     pgd_t * pgd;
  16:     atomic_t mm_users;            /* How many users with user space? */
  17:     atomic_t mm_count;            /* How many references to "struct mm_struct" (users count as 1) */
  18:     int map_count;                /* number of VMAs */
  19:  
  20:     spinlock_t page_table_lock;        /* Protects page tables and some counters */
  21:     struct rw_semaphore mmap_sem;
  22:  
  23:     struct list_head mmlist;        /* List of maybe swapped mm's.    These are globally strung
  24:                          * together off init_mm.mmlist, and are protected
  25:                          * by mmlist_lock
  26:                          */
  27:  
  28:  
  29:     unsigned long hiwater_rss;    /* High-watermark of RSS usage */
  30:     unsigned long hiwater_vm;    /* High-water virtual memory usage */
  31:  
  32:     unsigned long total_vm, locked_vm, shared_vm, exec_vm;
  33:     unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
  34:     unsigned long start_code, end_code, start_data, end_data;
  35:     unsigned long start_brk, brk, start_stack;
  36:     unsigned long arg_start, arg_end, env_start, env_end;
  37:  
  38:     unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
  39:  
  40:     /*
  41:      * Special counters, in some configurations protected by the
  42:      * page_table_lock, in other configurations by being atomic.
  43:      */
  44:     struct mm_rss_stat rss_stat;
  45:  
  46:     struct linux_binfmt *binfmt;
  47:  
  48:     cpumask_var_t cpu_vm_mask_var;
  49:  
  50:     /* Architecture-specific MM context */
  51:     mm_context_t context;
  52:  
  53:     /* Swap token stuff */
  54:     /*
  55:      * Last value of global fault stamp as seen by this process.
  56:      * In other words, this value gives an indication of how long
  57:      * it has been since this task got the token.
  58:      * Look at mm/thrash.c
  59:      */
  60:     unsigned int faultstamp;
  61:     unsigned int token_priority;
  62:     unsigned int last_interval;
  63:  
  64:     /* How many tasks sharing this mm are OOM_DISABLE */
  65:     atomic_t oom_disable_count;
  66:  
  67:     unsigned long flags; /* Must use atomic bitops to access the bits */
  68:  
  69:     struct core_state *core_state; /* coredumping support */
  70: #ifdef CONFIG_AIO
  71:     spinlock_t        ioctx_lock;
  72:     struct hlist_head    ioctx_list;
  73: #endif
  74: #ifdef CONFIG_MM_OWNER
  75:     /*
  76:      * "owner" points to a task that is regarded as the canonical
  77:      * user/owner of this mm. All of the following must be true in
  78:      * order for it to be changed:
  79:      *
  80:      * current == mm->owner
  81:      * current->mm != mm
  82:      * new_owner->mm == mm
  83:      * new_owner->alloc_lock is held
  84:      */
  85:     struct task_struct __rcu *owner;
  86: #endif
  87:  
  88:     /* store ref to file /proc/<pid>/exe symlink points to */
  89:     struct file *exe_file;
  90:     unsigned long num_exe_file_vmas;
  91: #ifdef CONFIG_MMU_NOTIFIER
  92:     struct mmu_notifier_mm *mmu_notifier_mm;
  93: #endif
  94: #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  95:     pgtable_t pmd_huge_pte; /* protected by page_table_lock */
  96: #endif
  97: #ifdef CONFIG_CPUMASK_OFFSTACK
  98:     struct cpumask cpumask_allocation;
  99: #endif
 100: };

Each memory region is described by a struct vm_area_struct.

Each memory descriptor manages its memory regions in two ways (see the sketch after this list):

1. mmap, a linked list of all memory regions belonging to this address space, sorted from low to high address. [convenient for traversal]

2. mm_rb, a red-black tree over the same regions. [convenient for lookup]
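
A rough kernel-side sketch of how the two views are typically used (interfaces as in kernels of this generation; dump_regions_and_lookup() is a made-up helper, and the caller is assumed to hold mm->mmap_sem for reading):

/* Sketch: two views of the same set of VMAs. */
static void dump_regions_and_lookup(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma;

    /* 1. mmap: walk every region in ascending address order */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        printk(KERN_INFO "region %08lx-%08lx flags %08lx\n",
               vma->vm_start, vma->vm_end, vma->vm_flags);

    /* 2. mm_rb: locate the region covering (or just above) an address */
    vma = find_vma(mm, addr);               /* red-black tree lookup */
    if (vma && vma->vm_start <= addr)
        printk(KERN_INFO "%08lx lies in %08lx-%08lx\n",
               addr, vma->vm_start, vma->vm_end);
}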

 

Other important members:

pgd, the address of this process's page global directory (the linear address of the page directory; see http://www.cnblogs.com/long123king/p/3506893.html).

map_count, the number of memory regions in this address space.

mmlist (a struct list_head), which links this memory descriptor into a global list of mm's that may have swapped-out pages.

start_code/end_code/start_data/end_data, self-explanatory.

brk, the current end of the process heap (start_brk marks its beginning).

context, the architecture-specific MM context of the process; on x86 this essentially holds the LDT information.

 

Memory descriptors of kernel threads

   1: struct task_struct {
   2:  
   3: ......
   4:  
   5: struct mm_struct *mm, *active_mm;
   6: ......
   7:  
   8: }

A kernel thread runs only in kernel mode, so it never uses the 0-3GB linear address range and has no notion of memory regions; the kernel-mode linear address space is contiguous.

Because the page-table entries covering the 3GB-4GB range are identical in every process's page tables, it makes no difference whose page directory a kernel thread uses. To avoid needlessly flushing the TLB and the CPU's hardware caches, the kernel simply keeps using the page directory of the previously running process.

 

This is why task_struct contains two memory-descriptor members: mm and active_mm.

For an ordinary process the two are identical and both point to the process's own memory descriptor.

A kernel thread has no memory descriptor of its own, so its mm is NULL. When a kernel thread is scheduled, the active_mm field of its task_struct is set to the active_mm of the previously running process.
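
A simplified sketch of the relevant logic in context_switch() (kernel/sched.c in kernels of roughly this vintage); it illustrates the mm/active_mm handover and is not the literal source:

/* Simplified sketch of the mm/active_mm handover during a context switch. */
static inline void
context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
{
    struct mm_struct *mm    = next->mm;
    struct mm_struct *oldmm = prev->active_mm;

    if (!mm) {                          /* next is a kernel thread */
        next->active_mm = oldmm;        /* borrow the previous address space */
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);    /* no TLB flush needed */
    } else
        switch_mm(oldmm, mm, next);     /* load next->mm->pgd into CR3 */

    if (!prev->mm) {                    /* prev was a kernel thread */
        prev->active_mm = NULL;         /* drop the borrowed reference */
        rq->prev_mm = oldmm;            /* mmdrop()ed after the switch */
    }

    switch_to(prev, next, prev);
}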

 

Whenever kernel-space page-table entries need to be remapped, the kernel updates the canonical set of kernel page tables rooted at swapper_pg_dir.

The change is then propagated to each process's page directory lazily, from the Page Fault handler (the vmalloc_fault path shown later).

 

Memory Regions

   1: /*
   2:  * This struct defines a memory VMM memory area. There is one of these
   3:  * per VM-area/task.  A VM area is any part of the process virtual memory
   4:  * space that has a special rule for the page-fault handlers (ie a shared
   5:  * library, the executable area etc).
   6:  */
   7: struct vm_area_struct {
   8:     struct mm_struct * vm_mm;    /* The address space we belong to. */
   9:     unsigned long vm_start;        /* Our start address within vm_mm. */
  10:     unsigned long vm_end;        /* The first byte after our end address
  11:                        within vm_mm. */
  12:  
  13:     /* linked list of VM areas per task, sorted by address */
  14:     struct vm_area_struct *vm_next, *vm_prev;
  15:  
  16:     pgprot_t vm_page_prot;        /* Access permissions of this VMA. */
  17:     unsigned long vm_flags;        /* Flags, see mm.h. */
  18:  
  19:     struct rb_node vm_rb;
  20:  
  21:     /*
  22:      * For areas with an address space and backing store,
  23:      * linkage into the address_space->i_mmap prio tree, or
  24:      * linkage to the list of like vmas hanging off its node, or
  25:      * linkage of vma in the address_space->i_mmap_nonlinear list.
  26:      */
  27:     union {
  28:         struct {
  29:             struct list_head list;
  30:             void *parent;    /* aligns with prio_tree_node parent */
  31:             struct vm_area_struct *head;
  32:         } vm_set;
  33:  
  34:         struct raw_prio_tree_node prio_tree_node;
  35:     } shared;
  36:  
  37:     /*
  38:      * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
  39:      * list, after a COW of one of the file pages.    A MAP_SHARED vma
  40:      * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
  41:      * or brk vma (with NULL file) can only be in an anon_vma list.
  42:      */
  43:     struct list_head anon_vma_chain; /* Serialized by mmap_sem &
  44:                       * page_table_lock */
  45:     struct anon_vma *anon_vma;    /* Serialized by page_table_lock */
  46:  
  47:     /* Function pointers to deal with this struct. */
  48:     const struct vm_operations_struct *vm_ops;
  49:  
  50:     /* Information about our backing store: */
  51:     unsigned long vm_pgoff;        /* Offset (within vm_file) in PAGE_SIZE
  52:                        units, *not* PAGE_CACHE_SIZE */
  53:     struct file * vm_file;        /* File we map to (can be NULL). */
  54:     void * vm_private_data;        /* was vm_pte (shared mem) */
  55:  
  56: #ifndef CONFIG_MMU
  57:     struct vm_region *vm_region;    /* NOMMU mapping region */
  58: #endif
  59: #ifdef CONFIG_NUMA
  60:     struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
  61: #endif
  62: };

Inside mm_struct, a vm_area_struct can be reached in two ways: through the mmap member, which maintains a doubly-linked list sorted by ascending linear address, and through mm_rb, which maintains a red-black tree.

 

In practice, though, there is only one vm_area_struct instance representing each memory region; the two data structures simply maintain the same objects jointly.

 

The benefit is that each data structure serves a different purpose:

1. The red-black tree is used to quickly find the region containing a given linear address.

2. The doubly-linked list is used to traverse all regions in address order.

 

Page access permissions

They are kept in three places:

1. Each page-table entry carries flags describing the access permissions of its page. This is what the x86 hardware checks when deciding whether an access is allowed.

2. Each page descriptor (struct page) carries flags, used for checks made by the operating system.

3. Each memory region (vm_area_struct) carries flags (vm_flags) describing the access permissions of every page in that region.
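
From user space, the region-level permissions are what the prot argument of mmap() and mprotect() controls. A minimal sketch (mine, not from the article):

/* Minimal sketch: the prot argument of mmap()/mprotect() becomes the
 * vm_flags / vm_page_prot of the region, visible in /proc/<pid>/maps. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("region %p mapped rw-p, see /proc/%d/maps\n", (void *)p, (int)getpid());

    /* Drop write permission: the kernel updates the region's flags and the
     * page-table entries, so a later write would raise SIGSEGV. */
    if (mprotect(p, page, PROT_READ) != 0) { perror("mprotect"); return 1; }
    printf("region %p is now r--p\n", (void *)p);

    munmap(p, page);
    return 0;
}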

 

Operations on memory regions

find_vma finds the first memory region whose vm_end is greater than the target address (the region returned may or may not contain the address).

   1: /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
   2: struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
   3: {
   4:     struct vm_area_struct *vma = NULL;
   5:  
   6:     if (mm) {
   7:         /* Check the cache first. */
   8:         /* (Cache hit rate is typically around 35%.) */
   9:         vma = ACCESS_ONCE(mm->mmap_cache);
  10:         if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
  11:             struct rb_node * rb_node;
  12:  
  13:             rb_node = mm->mm_rb.rb_node;
  14:             vma = NULL;
  15:  
  16:             while (rb_node) {
  17:                 struct vm_area_struct * vma_tmp;
  18:  
  19:                 vma_tmp = rb_entry(rb_node,
  20:                         struct vm_area_struct, vm_rb);
  21:  
  22:                 if (vma_tmp->vm_end > addr) {
  23:                     vma = vma_tmp;
  24:                     if (vma_tmp->vm_start <= addr)
  25:                         break;
  26:                     rb_node = rb_node->rb_left;
  27:                 } else
  28:                     rb_node = rb_node->rb_right;
  29:             }
  30:             if (vma)
  31:                 mm->mmap_cache = vma;
  32:         }
  33:     }
  34:     return vma;
  35: }

find_vma_intersection finds the first memory region that intersects the given address range.

   1: /* Look up the first VMA which intersects the interval start_addr..end_addr-1,
   2:    NULL if none.  Assume start_addr < end_addr. */
   3: static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
   4: {
   5:     struct vm_area_struct * vma = find_vma(mm,start_addr);
   6:  
   7:     if (vma && end_addr <= vma->vm_start)
   8:         vma = NULL;
   9:     return vma;
  10: }

get_unmapped_area finds a hole of at least the requested length between existing memory regions, to serve as a candidate location for a new region.

   1: unsigned long
   2: get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
   3:         unsigned long pgoff, unsigned long flags)
   4: {
   5:     unsigned long (*get_area)(struct file *, unsigned long,
   6:                   unsigned long, unsigned long, unsigned long);
   7:  
   8:     unsigned long error = arch_mmap_check(addr, len, flags);
   9:     if (error)
  10:         return error;
  11:  
  12:     /* Careful about overflows.. */
  13:     if (len > TASK_SIZE)
  14:         return -ENOMEM;
  15:  
  16:     get_area = current->mm->get_unmapped_area;
  17:     if (file && file->f_op && file->f_op->get_unmapped_area)
  18:         get_area = file->f_op->get_unmapped_area;
  19:     addr = get_area(file, addr, len, pgoff, flags);
  20:     if (IS_ERR_VALUE(addr))
  21:         return addr;
  22:  
  23:     if (addr > TASK_SIZE - len)
  24:         return -ENOMEM;
  25:     if (addr & ~PAGE_MASK)
  26:         return -EINVAL;
  27:  
  28:     return arch_rebalance_pgtables(addr, len);
  29: }

insert_vm_struct adds the given memory region to the given memory descriptor.

   1: /* Insert vm structure into process list sorted by address
   2:  * and into the inode's i_mmap tree.  If vm_file is non-NULL
   3:  * then i_mmap_mutex is taken here.
   4:  */
   5: int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
   6: {
   7:     struct vm_area_struct * __vma, * prev;
   8:     struct rb_node ** rb_link, * rb_parent;
   9:  
  10:     /*
  11:      * The vm_pgoff of a purely anonymous vma should be irrelevant
  12:      * until its first write fault, when page's anon_vma and index
  13:      * are set.  But now set the vm_pgoff it will almost certainly
  14:      * end up with (unless mremap moves it elsewhere before that
  15:      * first wfault), so /proc/pid/maps tells a consistent story.
  16:      *
  17:      * By setting it to reflect the virtual start address of the
  18:      * vma, merges and splits can happen in a seamless way, just
  19:      * using the existing file pgoff checks and manipulations.
  20:      * Similarly in do_mmap_pgoff and in do_brk.
  21:      */
  22:     if (!vma->vm_file) {
  23:         BUG_ON(vma->anon_vma);
  24:         vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
  25:     }
  26:     __vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
  27:     if (__vma && __vma->vm_start < vma->vm_end)
  28:         return -ENOMEM;
  29:     if ((vma->vm_flags & VM_ACCOUNT) &&
  30:          security_vm_enough_memory_mm(mm, vma_pages(vma)))
  31:         return -ENOMEM;
  32:     vma_link(mm, vma, prev, rb_link, rb_parent);
  33:     return 0;
  34: }

do_mmap allocates a linear address interval; the work is done by do_mmap_pgoff and mmap_region. (The __ALIGN_KERNEL macros from kernel.h are reproduced here, presumably because PAGE_ALIGN, used below, is built on top of them.)

/*
* 'kernel.h' contains some often-used function prototypes etc
*/
#define __ALIGN_KERNEL(x, a)        __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))

   1: static inline unsigned long do_mmap(struct file *file, unsigned long addr,
   2:     unsigned long len, unsigned long prot,
   3:     unsigned long flag, unsigned long offset)
   4: {
   5:     unsigned long ret = -EINVAL;
   6:     if ((offset + PAGE_ALIGN(len)) < offset)
   7:         goto out;
   8:     if (!(offset & ~PAGE_MASK))
   9:         ret = do_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
  10: out:
  11:     return ret;
  12: }
  13:  
  14: /*
  15:  * The caller must hold down_write(&current->mm->mmap_sem).
  16:  */
  17:  
  18: unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
  19:             unsigned long len, unsigned long prot,
  20:             unsigned long flags, unsigned long pgoff)
  21: {
  22:     struct mm_struct * mm = current->mm;
  23:     struct inode *inode;
  24:     vm_flags_t vm_flags;
  25:     int error;
  26:     unsigned long reqprot = prot;
  27:  
  28:     /*
  29:      * Does the application expect PROT_READ to imply PROT_EXEC?
  30:      *
  31:      * (the exception is when the underlying filesystem is noexec
  32:      *  mounted, in which case we dont add PROT_EXEC.)
  33:      */
  34:     if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
  35:         if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
  36:             prot |= PROT_EXEC;
  37:  
  38:     if (!len)
  39:         return -EINVAL;
  40:  
  41:     if (!(flags & MAP_FIXED))
  42:         addr = round_hint_to_min(addr);
  43:  
  44:     /* Careful about overflows.. */
  45:     len = PAGE_ALIGN(len);
  46:     if (!len)
  47:         return -ENOMEM;
  48:  
  49:     /* offset overflow? */
  50:     if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
  51:                return -EOVERFLOW;
  52:  
  53:     /* Too many mappings? */
  54:     if (mm->map_count > sysctl_max_map_count)
  55:         return -ENOMEM;
  56:  
  57:     /* Obtain the address to map to. we verify (or select) it and ensure
  58:      * that it represents a valid section of the address space.
  59:      */
  60:     addr = get_unmapped_area(file, addr, len, pgoff, flags);
  61:     if (addr & ~PAGE_MASK)
  62:         return addr;
  63:  
  64:     /* Do simple checking here so the lower-level routines won't have
  65:      * to. we assume access permissions have been handled by the open
  66:      * of the memory object, so we don't do any here.
  67:      */
  68:     vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
  69:             mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
  70:  
  71:     if (flags & MAP_LOCKED)
  72:         if (!can_do_mlock())
  73:             return -EPERM;
  74:  
  75:     /* mlock MCL_FUTURE? */
  76:     if (vm_flags & VM_LOCKED) {
  77:         unsigned long locked, lock_limit;
  78:         locked = len >> PAGE_SHIFT;
  79:         locked += mm->locked_vm;
  80:         lock_limit = rlimit(RLIMIT_MEMLOCK);
  81:         lock_limit >>= PAGE_SHIFT;
  82:         if (locked > lock_limit && !capable(CAP_IPC_LOCK))
  83:             return -EAGAIN;
  84:     }
  85:  
  86:     inode = file ? file->f_path.dentry->d_inode : NULL;
  87:  
  88:     if (file) {
  89:         switch (flags & MAP_TYPE) {
  90:         case MAP_SHARED:
  91:             if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
  92:                 return -EACCES;
  93:  
  94:             /*
  95:              * Make sure we don't allow writing to an append-only
  96:              * file..
  97:              */
  98:             if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
  99:                 return -EACCES;
 100:  
 101:             /*
 102:              * Make sure there are no mandatory locks on the file.
 103:              */
 104:             if (locks_verify_locked(inode))
 105:                 return -EAGAIN;
 106:  
 107:             vm_flags |= VM_SHARED | VM_MAYSHARE;
 108:             if (!(file->f_mode & FMODE_WRITE))
 109:                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
 110:  
 111:             /* fall through */
 112:         case MAP_PRIVATE:
 113:             if (!(file->f_mode & FMODE_READ))
 114:                 return -EACCES;
 115:             if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
 116:                 if (vm_flags & VM_EXEC)
 117:                     return -EPERM;
 118:                 vm_flags &= ~VM_MAYEXEC;
 119:             }
 120:  
 121:             if (!file->f_op || !file->f_op->mmap)
 122:                 return -ENODEV;
 123:             break;
 124:  
 125:         default:
 126:             return -EINVAL;
 127:         }
 128:     } else {
 129:         switch (flags & MAP_TYPE) {
 130:         case MAP_SHARED:
 131:             /*
 132:              * Ignore pgoff.
 133:              */
 134:             pgoff = 0;
 135:             vm_flags |= VM_SHARED | VM_MAYSHARE;
 136:             break;
 137:         case MAP_PRIVATE:
 138:             /*
 139:              * Set pgoff according to addr for anon_vma.
 140:              */
 141:             pgoff = addr >> PAGE_SHIFT;
 142:             break;
 143:         default:
 144:             return -EINVAL;
 145:         }
 146:     }
 147:  
 148:     error = security_file_mmap(file, reqprot, prot, flags, addr, 0);
 149:     if (error)
 150:         return error;
 151:  
 152:     return mmap_region(file, addr, len, flags, vm_flags, pgoff);
 153: }
 154: EXPORT_SYMBOL(do_mmap_pgoff);
 155:  
 156: unsigned long mmap_region(struct file *file, unsigned long addr,
 157:               unsigned long len, unsigned long flags,
 158:               vm_flags_t vm_flags, unsigned long pgoff)
 159: {
 160:     struct mm_struct *mm = current->mm;
 161:     struct vm_area_struct *vma, *prev;
 162:     int correct_wcount = 0;
 163:     int error;
 164:     struct rb_node **rb_link, *rb_parent;
 165:     unsigned long charged = 0;
 166:     struct inode *inode =  file ? file->f_path.dentry->d_inode : NULL;
 167:  
 168:     /* Clear old maps */
 169:     error = -ENOMEM;
 170: munmap_back:
 171:     vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 172:     if (vma && vma->vm_start < addr + len) {
 173:         if (do_munmap(mm, addr, len))
 174:             return -ENOMEM;
 175:         goto munmap_back;
 176:     }
 177:  
 178:     /* Check against address space limit. */
 179:     if (!may_expand_vm(mm, len >> PAGE_SHIFT))
 180:         return -ENOMEM;
 181:  
 182:     /*
 183:      * Set 'VM_NORESERVE' if we should not account for the
 184:      * memory use of this mapping.
 185:      */
 186:     if ((flags & MAP_NORESERVE)) {
 187:         /* We honor MAP_NORESERVE if allowed to overcommit */
 188:         if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
 189:             vm_flags |= VM_NORESERVE;
 190:  
 191:         /* hugetlb applies strict overcommit unless MAP_NORESERVE */
 192:         if (file && is_file_hugepages(file))
 193:             vm_flags |= VM_NORESERVE;
 194:     }
 195:  
 196:     /*
 197:      * Private writable mapping: check memory availability
 198:      */
 199:     if (accountable_mapping(file, vm_flags)) {
 200:         charged = len >> PAGE_SHIFT;
 201:         if (security_vm_enough_memory(charged))
 202:             return -ENOMEM;
 203:         vm_flags |= VM_ACCOUNT;
 204:     }
 205:  
 206:     /*
 207:      * Can we just expand an old mapping?
 208:      */
 209:     vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
 210:     if (vma)
 211:         goto out;
 212:  
 213:     /*
 214:      * Determine the object being mapped and call the appropriate
 215:      * specific mapper. the address has already been validated, but
 216:      * not unmapped, but the maps are removed from the list.
 217:      */
 218:     vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
 219:     if (!vma) {
 220:         error = -ENOMEM;
 221:         goto unacct_error;
 222:     }
 223:  
 224:     vma->vm_mm = mm;
 225:     vma->vm_start = addr;
 226:     vma->vm_end = addr + len;
 227:     vma->vm_flags = vm_flags;
 228:     vma->vm_page_prot = vm_get_page_prot(vm_flags);
 229:     vma->vm_pgoff = pgoff;
 230:     INIT_LIST_HEAD(&vma->anon_vma_chain);
 231:  
 232:     if (file) {
 233:         error = -EINVAL;
 234:         if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
 235:             goto free_vma;
 236:         if (vm_flags & VM_DENYWRITE) {
 237:             error = deny_write_access(file);
 238:             if (error)
 239:                 goto free_vma;
 240:             correct_wcount = 1;
 241:         }
 242:         vma->vm_file = file;
 243:         get_file(file);
 244:         error = file->f_op->mmap(file, vma);
 245:         if (error)
 246:             goto unmap_and_free_vma;
 247:         if (vm_flags & VM_EXECUTABLE)
 248:             added_exe_file_vma(mm);
 249:  
 250:         /* Can addr have changed??
 251:          *
 252:          * Answer: Yes, several device drivers can do it in their
 253:          *         f_op->mmap method. -DaveM
 254:          */
 255:         addr = vma->vm_start;
 256:         pgoff = vma->vm_pgoff;
 257:         vm_flags = vma->vm_flags;
 258:     } else if (vm_flags & VM_SHARED) {
 259:         error = shmem_zero_setup(vma);
 260:         if (error)
 261:             goto free_vma;
 262:     }
 263:  
 264:     if (vma_wants_writenotify(vma)) {
 265:         pgprot_t pprot = vma->vm_page_prot;
 266:  
 267:         /* Can vma->vm_page_prot have changed??
 268:          *
 269:          * Answer: Yes, drivers may have changed it in their
 270:          *         f_op->mmap method.
 271:          *
 272:          * Ensures that vmas marked as uncached stay that way.
 273:          */
 274:         vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
 275:         if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
 276:             vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 277:     }
 278:  
 279:     vma_link(mm, vma, prev, rb_link, rb_parent);
 280:     file = vma->vm_file;
 281:  
 282:     /* Once vma denies write, undo our temporary denial count */
 283:     if (correct_wcount)
 284:         atomic_inc(&inode->i_writecount);
 285: out:
 286:     perf_event_mmap(vma);
 287:  
 288:     mm->total_vm += len >> PAGE_SHIFT;
 289:     vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 290:     if (vm_flags & VM_LOCKED) {
 291:         if (!mlock_vma_pages_range(vma, addr, addr + len))
 292:             mm->locked_vm += (len >> PAGE_SHIFT);
 293:     } else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 294:         make_pages_present(addr, addr + len);
 295:     return addr;
 296:  
 297: unmap_and_free_vma:
 298:     if (correct_wcount)
 299:         atomic_inc(&inode->i_writecount);
 300:     vma->vm_file = NULL;
 301:     fput(file);
 302:  
 303:     /* Undo any partial mapping done by a device driver. */
 304:     unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 305:     charged = 0;
 306: free_vma:
 307:     kmem_cache_free(vm_area_cachep, vma);
 308: unacct_error:
 309:     if (charged)
 310:         vm_unacct_memory(charged);
 311:     return error;
 312: }

do_munmap releases a memory region (or part of one).

   1: /* Munmap is split into 2 main parts -- this part which finds
   2:  * what needs doing, and the areas themselves, which do the
   3:  * work.  This now handles partial unmappings.
   4:  * Jeremy Fitzhardinge <jeremy@goop.org>
   5:  */
   6: int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
   7: {
   8:     unsigned long end;
   9:     struct vm_area_struct *vma, *prev, *last;
  10:  
  11:     if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
  12:         return -EINVAL;
  13:  
  14:     if ((len = PAGE_ALIGN(len)) == 0)
  15:         return -EINVAL;
  16:  
  17:     /* Find the first overlapping VMA */
  18:     vma = find_vma(mm, start);
  19:     if (!vma)
  20:         return 0;
  21:     prev = vma->vm_prev;
  22:     /* we have  start < vma->vm_end  */
  23:  
  24:     /* if it doesn't overlap, we have nothing.. */
  25:     end = start + len;
  26:     if (vma->vm_start >= end)
  27:         return 0;
  28:  
  29:     /*
  30:      * If we need to split any vma, do it now to save pain later.
  31:      *
  32:      * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
  33:      * unmapped vm_area_struct will remain in use: so lower split_vma
  34:      * places tmp vma above, and higher split_vma places tmp vma below.
  35:      */
  36:     if (start > vma->vm_start) {
  37:         int error;
  38:  
  39:         /*
  40:          * Make sure that map_count on return from munmap() will
  41:          * not exceed its limit; but let map_count go just above
  42:          * its limit temporarily, to help free resources as expected.
  43:          */
  44:         if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
  45:             return -ENOMEM;
  46:  
  47:         error = __split_vma(mm, vma, start, 0);
  48:         if (error)
  49:             return error;
  50:         prev = vma;
  51:     }
  52:  
  53:     /* Does it split the last one? */
  54:     last = find_vma(mm, end);
  55:     if (last && end > last->vm_start) {
  56:         int error = __split_vma(mm, last, end, 1);
  57:         if (error)
  58:             return error;
  59:     }
  60:     vma = prev? prev->vm_next: mm->mmap;
  61:  
  62:     /*
  63:      * unlock any mlock()ed ranges before detaching vmas
  64:      */
  65:     if (mm->locked_vm) {
  66:         struct vm_area_struct *tmp = vma;
  67:         while (tmp && tmp->vm_start < end) {
  68:             if (tmp->vm_flags & VM_LOCKED) {
  69:                 mm->locked_vm -= vma_pages(tmp);
  70:                 munlock_vma_pages_all(tmp);
  71:             }
  72:             tmp = tmp->vm_next;
  73:         }
  74:     }
  75:  
  76:     /*
  77:      * Remove the vma's, and unmap the actual pages
  78:      */
  79:     detach_vmas_to_be_unmapped(mm, vma, prev, end);
  80:     unmap_region(mm, vma, prev, start, end);
  81:  
  82:     /* Fix up all other VM information */
  83:     remove_vma_list(mm, vma);
  84:  
  85:     return 0;
  86: }
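
From user space these kernel paths are reached through the mmap() and munmap() system calls. A minimal sketch (mine; "some_file" is a placeholder path):

/* Minimal sketch: mmap() goes through get_unmapped_area(), do_mmap_pgoff()
 * and mmap_region(); munmap() goes through do_munmap(). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("some_file", O_RDONLY);          /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Creates a file-backed VMA; nothing is read from disk yet. */
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* First access faults the page in through the file's vm_ops->fault. */
    printf("first byte: %c\n", p[0]);

    munmap(p, (size_t)st.st_size);   /* tears the VMA down via do_munmap() */
    close(fd);
    return 0;
}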

Page Fault

   1: /*
   2:  * This routine handles page faults.  It determines the address,
   3:  * and the problem, and then passes it off to one of the appropriate
   4:  * routines.
   5:  */
   6: dotraplinkage void __kprobes
   7: do_page_fault(struct pt_regs *regs, unsigned long error_code)
   8: {
   9:     struct vm_area_struct *vma;
  10:     struct task_struct *tsk;
  11:     unsigned long address;
  12:     struct mm_struct *mm;
  13:     int fault;
  14:     int write = error_code & PF_WRITE;
  15:     unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
  16:                     (write ? FAULT_FLAG_WRITE : 0);
  17:  
  18:     tsk = current;
  19:     mm = tsk->mm;
  20:  
  21:     /* Get the faulting address: */
  22:     address = read_cr2();
  23:  
  24:     /*
  25:      * Detect and handle instructions that would cause a page fault for
  26:      * both a tracked kernel page and a userspace page.
  27:      */
  28:     if (kmemcheck_active(regs))
  29:         kmemcheck_hide(regs);
  30:     prefetchw(&mm->mmap_sem);
  31:  
  32:     if (unlikely(kmmio_fault(regs, address)))
  33:         return;
  34:  
  35:     /*
  36:      * We fault-in kernel-space virtual memory on-demand. The
  37:      * 'reference' page table is init_mm.pgd.
  38:      *
  39:      * NOTE! We MUST NOT take any locks for this case. We may
  40:      * be in an interrupt or a critical region, and should
  41:      * only copy the information from the master page table,
  42:      * nothing more.
  43:      *
  44:      * This verifies that the fault happens in kernel space
  45:      * (error_code & 4) == 0, and that the fault was not a
  46:      * protection error (error_code & 9) == 0.
  47:      */
  48:     if (unlikely(fault_in_kernel_space(address))) {
  49:         if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
  50:             if (vmalloc_fault(address) >= 0)
  51:                 return;
  52:  
  53:             if (kmemcheck_fault(regs, address, error_code))
  54:                 return;
  55:         }
  56:  
  57:         /* Can handle a stale RO->RW TLB: */
  58:         if (spurious_fault(error_code, address))
  59:             return;
  60:  
  61:         /* kprobes don't want to hook the spurious faults: */
  62:         if (notify_page_fault(regs))
  63:             return;
  64:         /*
  65:          * Don't take the mm semaphore here. If we fixup a prefetch
  66:          * fault we could otherwise deadlock:
  67:          */
  68:         bad_area_nosemaphore(regs, error_code, address);
  69:  
  70:         return;
  71:     }
  72:  
  73:     /* kprobes don't want to hook the spurious faults: */
  74:     if (unlikely(notify_page_fault(regs)))
  75:         return;
  76:     /*
  77:      * It's safe to allow irq's after cr2 has been saved and the
  78:      * vmalloc fault has been handled.
  79:      *
  80:      * User-mode registers count as a user access even for any
  81:      * potential system fault or CPU buglet:
  82:      */
  83:     if (user_mode_vm(regs)) {
  84:         local_irq_enable();
  85:         error_code |= PF_USER;
  86:     } else {
  87:         if (regs->flags & X86_EFLAGS_IF)
  88:             local_irq_enable();
  89:     }
  90:  
  91:     if (unlikely(error_code & PF_RSVD))
  92:         pgtable_bad(regs, error_code, address);
  93:  
  94:     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
  95:  
  96:     /*
  97:      * If we're in an interrupt, have no user context or are running
  98:      * in an atomic region then we must not take the fault:
  99:      */
 100:     if (unlikely(in_atomic() || !mm)) {
 101:         bad_area_nosemaphore(regs, error_code, address);
 102:         return;
 103:     }
 104:  
 105:     /*
 106:      * When running in the kernel we expect faults to occur only to
 107:      * addresses in user space.  All other faults represent errors in
 108:      * the kernel and should generate an OOPS.  Unfortunately, in the
 109:      * case of an erroneous fault occurring in a code path which already
 110:      * holds mmap_sem we will deadlock attempting to validate the fault
 111:      * against the address space.  Luckily the kernel only validly
 112:      * references user space from well defined areas of code, which are
 113:      * listed in the exceptions table.
 114:      *
 115:      * As the vast majority of faults will be valid we will only perform
 116:      * the source reference check when there is a possibility of a
 117:      * deadlock. Attempt to lock the address space, if we cannot we then
 118:      * validate the source. If this is invalid we can skip the address
 119:      * space check, thus avoiding the deadlock:
 120:      */
 121:     if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
 122:         if ((error_code & PF_USER) == 0 &&
 123:             !search_exception_tables(regs->ip)) {
 124:             bad_area_nosemaphore(regs, error_code, address);
 125:             return;
 126:         }
 127: retry:
 128:         down_read(&mm->mmap_sem);
 129:     } else {
 130:         /*
 131:          * The above down_read_trylock() might have succeeded in
 132:          * which case we'll have missed the might_sleep() from
 133:          * down_read():
 134:          */
 135:         might_sleep();
 136:     }
 137:  
 138:     vma = find_vma(mm, address);
 139:     if (unlikely(!vma)) {
 140:         bad_area(regs, error_code, address);
 141:         return;
 142:     }
 143:     if (likely(vma->vm_start <= address))
 144:         goto good_area;
 145:     if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
 146:         bad_area(regs, error_code, address);
 147:         return;
 148:     }
 149:     if (error_code & PF_USER) {
 150:         /*
 151:          * Accessing the stack below %sp is always a bug.
 152:          * The large cushion allows instructions like enter
 153:          * and pusha to work. ("enter $65535, $31" pushes
 154:          * 32 pointers and then decrements %sp by 65535.)
 155:          */
 156:         if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
 157:             bad_area(regs, error_code, address);
 158:             return;
 159:         }
 160:     }
 161:     if (unlikely(expand_stack(vma, address))) {
 162:         bad_area(regs, error_code, address);
 163:         return;
 164:     }
 165:  
 166:     /*
 167:      * Ok, we have a good vm_area for this memory access, so
 168:      * we can handle it..
 169:      */
 170: good_area:
 171:     if (unlikely(access_error(error_code, vma))) {
 172:         bad_area_access_error(regs, error_code, address);
 173:         return;
 174:     }
 175:  
 176:     /*
 177:      * If for any reason at all we couldn't handle the fault,
 178:      * make sure we exit gracefully rather than endlessly redo
 179:      * the fault:
 180:      */
 181:     fault = handle_mm_fault(mm, vma, address, flags);
 182:  
 183:     if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
 184:         if (mm_fault_error(regs, error_code, address, fault))
 185:             return;
 186:     }
 187:  
 188:     /*
 189:      * Major/minor page fault accounting is only done on the
 190:      * initial attempt. If we go through a retry, it is extremely
 191:      * likely that the page will be found in page cache at that point.
 192:      */
 193:     if (flags & FAULT_FLAG_ALLOW_RETRY) {
 194:         if (fault & VM_FAULT_MAJOR) {
 195:             tsk->maj_flt++;
 196:             perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
 197:                       regs, address);
 198:         } else {
 199:             tsk->min_flt++;
 200:             perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
 201:                       regs, address);
 202:         }
 203:         if (fault & VM_FAULT_RETRY) {
 204:             /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
 205:              * of starvation. */
 206:             flags &= ~FAULT_FLAG_ALLOW_RETRY;
 207:             goto retry;
 208:         }
 209:     }
 210:  
 211:     check_v8086_mode(regs, address, tsk);
 212:  
 213:     up_read(&mm->mmap_sem);
 214: }

The linear address that caused the page fault is saved in the CR2 register:

/* Get the faulting address: */
    address = read_cr2();
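
From user space, the address that the kernel reads from CR2 is delivered to the faulting process in siginfo_t.si_addr when SIGSEGV is raised. A minimal sketch (mine, assumes Linux/glibc; printf in a signal handler is tolerable only for a demo):

/* Minimal sketch: the CR2 value surfaces as info->si_addr in SIGSEGV. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    printf("SIGSEGV at address %p\n", info->si_addr);
    _exit(0);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    volatile char *bad = (char *)0x10;   /* normally not covered by any VMA */
    *bad = 1;                            /* page fault -> bad_area -> SIGSEGV */
    return 1;
}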

Demand Paging

   1: /*
   2:  * By the time we get here, we already hold the mm semaphore
   3:  */
   4: int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
   5:         unsigned long address, unsigned int flags)
   6: {
   7:     pgd_t *pgd;
   8:     pud_t *pud;
   9:     pmd_t *pmd;
  10:     pte_t *pte;
  11:  
  12:     __set_current_state(TASK_RUNNING);
  13:  
  14:     count_vm_event(PGFAULT);
  15:     mem_cgroup_count_vm_event(mm, PGFAULT);
  16:  
  17:     /* do counter updates before entering really critical section. */
  18:     check_sync_rss_stat(current);
  19:  
  20:     if (unlikely(is_vm_hugetlb_page(vma)))
  21:         return hugetlb_fault(mm, vma, address, flags);
  22:  
  23: retry:
  24:     pgd = pgd_offset(mm, address);
  25:     pud = pud_alloc(mm, pgd, address);
  26:     if (!pud)
  27:         return VM_FAULT_OOM;
  28:     pmd = pmd_alloc(mm, pud, address);
  29:     if (!pmd)
  30:         return VM_FAULT_OOM;
  31:     if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
  32:         if (!vma->vm_ops)
  33:             return do_huge_pmd_anonymous_page(mm, vma, address,
  34:                               pmd, flags);
  35:     } else {
  36:         pmd_t orig_pmd = *pmd;
  37:         int ret;
  38:  
  39:         barrier();
  40:         if (pmd_trans_huge(orig_pmd)) {
  41:             if (flags & FAULT_FLAG_WRITE &&
  42:                 !pmd_write(orig_pmd) &&
  43:                 !pmd_trans_splitting(orig_pmd)) {
  44:                 ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
  45:                               orig_pmd);
  46:                 /*
  47:                  * If COW results in an oom, the huge pmd will
  48:                  * have been split, so retry the fault on the
  49:                  * pte for a smaller charge.
  50:                  */
  51:                 if (unlikely(ret & VM_FAULT_OOM))
  52:                     goto retry;
  53:                 return ret;
  54:             }
  55:             return 0;
  56:         }
  57:     }
  58:  
  59:     /*
  60:      * Use __pte_alloc instead of pte_alloc_map, because we can't
  61:      * run pte_offset_map on the pmd, if an huge pmd could
  62:      * materialize from under us from a different thread.
  63:      */
  64:     if (unlikely(pmd_none(*pmd)) && __pte_alloc(mm, vma, pmd, address))
  65:         return VM_FAULT_OOM;
  66:     /* if an huge pmd materialized from under us just retry later */
  67:     if (unlikely(pmd_trans_huge(*pmd)))
  68:         return 0;
  69:     /*
  70:      * A regular pmd is established and it can't morph into a huge pmd
  71:      * from under us anymore at this point because we hold the mmap_sem
  72:      * read mode and khugepaged takes it in write mode. So now it's
  73:      * safe to run pte_offset_map().
  74:      */
  75:     pte = pte_offset_map(pmd, address);
  76:  
  77:     return handle_pte_fault(mm, vma, address, pte, pmd, flags);
  78: }

 

   1: /*
   2:  * These routines also need to handle stuff like marking pages dirty
   3:  * and/or accessed for architectures that don't do it in hardware (most
   4:  * RISC architectures).  The early dirtying is also good on the i386.
   5:  *
   6:  * There is also a hook called "update_mmu_cache()" that architectures
   7:  * with external mmu caches can use to update those (ie the Sparc or
   8:  * PowerPC hashed page tables that act as extended TLBs).
   9:  *
  10:  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  11:  * but allow concurrent faults), and pte mapped but not yet locked.
  12:  * We return with mmap_sem still held, but pte unmapped and unlocked.
  13:  */
  14: int handle_pte_fault(struct mm_struct *mm,
  15:              struct vm_area_struct *vma, unsigned long address,
  16:              pte_t *pte, pmd_t *pmd, unsigned int flags)
  17: {
  18:     pte_t entry;
  19:     spinlock_t *ptl;
  20:  
  21:     entry = *pte;
  22:     if (!pte_present(entry)) {
  23:         if (pte_none(entry)) {
  24:             if (vma->vm_ops) {
  25:                 if (likely(vma->vm_ops->fault))
  26:                     return do_linear_fault(mm, vma, address,
  27:                         pte, pmd, flags, entry);
  28:             }
  29:             return do_anonymous_page(mm, vma, address,
  30:                          pte, pmd, flags);
  31:         }
  32:         if (pte_file(entry))
  33:             return do_nonlinear_fault(mm, vma, address,
  34:                     pte, pmd, flags, entry);
  35:         return do_swap_page(mm, vma, address,
  36:                     pte, pmd, flags, entry);
  37:     }
  38:  
  39:     ptl = pte_lockptr(mm, pmd);
  40:     spin_lock(ptl);
  41:     if (unlikely(!pte_same(*pte, entry)))
  42:         goto unlock;
  43:     if (flags & FAULT_FLAG_WRITE) {
  44:         if (!pte_write(entry))
  45:             return do_wp_page(mm, vma, address,
  46:                     pte, pmd, ptl, entry);
  47:         entry = pte_mkdirty(entry);
  48:     }
  49:     entry = pte_mkyoung(entry);
  50:     if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
  51:         update_mmu_cache(vma, address, pte);
  52:     } else {
  53:         /*
  54:          * This is needed only for protection faults but the arch code
  55:          * is not yet telling us if this is a protection fault or not.
  56:          * This still avoids useless tlb flushes for .text page faults
  57:          * with threads.
  58:          */
  59:         if (flags & FAULT_FLAG_WRITE)
  60:             flush_tlb_fix_spurious_fault(vma, address);
  61:     }
  62: unlock:
  63:     pte_unmap_unlock(pte, ptl);
  64:     return 0;
  65: }

entry = *pte;
if (!pte_present(entry)) {
    if (pte_none(entry)) {
        if (vma->vm_ops) {
            if (likely(vma->vm_ops->fault))
                return do_linear_fault(mm, vma, address,
                    pte, pmd, flags, entry);
        }
        return do_anonymous_page(mm, vma, address,
                     pte, pmd, flags);
    }
    if (pte_file(entry))
        return do_nonlinear_fault(mm, vma, address,
                pte, pmd, flags, entry);
    return do_swap_page(mm, vma, address,
                pte, pmd, flags, entry);
}

If the pte is none, the page has never been materialized and this is demand paging: for a region backed by a file (vma->vm_ops->fault is set) do_linear_fault is called, otherwise do_anonymous_page supplies a zeroed page. If the pte is marked as a file pte (a non-linear mapping created with remap_file_pages), do_nonlinear_fault handles it. (To Be Discussed!)

If the pte exists but the page is not present, the page has been swapped out and do_swap_page brings it back in.
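
Demand paging is visible from user space as minor faults. A minimal sketch (mine) that counts them with getrusage() while anonymous pages are touched for the first time:

/* Minimal sketch: each first touch of an anonymous page shows up as a
 * minor fault (do_anonymous_page), countable via getrusage(). */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t page   = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = 256;
    char *p = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minor_faults();
    for (size_t i = 0; i < npages; i++)
        p[i * page] = 1;                 /* one write per page */
    long after = minor_faults();

    printf("minor faults while touching %zu pages: %ld\n", npages, after - before);
    munmap(p, npages * page);
    return 0;
}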

Copy on Write

Modern Unix kernels, including Linux, follow a more efficient approach called Copy On Write (COW). The idea is quite simple: instead of duplicating page frames, they are shared between the parent and the child process. However, as long as they are shared, they cannot be modified. Whenever the parent or the child process attempts to write into a shared page frame, an exception occurs. At this point, the kernel duplicates the page into a new page frame that it marks as writable. The original page frame remains write-protected: when the other process tries to write into it, the kernel checks whether the writing process is the only owner of the page frame; in such a case, it makes the page frame writable for the process.

 

[Memory handling in Linux is essentially driven by exception handlers:

Page Fault: prepare the required page on demand;

Copy-on-Write: parent and child share pages that are marked read-only; only an attempted write triggers the COW mechanism, and only then is a separate page created.]

 

[In fact the design of the Linux kernel resembles object-oriented thinking, except that the kernel is driven by exceptions: inside an exception handler, different implementations are invoked depending on the situation (usually on various flags).]

[At the level of detail, function-pointer members inside a struct are much like a C++ vtable and can be used to implement polymorphism; see the sketch below.]
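
A sketch of that analogy using vm_operations_struct, with the interface of kernels of this era; mydrv_lookup_page and the surrounding driver are hypothetical:

/* Sketch of the "vtable" analogy: a driver fills in vm_operations_struct and
 * the page-fault handler dispatches through vma->vm_ops->fault. */
static int mydrv_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct page *page = mydrv_lookup_page(vma->vm_private_data, vmf->pgoff);
    if (!page)
        return VM_FAULT_SIGBUS;
    get_page(page);          /* take a reference for the page tables */
    vmf->page = page;        /* the fault path maps it into the VMA */
    return 0;
}

static const struct vm_operations_struct mydrv_vm_ops = {
    .fault = mydrv_fault,    /* "virtual method" called on first access */
};

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
    vma->vm_ops = &mydrv_vm_ops;             /* install the "vtable" */
    vma->vm_private_data = file->private_data;
    return 0;
}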

 

When a child process is created, the parent's and child's page-table entries are made to point to the same page frames (struct page) and are marked write-protected, so that any write to such a page frame triggers an exception.

When the operating system catches the exception, it checks whether the page is still shared by several processes. If so, it copies the page, marks the copy writable and hands it to the process that attempted the write. If the faulting process is by then the only user of the page, the page itself is simply made writable.
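
A minimal user-space sketch of COW in action (mine, not from the article): after fork() the child's write receives a private copy, so the parent's data is untouched:

/* Minimal sketch: parent and child share the page read-only after fork();
 * the child's write triggers a COW fault and gets a private copy. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(p, "parent data");

    pid_t pid = fork();
    if (pid == 0) {                       /* child */
        strcpy(p, "child data");          /* write fault -> COW copy */
        printf("child sees:  %s\n", p);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", p);       /* still "parent data" */
    munmap(p, 4096);
    return 0;
}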

 

brk: adjusting the size of the process heap

   1: SYSCALL_DEFINE1(brk, unsigned long, brk)
   2: {
   3:     unsigned long rlim, retval;
   4:     unsigned long newbrk, oldbrk;
   5:     struct mm_struct *mm = current->mm;
   6:     unsigned long min_brk;
   7:  
   8:     down_write(&mm->mmap_sem);
   9:  
  10: #ifdef CONFIG_COMPAT_BRK
  11:     /*
  12:      * CONFIG_COMPAT_BRK can still be overridden by setting
  13:      * randomize_va_space to 2, which will still cause mm->start_brk
  14:      * to be arbitrarily shifted
  15:      */
  16:     if (current->brk_randomized)
  17:         min_brk = mm->start_brk;
  18:     else
  19:         min_brk = mm->end_data;
  20: #else
  21:     min_brk = mm->start_brk;
  22: #endif
  23:     if (brk < min_brk)
  24:         goto out;
  25:  
  26:     /*
  27:      * Check against rlimit here. If this check is done later after the test
  28:      * of oldbrk with newbrk then it can escape the test and let the data
  29:      * segment grow beyond its set limit the in case where the limit is
  30:      * not page aligned -Ram Gupta
  31:      */
  32:     rlim = rlimit(RLIMIT_DATA);
  33:     if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
  34:             (mm->end_data - mm->start_data) > rlim)
  35:         goto out;
  36:  
  37:     newbrk = PAGE_ALIGN(brk);
  38:     oldbrk = PAGE_ALIGN(mm->brk);
  39:     if (oldbrk == newbrk)
  40:         goto set_brk;
  41:  
  42:     /* Always allow shrinking brk. */
  43:     if (brk <= mm->brk) {
  44:         if (!do_munmap(mm, newbrk, oldbrk-newbrk))
  45:             goto set_brk;
  46:         goto out;
  47:     }
  48:  
  49:     /* Check against existing mmap mappings. */
  50:     if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
  51:         goto out;
  52:  
  53:     /* Ok, looks good - let it rip. */
  54:     if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
  55:         goto out;
  56: set_brk:
  57:     mm->brk = brk;
  58: out:
  59:     retval = mm->brk;
  60:     up_write(&mm->mmap_sem);
  61:     return retval;
  62: }
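
From user space the heap is moved with brk()/sbrk(), which end up in the sys_brk path above. A minimal sketch (mine, assumes glibc):

/* Minimal sketch: sbrk()/brk() move mm->brk, growing or shrinking the heap
 * region; malloc() uses this (or mmap) underneath. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *old_brk = sbrk(0);              /* current mm->brk */
    printf("brk before:          %p\n", old_brk);

    if (sbrk(64 * 1024) == (void *)-1) {  /* grow the heap by 64 KiB */
        perror("sbrk");
        return 1;
    }
    printf("brk after growing:   %p\n", sbrk(0));

    brk(old_brk);                         /* shrink back (do_munmap path) */
    printf("brk after shrinking: %p\n", sbrk(0));
    return 0;
}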