Linux Kernel Page Swap-Out Explained

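The kswapd thread is mainly responsible for swapping pages out periodically. The rest of this article walks through its implementation.

When kswapd is initialized, the page_cluster variable is set according to the amount of physical memory. It controls how much is read ahead on swap-in:
instead of reading only the single page that was asked for, a whole cluster of pages is read in one go (the code later in this article uses 1 << page_cluster as the cluster size), which speeds things up thanks to locality of reference.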
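The relationship between memory size, page_cluster and the cluster size can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the 16 MB / 32 MB thresholds are assumptions chosen for the example, not values taken from this article.

```c
#include <assert.h>

/* Hypothetical helper: pick the readahead exponent from the amount of RAM.
 * The thresholds below are assumptions made for this illustration. */
int pick_page_cluster(unsigned long megabytes)
{
    if (megabytes < 16)
        return 2;
    if (megabytes < 32)
        return 3;
    return 4;
}

/* page_cluster is an exponent: one cluster holds 1 << page_cluster pages. */
unsigned long cluster_pages(int page_cluster)
{
    assert(page_cluster >= 0 && page_cluster < 16);
    return 1UL << page_cluster;
}
```

With these assumed thresholds, a small machine reads 4 pages per cluster and a larger one reads 16.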
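kswapd is a kernel thread: it shares the kernel's address space and is created with kernel_thread().
kswapd first calls inactive_shortage() to check whether the system as a whole is short of physical pages.
The floor the system should stay above is given by freepages.high (the desired number of free pages) plus inactive_target (the desired number of inactive pages).
The pages actually available come from three sources:
free pages (immediately allocatable, taken from every zone), counted by nr_free_pages();
inactive clean pages (in effect allocatable, but their content is still kept, for example in the swap cache; keeping more of these pages around reduces reads from the swap device and so improves speed), counted by nr_inactive_clean_pages();
inactive dirty pages (which must be written to the swap device before they can be allocated), recorded in nr_inactive_dirty_pages.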
int inactive_shortage(void)
{
    int shortage = 0;

    /* What the system should keep available: freepages.high + inactive_target. */
    shortage += freepages.high;
    shortage += inactive_target;
    /* What it actually has; a positive remainder means a shortage. */
    shortage -= nr_free_pages();
    shortage -= nr_inactive_clean_pages();
    shortage -= nr_inactive_dirty_pages;

    if (shortage > 0)
        return shortage;

    return 0;
}
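To see the arithmetic with concrete numbers, here is a small user-space model of the same calculation. The struct and function names are hypothetical; the kernel reads global counters instead of taking a snapshot as an argument.

```c
#include <assert.h>

/* Hypothetical snapshot of the counters that inactive_shortage() reads. */
struct vm_counters {
    int freepages_high;     /* desired number of free pages */
    int inactive_target;    /* desired number of inactive pages */
    int nr_free;            /* free pages */
    int nr_inactive_clean;  /* inactive clean pages */
    int nr_inactive_dirty;  /* inactive dirty pages */
};

/* Target minus what is available, clamped at zero. */
int inactive_shortage_model(const struct vm_counters *c)
{
    int shortage = c->freepages_high + c->inactive_target
                 - c->nr_free - c->nr_inactive_clean - c->nr_inactive_dirty;

    assert(c->nr_free >= 0);
    return shortage > 0 ? shortage : 0;
}
```

For example, with a target of 128 + 256 pages and only 100 + 50 + 30 pages available, the model reports a shortage of 204 pages.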
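Even if this check passes (that is, the number of available pages is above the floor), free_shortage() still has to be called to check whether any individual zone is severely short of pages.
It compares each zone's actual pages against that zone's watermark and, where a zone falls short, adds the difference to its return value.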
/*
 * Check if there are zones with a severe shortage of free pages,
 * or if all zones have a minor shortage.
 */
int free_shortage(void)
{
    pg_data_t *pgdat = pgdat_list;  /* list of memory nodes */
    int sum = 0;
    int freeable = nr_free_pages() + nr_inactive_clean_pages();  /* what we have */
    int freetarget = freepages.high + inactive_target / 3;       /* what we want */

    /* Are we low on free pages globally? If so, return the difference. */
    if (freeable < freetarget)
        return freetarget - freeable;

    /* If not, are we very low on any particular zone? */
    do {
        int i;
        for(i = 0; i < MAX_NR_ZONES; i++) {
            zone_t *zone = pgdat->node_zones+ i;  /* current zone */
            /* Are free + inactive clean pages below the zone's minimum? */
            if (zone->size && (zone->inactive_clean_pages +
                    zone->free_pages < zone->pages_min+1)) {
                /* + 1 to have overlap with alloc_pages() !! */
                sum += zone->pages_min + 1;
                sum -= zone->free_pages;
                sum -= zone->inactive_clean_pages;
            }
        }
        pgdat = pgdat->node_next;
    } while (pgdat);

    return sum;
}
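The two-stage check (global first, then per zone) can be tried out with the following user-space model. The names are hypothetical, and the zones are passed in as a plain array rather than walked through the node list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the zone fields used above. */
struct zone_model {
    int size;
    int free_pages;
    int inactive_clean_pages;
    int pages_min;
};

int free_shortage_model(int freeable, int freetarget,
                        const struct zone_model *zones, size_t nr_zones)
{
    int sum = 0;
    size_t i;

    assert(zones != NULL || nr_zones == 0);

    /* Global shortage wins: report it and skip the per-zone check. */
    if (freeable < freetarget)
        return freetarget - freeable;

    /* Otherwise add up how far each non-empty zone is below pages_min + 1. */
    for (i = 0; i < nr_zones; i++) {
        const struct zone_model *z = &zones[i];
        int have = z->free_pages + z->inactive_clean_pages;

        if (z->size && have < z->pages_min + 1)
            sum += z->pages_min + 1 - have;
    }
    return sum;
}
```

An empty zone (size 0) never contributes, which mirrors the zone->size test in the kernel code.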
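Once both checks have passed, refill_inactive_scan() is called to try to turn some active pages (those without user mappings) into inactive dirty pages.
Depending on priority, only part of the active list is scanned; the whole list is scanned only when priority is 0. For each page it checks whether the page was accessed recently: if so the page's age is increased, otherwise it is decreased
(a page is only considered for the inactive list once its age is 0). It then checks whether the age is 0 and whether any user process still maps the page. The count of a page is set to 1 when it is allocated,
incremented when the page is used as a read/write buffer, and incremented again each time a process maps it, so the test has to take into account whether the page is a buffer page. If age is 0 and there is no user mapping,
deactivate_page_nolock() is called, which sets the page's age to 0, clears its referenced flag and moves it from the active list to the inactive dirty list.
A page that is still active is moved to the other end of the active list (the head, since the scan takes pages from the tail).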
/**
 * refill_inactive_scan - scan the active list and find pages to deactivate
 * @priority: the priority at which to scan
 * @oneshot: exit after deactivating one page
 *
 * This function will scan a portion of the active list to find
 * unused pages, those pages will then be moved to the inactive list.
 *
 * How much of the list is scanned depends on priority; the whole
 * list is scanned only when priority is 0.
 */
int refill_inactive_scan(unsigned int priority, int oneshot)
{
    struct list_head * page_lru;
    struct page * page;
    int maxscan, page_active = 0;  /* maxscan limits how many pages are scanned */
    int ret = 0;

    /* Take the lock while messing with the list... */
    spin_lock(&pagemap_lru_lock);
    maxscan = nr_active_pages >> priority;
    while (maxscan-- > 0 && (page_lru = active_list.prev) != &active_list) {
        page = list_entry(page_lru, struct page, lru);

        /* Wrong page on list?! (list corruption, should not happen) */
        if (!PageActive(page)) {  /* every page on this list must be marked active */
            printk("VM: refill_inactive, wrong page on list.\n");
            list_del(page_lru);
            nr_active_pages--;
            continue;
        }

        /*
         * Do aging on the pages: age the page up if it was referenced,
         * down otherwise. An age of 0 means the page has not been
         * accessed for a long time.
         */
        if (PageTestandClearReferenced(page)) {
            age_page_up_nolock(page);
            page_active = 1;
        } else {
            age_page_down_ageonly(page);
            /*
             * Since we don't hold a reference on the page
             * ourselves, we have to do our test a bit more
             * strict then deactivate_page(). This is needed
             * since otherwise the system could hang shuffling
             * unfreeable pages from the active list to the
             * inactive_dirty list and back again...
             *
             * SUBTLE: we can have buffer pages with count 1.
             *
             * A page with buffers may have a count of up to 2, any
             * other page a count of 1; a higher count means someone
             * else (for example a user mapping) still uses the page,
             * so it must not be deactivated.
             */
            if (page->age == 0 && page_count(page) <=
                        (page->buffers ? 2 : 1)) {
                deactivate_page_nolock(page);
                page_active = 0;
            } else {
                page_active = 1;
            }
        }
        /*
         * If the page is still on the active list, move it
         * to the other end of the list. Otherwise it was
         * deactivated by age_page_down and we exit successfully.
         */
        if (page_active || PageActive(page)) {
            list_del(page_lru);  /* still active: requeue at the head of the list */
            list_add(page_lru, &active_list);
        } else {
            ret = 1;
            if (oneshot)  /* oneshot: stop after deactivating one page */
                break;
        }
    }
    spin_unlock(&pagemap_lru_lock);

    return ret;
}
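The aging rule applied to each page can be modelled in a few lines. This is a sketch under assumptions: the listing above does not show what age_page_up_nolock() and age_page_down_ageonly() do internally, so the increment of 3, the cap of 64 and the halving used below are assumed values for illustration, and all names are hypothetical.

```c
#include <assert.h>

/* Assumed aging parameters, not taken from the listing above. */
enum { MODEL_AGE_ADV = 3, MODEL_AGE_MAX = 64 };

struct page_model {
    int age;
    int referenced;   /* stands in for the referenced flag */
    int count;        /* reference count */
    int has_buffers;  /* non-zero if page->buffers would be set */
};

/* One scan step. Returns 1 if the page would be deactivated, 0 if it stays active. */
int scan_one_page(struct page_model *p)
{
    if (p->referenced) {
        p->referenced = 0;          /* test-and-clear */
        p->age += MODEL_AGE_ADV;    /* age up, with a ceiling */
        if (p->age > MODEL_AGE_MAX)
            p->age = MODEL_AGE_MAX;
        return 0;
    }
    p->age /= 2;                    /* age down (assumed: halving) */
    return p->age == 0 && p->count <= (p->has_buffers ? 2 : 1);
}

/* Number of pages one pass may look at for a given priority. */
int max_scan(int nr_active_pages, int priority)
{
    assert(priority >= 0 && priority <= 6);
    return nr_active_pages >> priority;
}
```

Under these assumptions, an unreferenced, unmapped page with age 5 survives two scans (age 2, then 1) and is deactivated on the third, while a page with one extra user is never deactivated regardless of its age.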
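The above is what kswapd does once it has found that both the system as a whole and the individual zones have enough physical pages. kswapd is an endless loop: after these steps it checks again whether pages are short globally or in any zone. If not, it calls interruptible_sleep_on_timeout() and goes to sleep so that the kernel is free to schedule other processes. When the timeout expires (HZ ticks, i.e. about one second; HZ is a kernel build-time constant), kswapd is woken up and repeats the cycle.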
 
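The second case: if the system as a whole or any zone turns out to be short of pages, do_try_to_free_pages() is called to try to free up some pages.
If pages are short, or if there are more inactive dirty pages than free pages and inactive clean pages together, page_launder() is called to try to launder inactive dirty pages so that they become immediately allocatable.
If pages are still short after page_launder(), the dentry and inode caches are shrunk. (As a rule, dentries and inodes are not freed right away when a file is closed; they are kept on LRU lists as a reserve, which is what gets reclaimed here.) Otherwise, if pages are no longer short, only kmem_cache_reap() is called to reclaim part of the slab caches.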
 
static int do_try_to_free_pages(unsigned int gfp_mask, int user)
{
    int ret = 0;

    /*
     * If pages are short, or if there are more inactive dirty pages
     * than free plus inactive clean pages, call page_launder() to try
     * to launder inactive dirty pages so that they become immediately
     * allocatable.
     */
    if (free_shortage() || nr_inactive_dirty_pages > nr_free_pages() +
            nr_inactive_clean_pages())
        ret += page_launder(gfp_mask, user);

    /*
     * If memory is still short:
     * If needed, we move pages from the active list
     * to the inactive list. We also "eat" pages from
     * the inode and dentry cache whenever we do this.
     *
     * Dentries and inodes are not freed as soon as a file is closed;
     * they are kept on LRU lists as a reserve.
     */
    if (free_shortage() || inactive_shortage()) {
        shrink_dcache_memory(6, gfp_mask);       /* shrink the dentry cache */
        shrink_icache_memory(6, gfp_mask);       /* shrink the inode cache */
        ret += refill_inactive(gfp_mask, user);  /* user: is a process waiting on us? */
    } else {
        /*
         * Otherwise just reclaim slab caches.
         */
        kmem_cache_reap(gfp_mask);
        ret = 1;
    }

    return ret;
}
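The branching in do_try_to_free_pages() is easier to follow as a pure decision function. The sketch below uses hypothetical names and takes the shortage values as arguments: one free-shortage value from before page_launder() would run, and the free and inactive shortage values from after it.

```c
#include <assert.h>

/* Hypothetical flags naming the steps that would be taken. */
enum {
    STEP_LAUNDER       = 1,  /* page_launder() */
    STEP_SHRINK_CACHES = 2,  /* shrink_dcache_memory() + shrink_icache_memory() */
    STEP_REFILL        = 4,  /* refill_inactive() */
    STEP_REAP_SLAB     = 8   /* kmem_cache_reap() only */
};

unsigned plan_reclaim(int free_short_before, int dirty, int free_pages, int clean,
                      int free_short_after, int inactive_short_after)
{
    unsigned steps = 0;

    assert(dirty >= 0 && free_pages >= 0 && clean >= 0);

    /* Launder if pages are short or dirty pages outnumber free + clean ones. */
    if (free_short_before || dirty > free_pages + clean)
        steps |= STEP_LAUNDER;

    /* Still short afterwards: shrink caches and refill the inactive lists. */
    if (free_short_after || inactive_short_after)
        steps |= STEP_SHRINK_CACHES | STEP_REFILL;
    else
        steps |= STEP_REAP_SLAB;

    return steps;
}
```

Note that laundering and slab reaping are not mutually exclusive: a system with many dirty pages but no shortage launders first and then only reaps slab caches.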
 
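That is the overall flow. Next comes page_launder(), which is called from do_try_to_free_pages().
Its job is to launder pages that are inactive and dirty.
It takes pages one by one from the inactive dirty list. A page on that list may still have been accessed in the meantime, so this is checked first; a page that was accessed is moved back to the active list.
If the page is still dirty, the first pass only moves it to the other end of the list and carries on. On the second pass (which only happens under a condition: free pages must still be short) the dirty bit is cleared and the write function supplied by the page's address_space is called to write the page out, for example to the swap device.
If the page is no longer dirty but is used as a buffer page, it is first taken off the inactive dirty list and try_to_free_buffers() is called, which on success drops the reference held by the buffers.
If that fails, the page goes back onto the inactive dirty list. Otherwise there are three cases: a page with no mapping is freed; a page that still has other users is moved to the active list; and a page that has a mapping but no remaining users is moved to the inactive clean list, from where it can be reclaimed. (Note: "freed" here does not yet put the page on the free list; that happens when page_cache_release() drops its count to 0.) If freeing a page means the system is no longer short, the loop stops and laundering ends.
A page that is clean and belongs to a mapping is moved to the inactive clean list.
After one pass, if pages are still short, a second pass follows.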
int page_launder(int gfp_mask, int sync)
{
    int launder_loop, maxscan, cleaned_pages, maxlaunder;
    int can_get_io_locks;
    struct list_head * page_lru;
    struct page * page;

    /*
     * We can only grab the IO locks (eg. for flushing dirty
     * buffers to disk) if __GFP_IO is set.
     */
    can_get_io_locks = gfp_mask & __GFP_IO;

    launder_loop = 0;
    maxlaunder = 0;
    cleaned_pages = 0;

dirty_page_rescan:
    spin_lock(&pagemap_lru_lock);
    maxscan = nr_inactive_dirty_pages;  /* bound the scan so no page is handled twice */
    /* Scan the inactive dirty list. */
    while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
                maxscan-- > 0) {
        page = list_entry(page_lru, struct page, lru);

        /* Wrong page on list?! (list corruption, should not happen) */
        if (!PageInactiveDirty(page)) {  /* the inactive-dirty flag must be set */
            printk("VM: page_launder, wrong page on list.\n");
            list_del(page_lru);  /* remove it from the list */
            nr_inactive_dirty_pages--;
            page->zone->inactive_dirty_pages--;
            continue;
        }

        /*
         * Page is or was in use? Move it to the active list.
         * (A page on the dirty list may still have been accessed.)
         */
        if (PageTestandClearReferenced(page) || page->age > 0 ||
                (!page->buffers && page_count(page) > 1) ||
                page_ramdisk(page)) {
            del_page_from_inactive_dirty_list(page);  /* off the inactive dirty list */
            add_page_to_active_list(page);            /* onto the active list */
            continue;
        }

        /*
         * The page is locked. IO in progress?
         * Move it to the back of the list.
         */
        if (TryLockPage(page)) {
            list_del(page_lru);
            list_add(page_lru, &inactive_dirty_list);
            continue;
        }

        /*
         * Dirty swap-cache page? Write it out if
         * last copy..
         */
        if (PageDirty(page)) {  /* the page is dirty */
            int (*writepage)(struct page *) = page->mapping->a_ops->writepage;
            int result;

            if (!writepage)  /* no write-out function provided: move to the active list */
                goto page_active;

            /* First time through? Move it to the back of the list */
            if (!launder_loop) {
                list_del(page_lru);
                list_add(page_lru, &inactive_dirty_list);
                UnlockPage(page);
                continue;
            }

            /* OK, do a physical asynchronous write to swap. */
            ClearPageDirty(page);  /* clear the dirty bit so the page is not written again */
            page_cache_get(page);  /* extra reference: kswapd uses the page while writing it out */
            spin_unlock(&pagemap_lru_lock);

            result = writepage(page);
            page_cache_release(page);  /* write issued: drop the extra reference */

            /* And re-start the thing.. */
            spin_lock(&pagemap_lru_lock);
            if (result != 1)  /* the write was issued: go on with the next page */
                continue;
            /* writepage refused to do anything */
            set_page_dirty(page);  /* mark the page dirty again */
            goto page_active;
        }

        /*
         * The page is not dirty but is used as a buffer page
         * for file reads and writes.
         */
        if (page->buffers) {
            int wait, clearedbuf;
            int freed_page = 0;
            /*
             * Since we might be doing disk IO, we have to
             * drop the spinlock and take an extra reference
             * on the page so it doesn't go away from under us.
             */
            del_page_from_inactive_dirty_list(page);  /* off the inactive dirty list */
            page_cache_get(page);                     /* kswapd's own reference */
            spin_unlock(&pagemap_lru_lock);

            /* Will we do (asynchronous) IO? */
            if (launder_loop && maxlaunder == 0 && sync)
                wait = 2;    /* Synchrounous IO */
            else if (launder_loop && maxlaunder-- > 0)
                wait = 1;    /* Async IO */
            else
                wait = 0;    /* No IO */

            /* Try to free the page buffers (drops the buffers' reference on success). */
            clearedbuf = try_to_free_buffers(page, wait);

            /*
             * Re-take the spinlock. Note that we cannot
             * unlock the page yet since we're still
             * accessing the page_struct here...
             */
            spin_lock(&pagemap_lru_lock);

            /* The buffers were not freed: back onto the inactive dirty list. */
            if (!clearedbuf) {
                add_page_to_inactive_dirty_list(page);

            /*
             * The page was only in the buffer cache, not in any file's
             * inode->i_mapping (superblocks, inode bitmaps and the like
             * live in such pages), so we have just freed a page.
             */
            } else if (!page->mapping) {
                atomic_dec(&buffermem_pages);
                freed_page = 1;
                cleaned_pages++;

            /*
             * The page is still in some file's inode->i_mapping and
             * has more users besides the cache and us (for example
             * several processes mapping the file): move it to the
             * active list.
             */
            } else if (page_count(page) > 2) {
                add_page_to_active_list(page);

            /*
             * OK, we "created" a freeable page: it is still in an
             * inode->i_mapping but nobody else is using it.
             */
            } else /* page->mapping && page_count(page) == 2 */ {
                add_page_to_inactive_clean_list(page);
                cleaned_pages++;
            }

            /*
             * Unlock the page and drop the extra reference.
             * We can only do it here because we ar accessing
             * the page struct above.
             */
            UnlockPage(page);
            page_cache_release(page);  /* if the count drops to 0 the page goes back to the free lists */

            /*
             * If we're freeing buffer cache pages, stop when
             * we've got enough free memory.
             */
            if (freed_page && !free_shortage())
                break;
            continue;
        /* The page is no longer dirty and belongs to an address_space. */
        } else if (page->mapping && !PageDirty(page)) {
            /*
             * If a page had an extra reference in
             * deactivate_page(), we will find it here.
             * Now the page is really freeable, so we
             * move it to the inactive_clean list.
             */
            del_page_from_inactive_dirty_list(page);  /* move to the inactive clean list */
            add_page_to_inactive_clean_list(page);
            UnlockPage(page);
            cleaned_pages++;
        } else {
page_active:
            /*
             * OK, we don't know what to do with the page.
             * It's no use keeping it here, so we move it to
             * the active list.
             */
            del_page_from_inactive_dirty_list(page);
            add_page_to_active_list(page);
            UnlockPage(page);
        }
    }
    spin_unlock(&pagemap_lru_lock);

    /*
     * If we don't have enough free pages, we loop back once
     * to queue the dirty pages for writeout. When we were called
     * by a user process (that /needs/ a free page) and we didn't
     * free anything yet, we wait synchronously on the writeout of
     * MAX_SYNC_LAUNDER pages.
     *
     * We also wake up bdflush, since bdflush should, under most
     * loads, flush out the dirty pages before we have to wait on
     * IO.
     */
    /* Memory still short: scan the list a second time. */
    if (can_get_io_locks && !launder_loop && free_shortage()) {
        launder_loop = 1;
        /* If we cleaned pages, never do synchronous IO. */
        if (cleaned_pages)
            sync = 0;
        /* We only do a few "out of order" flushes. */
        maxlaunder = MAX_LAUNDER;
        /* Kflushd takes care of the rest. */
        wakeup_bdflush(0);
        goto dirty_page_rescan;
    }

    /* Return the number of pages moved to the inactive_clean list. */
    return cleaned_pages;
}
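Two small decisions inside page_launder() are worth isolating: what happens to a buffer page after try_to_free_buffers(), and which IO mode is chosen. The following user-space sketch mirrors the conditions in the listing above; the function and enum names are hypothetical.

```c
#include <assert.h>

/* Hypothetical names for where a clean buffer page ends up. */
enum buf_outcome {
    BUF_BACK_TO_INACTIVE_DIRTY,  /* buffers could not be freed */
    BUF_FREED,                   /* page was only in the buffer cache */
    BUF_TO_ACTIVE,               /* more users besides the cache and us */
    BUF_TO_INACTIVE_CLEAN        /* mapped, but nobody else uses it */
};

enum buf_outcome classify_buffer_page(int clearedbuf, int has_mapping, int count)
{
    assert(count >= 1);  /* the caller itself holds a reference */
    if (!clearedbuf)
        return BUF_BACK_TO_INACTIVE_DIRTY;
    if (!has_mapping)
        return BUF_FREED;
    if (count > 2)
        return BUF_TO_ACTIVE;
    return BUF_TO_INACTIVE_CLEAN;
}

/* IO mode: 2 = synchronous, 1 = asynchronous, 0 = none.
 * *maxlaunder is consumed the same way as in the listing. */
int pick_wait_mode(int launder_loop, int *maxlaunder, int sync)
{
    if (launder_loop && *maxlaunder == 0 && sync)
        return 2;
    if (launder_loop && (*maxlaunder)-- > 0)
        return 1;
    return 0;
}
```

On the first pass (launder_loop == 0) no IO is started at all and the maxlaunder budget is left untouched; synchronous IO is only chosen once that budget has been used up.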
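If pages are still short after page_launder(), shrink_dcache_memory() and shrink_icache_memory()
are called to release the dentry cache and the inode cache respectively, and refill_inactive() is called to reclaim further. Otherwise, if
pages are plentiful, only kmem_cache_reap() is called to reclaim slab caches.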
 
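Next, the refill_inactive() function.
It first works out how many pages the system still needs and reclaims slab caches. Then comes a do-while loop that starts at the lowest priority, 6, and steps up the effort down to priority 0.
Each iteration calls refill_inactive_scan() (analysed above) to try to move some active pages to the inactive dirty list,
then calls shrink_dcache_memory() and shrink_icache_memory() to release dentry and inode caches,
then calls swap_out() repeatedly, as many times as count allows, each call trying to pick a process, scan its page tables and find pages that can be made inactive. Finally, after the loop, refill_inactive_scan() is called repeatedly once more, again bounded by count, and the function is done.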
/*
 * We need to make the locks finer granularity, but right
 * now we need this so that we can do page allocations
 * without holding the kernel lock etc.
 *
 * We want to try to free "count" pages, and we want to
 * cluster them so that we get good swap-out behaviour.
 *
 * OTOH, if we're a user process (and not kswapd), we
 * really care about latency. In that case we don't try
 * to free too many pages.
 */
static int refill_inactive(unsigned int gfp_mask, int user)
{
    int priority, count, start_count, made_progress;

    count = inactive_shortage() + free_shortage();  /* number of pages needed */
    if (user)
        count = (1 << page_cluster);
    start_count = count;

    /* Always trim SLAB caches when memory gets low. */
    kmem_cache_reap(gfp_mask);

    priority = 6;  /* start at the lowest priority, 6 */
    do {
        made_progress = 0;

        /*
         * On every iteration, check whether a reschedule has been
         * requested for the current process (for example by an
         * interrupt handler).
         */
        if (current->need_resched) {
            __set_current_state(TASK_RUNNING);
            schedule();
        }

        /* Scan the active list for pages that can be made inactive. */
        while (refill_inactive_scan(priority, 1)) {
            made_progress = 1;
            if (--count <= 0)
                goto done;
        }

        /*
         * don't be too light against the d/i cache since
         * refill_inactive() almost never fail when there's
         * really plenty of memory free.
         */
        shrink_dcache_memory(priority, gfp_mask);
        shrink_icache_memory(priority, gfp_mask);

        /*
         * Then, try to page stuff out..
         * (pick a process, scan its page tables and look for
         * pages that can be made inactive)
         */
        while (swap_out(priority, gfp_mask)) {
            made_progress = 1;
            if (--count <= 0)
                goto done;
        }

        /*
         * If we either have enough free memory, or if
         * page_launder() will be able to make enough
         * free memory, then stop.
         */
        if (!inactive_shortage() || !free_shortage())
            goto done;

        /*
         * Only switch to a lower "priority" if we
         * didn't make any useful progress in the
         * last loop.
         */
        if (!made_progress)
            priority--;
    } while (priority >= 0);

    /* Always end on a refill_inactive.., may sleep... */
    while (refill_inactive_scan(0, 1)) {
        if (--count <= 0)
            goto done;
    }

done:
    return (count < start_count);
}
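Two details of refill_inactive() can be checked in isolation: how the target count is chosen, and how the priority only drops after a pass without progress. The sketch below is a user-space model with hypothetical function names.

```c
#include <assert.h>

/* Pages to aim for: a user process only asks for one cluster,
 * kswapd asks for the full shortage. */
int refill_target(int inactive_short, int free_short, int user, int page_cluster)
{
    assert(page_cluster >= 0 && page_cluster < 16);
    if (user)
        return 1 << page_cluster;
    return inactive_short + free_short;
}

/* The priority is lowered only after a pass that made no progress. */
int next_priority(int priority, int made_progress)
{
    return made_progress ? priority : priority - 1;
}

/* Number of loop iterations if no pass ever makes progress. */
int worst_case_passes(int start_priority)
{
    int passes = 0;
    int priority = start_priority;

    do {
        passes++;
        priority = next_priority(priority, 0);
    } while (priority >= 0);

    return passes;
}
```

Starting from priority 6, a run that never makes progress goes through seven passes (priorities 6 down to 0) before the loop gives up.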
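Now the implementation of swap_out().
counter, the loop count, is computed from the number of threads in the system and the priority that swap_out() was called with. Each iteration picks the most suitable process (best) among all processes, breaks the mappings of some of its pages and so moves them towards the inactive state. "Most suitable" combines two rules: "rob the rich to help the poor" (prefer the process with the largest resident set) and "take turns" (a process that has already been examined is not picked again until the others have had their turn).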
 
static int swap_out(unsigned int priority, int gfp_mask)
{
    int counter;  /* number of iterations */
    int __ret = 0;

    /*
     * We make one or two passes through the task list, indexed by
     * assign = {0, 1}:
     *   Pass 1: select the swappable task with maximal RSS that has
     *           not yet been swapped out.
     *   Pass 2: re-assign rss swap_cnt values, then select as above.
     *
     * With this approach, there's no need to remember the last task
     * swapped out. If the swap-out fails, we clear swap_cnt so the
     * task won't be selected again until all others have been tried.
     *
     * Think of swap_cnt as a "shadow rss" - it tells us which process
     * we want to page out (always try largest first).
     */
    /* counter is derived from the number of threads and the priority. */
    counter = (nr_threads << SWAP_SHIFT) >> priority;
    if (counter < 1)
        counter = 1;

    for (; counter >= 0; counter--) {
        struct list_head *p;
        unsigned long max_cnt = 0;
        struct mm_struct *best = NULL;
        int assign = 0;
        int found_task = 0;
    select:
        spin_lock(&mmlist_lock);
        p = init_mm.mmlist.next;
        for (; p != &init_mm.mmlist; p = p->next) {
            struct mm_struct *mm = list_entry(p, struct mm_struct, mmlist);
            if (mm->rss <= 0)
                continue;
            found_task++;
            /*
             * Refresh swap_cnt?
             * When no mm with a non-zero swap_cnt can be found, assign
             * is set to 1 and the list is scanned again. On that second
             * scan every swap_cnt (the number of pages not yet examined
             * in this round) is reloaded from the mm's rss, so selection
             * starts again from the richest process. swap_cnt provides
             * the "take turns" rule, rss the "rob the rich" rule.
             */
            if (assign == 1) {
                /* pages of this mm still to be examined in the new round */
                mm->swap_cnt = (mm->rss >> SWAP_SHIFT);
                if (mm->swap_cnt < SWAP_MIN)
                    mm->swap_cnt = SWAP_MIN;
            }
            if (mm->swap_cnt > max_cnt) {
                max_cnt = mm->swap_cnt;
                best = mm;
            }
        }
        /* After the loop, best is the mm with the largest swap_cnt. */

        /* Make sure it doesn't disappear */
        if (best)
            atomic_inc(&best->mm_users);
        spin_unlock(&mmlist_lock);

        /*
         * We have dropped the tasklist_lock, but we
         * know that "mm" still exists: we are running
         * with the big kernel lock, and exit_mm()
         * cannot race with us.
         */
        if (!best) {
            /*
             * First time here: every mm->swap_cnt is 0. We do not
             * come back a second time (assign is already 1 then).
             */
            if (!assign && found_task > 0) {
                assign = 1;  /* second scan */
                goto select;
            }
            break;
        } else {
            /* Found the best candidate: swap out from it with swap_out_mm(). */
            __ret = swap_out_mm(best, gfp_mask);
            mmput(best);
            break;
        }
    }
    return __ret;
}
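The selection rule can be tried out on a plain array instead of the kernel's mm list. This is a sketch with hypothetical names; the shift of 5 and the minimum of 8 stand in for SWAP_SHIFT and SWAP_MIN and are assumed values, since the listing does not show their definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins for SWAP_SHIFT and SWAP_MIN. */
enum { MODEL_SWAP_SHIFT = 5, MODEL_SWAP_MIN = 8 };

struct mm_model {
    long rss;                /* resident pages */
    unsigned long swap_cnt;  /* pages still to examine in this round */
};

/* Pick the mm with the largest swap_cnt; if all are 0, refill from rss once. */
struct mm_model *pick_victim(struct mm_model *mms, size_t n)
{
    int assign;

    assert(mms != NULL || n == 0);
    for (assign = 0; assign <= 1; assign++) {
        struct mm_model *best = NULL;
        unsigned long max_cnt = 0;
        size_t i, found = 0;

        for (i = 0; i < n; i++) {
            struct mm_model *mm = &mms[i];

            if (mm->rss <= 0)
                continue;
            found++;
            if (assign) {  /* second scan: start a new round */
                mm->swap_cnt = (unsigned long)mm->rss >> MODEL_SWAP_SHIFT;
                if (mm->swap_cnt < MODEL_SWAP_MIN)
                    mm->swap_cnt = MODEL_SWAP_MIN;
            }
            if (mm->swap_cnt > max_cnt) {
                max_cnt = mm->swap_cnt;
                best = mm;
            }
        }
        if (best || found == 0)
            return best;
    }
    return NULL;
}
```

Once the largest process has used up its swap_cnt, the next call picks the second largest without refreshing anything, which is the "take turns" half of the rule.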
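swap_out_mm() works its way down the call chain swap_out_vma() > swap_out_pgd() > swap_out_pmd() > try_to_swap_out(). try_to_swap_out() is where the decisive work happens, so it is the function analysed here. Note that its page_table argument points to a page table entry, not to a page table.
It starts by checking that the page is a valid candidate for swap-out. If the page has been accessed, its age is increased and nothing else is done. Even a page that is not on the active list and has not been accessed recently cannot be swapped out straight away; it stays under observation until its
page->age reaches 0. Once page->age is 0 and the tests above have passed, the page table entry is cleared to 0. The next check is whether the page is already in the swap cache. If it is, and it was written to recently, the page is marked dirty; the page table entry is then set to the swap entry, the page is moved to an inactive list and the reference to it is dropped.
If the page is clean and not in the swap cache, the mapping is removed outright rather than replaced by a swap entry. If the page is dirty, comes from a file mapping (mmap) and is not in the swap cache, it is marked dirty, which puts it on the dirty page list of that file's mapping, and the mapping is removed.
If the page is dirty, does not belong to a file mapping and is not in the swap cache, a page on the swap device must first be allocated for it with get_swap_page(); the content is written there later, when the dirty page is laundered.
At the same time add_to_swap_cache() links the page into the swapper_space queue and the active page list.
That completes the scan of one process's pages.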
/*
 * The swap-out functions return 1 if they successfully
 * threw something out, and we got a free page. It returns
 * zero if it couldn't do anything, and any other value
 * indicates it decreased rss, but the page was shared.
 *
 * NOTE! If it sleeps, it *must* return 1 to make sure we
 * don't continue with the swap-out. Otherwise we may be
 * using a process that no longer actually exists (it might
 * have died while we slept).
 */
static int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, int gfp_mask)
{
    pte_t pte;
    swp_entry_t entry;
    struct page * page;
    int onlist;

    pte = *page_table;  /* read the page table entry */
    if (!pte_present(pte))  /* is the page in physical memory? */
        goto out_failed;
    page = pte_page(pte);  /* the page the entry points to */
    if ((!VALID_PAGE(page)) || PageReserved(page))  /* invalid page, or one that may not be swapped out */
        goto out_failed;

    if (!mm->swap_cnt)
        return 1;
    /* One more page of this mm is being examined: decrement swap_cnt. */
    mm->swap_cnt--;

    onlist = PageActive(page);  /* is the page on the active list? */
    /* Don't look at this pte if it's been accessed recently. */
    if (ptep_test_and_clear_young(page_table)) {  /* accessed recently (the page is "young") */
        age_page_up(page);  /* keep it under observation for longer */
        goto out_failed;
    }
    /*
     * Even a page that is not on the active list and was not accessed
     * recently is not swapped out at once; it stays under observation
     * until page->age reaches 0.
     */
    if (!onlist)
        age_page_down_ageonly(page);

    /*
     * If the page is in active use by us, or if the page
     * is in active use by others, don't unmap it or
     * (worse) start unneeded IO.
     */
    if (page->age > 0)
        goto out_failed;

    if (TryLockPage(page))
        goto out_failed;

    /* From this point on, the odds are that we're going to
     * nuke this pte, so read and clear the pte. This hook
     * is needed on CPUs which update the accessed and dirty
     * bits in hardware.
     */
    /* Clear the page table entry (this removes the mapping). */
    pte = ptep_get_and_clear(page_table);
    flush_tlb_page(vma, address);

    /*
     * Is the page already in the swap cache? If so, then
     * we can just drop our reference to it without doing
     * any IO - it's already up-to-date on disk.
     *
     * Return 0, as we didn't actually free any real
     * memory, and we should just continue our scan.
     */
    if (PageSwapCache(page)) {  /* already in the swap cache? */
        entry.val = page->index;
        if (pte_dirty(pte))
            set_page_dirty(page);  /* mark the page dirty */
set_swap_pte:
        swap_duplicate(entry);  /* take another reference on the swap entry */
        set_pte(page_table, swp_entry_to_pte(entry));  /* the pte now holds the swap entry */
drop_pte:
        UnlockPage(page);
        mm->rss--;  /* one resident page fewer for this mm */
        deactivate_page(page);  /* move it from the active list to an inactive list */
        page_cache_release(page);  /* drop the reference the mapping held */
out_failed:
        return 0;
    }

    /*
     * Is it a clean page? Then it must be recoverable
     * by just paging it in again, and we can just drop
     * it..
     *
     * However, this won't actually free any real
     * memory, as the page will just be in the page cache
     * somewhere, and as such we should just continue
     * our scan.
     *
     * Basically, this just makes it possible for us to do
     * some real work in the future in "refill_inactive()".
     */
    flush_cache_page(vma, address);
    if (!pte_dirty(pte))
        goto drop_pte;

    /*
     * Ok, it's really dirty. That means that
     * we should either create a new swap cache
     * entry for it, or we should write it back
     * to its own backing store.
     */
    if (page->mapping) {
        set_page_dirty(page);
        goto drop_pte;
    }

    /*
     * This is a dirty, swappable page. First of all,
     * get a suitable swap entry for it, and make sure
     * we have the swap cache set up to associate the
     * page with that swap entry.
     */
    entry = get_swap_page();
    if (!entry.val)
        goto out_unlock_restore; /* No swap space left */

    /* Add it to the swap cache and mark it dirty */
    add_to_swap_cache(page, entry);
    set_page_dirty(page);
    goto set_swap_pte;

out_unlock_restore:
    set_pte(page_table, pte);
    UnlockPage(page);
    return 0;
}
 
 

posted on 2017-05-11 20:07 by 笨拙的菜鸟
