## Linux Traditional Huge Pages and Transparent Huge Pages: A Second Study Summary

2017-10-31 09:38  潇湘隐者

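Linux has two kinds of large pages: standard huge pages (Huge Pages) and transparent huge pages (Transparent Huge Pages). Huge Pages are also called large pages, standard huge pages, or traditional huge pages. These are all names for the same thing, which is worth stating up front so the terminology does not confuse or mislead anyone. Huge Pages were introduced with Linux kernel 2.6. Their purpose is to use a larger memory page size to suit ever-growing system memory, so that the operating system can take advantage of the large-page support in modern hardware architectures. Transparent Huge Pages, abbreviated THP, are a feature introduced in RHEL 6 (and in related distributions: SUSE Linux Enterprise Server 11, and Oracle Linux 6 with earlier releases of Oracle Linux Unbreakable Enterprise Kernel 2 (UEK2)). See the official documentation for details.

What is the difference between the two? It lies in how the large pages are allocated: standard huge pages are preallocated, whereas transparent huge pages are allocated dynamically. Many people conflate Huge Pages with Transparent Huge Pages. Using transparent huge pages together with traditional HugePages currently causes problems, including performance degradation and system restarts, so Oracle recommends disabling Transparent Huge Pages. In Oracle Linux 6.5, transparent HugePages have been removed.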

HugePages is a feature integrated into the Linux kernel with release 2.6. It is a method to have larger pages where it is useful for working with very large memory. It can be useful for both 32-bit and 64-bit configurations. HugePage sizes vary from 2MB to 256MB, depending on the kernel version and the hardware architecture. For Oracle Databases, using HugePages reduces the operating system maintenance of page states, and increases TLB (Translation Lookaside Buffer) hit ratio.

The official RHEL documentation describes traditional huge pages (Huge Pages) and transparent huge pages (Transparent Huge Pages) as follows (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html):

Huge pages can be difficult to manage manually, and often require significant changes to code in order to be used effectively. As such, Red Hat Enterprise Linux 6 also implemented the use of transparent huge pages(THP). THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages.

THP hides much of the complexity in using huge pages from system administrators and developers. As the goal of THP is improving performance, its developers (both from the community and Red Hat) have tested and optimized THP across a wide range of systems, configurations, applications, and workloads. This allows the default settings of THP to improve the performance of most system configurations. However, THP is not recommended for database workloads.

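Since THP is not recommended for database workloads, it helps to know which THP mode a host is actually running. The sketch below extracts the bracketed (active) value from the kernel's THP `enabled` file. The helper name `thp_mode` is made up for illustration, and the two paths are the mainline sysfs location plus the `redhat_transparent_hugepage` variant used by RHEL 6 kernels; either may be absent on a given system.

```shell
#!/bin/sh
# thp_mode: print the active THP mode from a line such as
# "[always] madvise never". The active choice is the bracketed word.
thp_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Try the mainline path first, then the RHEL 6 variant.
# If neither file is readable, nothing is printed.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "THP mode ($f): $(thp_mode "$(cat "$f")")"
        break
    fi
done
```

A result of `never` means THP is disabled; `always` or `madvise` means it is still active in some form.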

    # grep PageTables /proc/meminfo
    PageTables:      1573080 kB

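Why does Oracle use standard huge pages (Huge Pages) to improve performance? The Oracle database uses shared memory (the SGA) to manage resources that can be shared: for example, the shared pool stores shared SQL statements and execution plans, and the buffer pool stores data blocks. Accessing these resources is really a matter of Oracle using OS APIs to access memory. Memory operations are normally fast, and in that case the Oracle database works fine. In some situations, however, performance problems appear: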

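a) If part of the SGA is swapped out to disk, accessing it again takes a very long time.

b) If the OS has a very large amount of memory, managing and locating the memory we need takes longer.

Linux HugePages can resolve the performance fluctuations caused by these two problems.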

- Page Table: A page table is the data structure of a virtual memory system in an operating system that stores the mapping between virtual addresses and physical addresses. This means that on a virtual memory system, memory is accessed by first accessing a page table and then, implicitly, the actual memory location.

- TLB: A Translation Lookaside Buffer (TLB) is a buffer (or cache) in a CPU that contains parts of the page table. This is a fixed-size buffer used to make virtual address translation faster.

- hugetlb: This is an entry in the TLB that points to a HugePage (a large/big page larger than the regular 4K and predefined in size). HugePages are implemented via hugetlb entries, i.e. we can say that a HugePage is handled by a "hugetlb page entry". The "hugetlb" term is also (and mostly) used synonymously with a HugePage (See Note 261889.1). In this document the term "HugePage" is going to be used, but keep in mind that "hugetlb" mostly refers to the same concept.

- hugetlbfs: This is an in-memory filesystem, like tmpfs, introduced with the 2.6 kernel. Pages allocated on a hugetlbfs type filesystem are allocated in HugePages.

HugePages in 2.4 Kernels

The HugePages feature is backported to some 2.4 kernels. Kernel versions 2.4.21-* have this feature (See Note 311504.1 for the distributions with 2.4.21 kernels) but it is implemented in a different way. The feature is completely available. The difference from the 2.6 implementation is the organization within the source code and the kernel parameters that are used for configuring HugePages. See the Parameters/Setup section below.

Advantages of HugePages Over Normal Sharing Or AMM (see below)

- Not swappable: HugePages are not swappable. Therefore there is no page-in/page-out mechanism overhead. HugePages are universally regarded as pinned in RAM.

- Relief of TLB pressure:

  - HugePages use fewer pages to cover the physical address space, so the amount of "bookkeeping" (mapping from virtual to physical addresses) decreases and fewer entries are required in the TLB.

  - TLB entries cover a larger part of the address space when HugePages are used, so there will be fewer TLB misses before all or most of the SGA is mapped.

  - Fewer TLB entries for the SGA also means more entries for other parts of the address space.

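The TLB is a cache that maps virtual addresses directly to physical addresses. It improves performance by avoiding page table lookups, but a large number of TLB misses inevitably hurts system performance, especially for sequential reads. As we know from the second article, hugepages greatly reduce the number of PTEs, which means fewer PTEs are needed to reach the same amount of content. The TLB has a limited number of slots, typically only 512, so fewer PTEs also mean a higher TLB hit ratio.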

- Decreased page table overhead: Each page table entry can be as large as 64 bytes, and if we are trying to handle 50GB of RAM, the page table will be approximately 800MB in size, which in practice will not fit in the 880MB lowmem (in 2.4 kernels; the page table is not necessarily in lowmem in 2.6 kernels) considering the other uses of lowmem. When 95% of memory is accessed via 256MB hugepages, this can work with a page table of approximately 40MB in total. See also Document 361468.1.

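- Eliminated page table lookup overhead: Since the pages are not subject to replacement, page table lookups are not required. With fewer PTEs, many page table lookups become unnecessary and the remaining lookups are faster. On a TLB miss, up to three additional memory reads may be needed to translate a linear address into a physical address.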

- Faster overall memory performance: On virtual memory systems each memory operation is actually two abstract memory operations. Since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.
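The 800MB and 40MB page table figures quoted in the list above can be reproduced with a few lines of shell arithmetic. This is a back-of-the-envelope sketch, not a description of how the kernel actually sizes its page tables: the function name `pt_bytes` is hypothetical, and the 64-byte entry size is simply the upper bound quoted in the text.

```shell
#!/bin/sh
# pt_bytes MEM_BYTES PAGE_BYTES ENTRY_BYTES
# Bytes of page table entries needed to map MEM_BYTES
# using pages of PAGE_BYTES, one entry per page.
pt_bytes() {
    echo $(( $1 / $2 * $3 ))
}

GIB=$(( 1024 * 1024 * 1024 ))
MIB=$(( 1024 * 1024 ))
RAM=$(( 50 * GIB ))

# Case 1: all 50GB mapped with regular 4KB pages.
all_4k=$(pt_bytes "$RAM" 4096 64)

# Case 2: 95% mapped with 256MB hugepages, the remaining 5% with 4KB pages.
huge=$(pt_bytes $(( RAM / 100 * 95 )) $(( 256 * MIB )) 64)
rest=$(pt_bytes $(( RAM / 100 * 5 )) 4096 64)

echo "4KB pages only:         $(( all_4k / MIB )) MB"
echo "95% in 256MB hugepages: $(( (huge + rest) / MIB )) MB"
```

Almost all of the remaining 40MB comes from the 5% of memory still mapped with 4KB pages; the 190 hugepage entries account for only about 12KB.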

HugePages Reservation

The HugePages reservation feature is fully implemented in 2.6.17 kernel, and thus EL5 (based on 2.6.18) has this feature. The alloc_huge_page() is improved for this. (See kernel source mm/hugetlb.c)

From /usr/share/doc/kernel-doc-2.6.18/Documentation/vm/hugetlbpage.txt:

HugePages_Rsvd is short for "reserved," and is the number of hugepages for which a commitment to allocate from the pool has been made, but no allocation has yet been made. It's vaguely analogous to overcommit.

This feature in the Linux kernel enables the Oracle Database to be able to allocate hugepages for the sublevels of the SGA on-demand. The same behaviour is expected for various Oracle Database versions that are certified on EL5.

HugePages and Oracle 11g Automatic Memory Management (AMM)

The AMM and HugePages are not compatible. One needs to disable AMM on 11g to be able to use HugePages. See Document 749851.1 for further information.

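In Linux, kswapd is the daemon responsible for kernel page swapping, and its job is to keep Linux memory management efficient. When physical memory runs short it becomes very aggressive, and in some cases it can consume 100% of a single CPU core. The kswapd process makes sure memory is continually being freed by monitoring the kernel's pages_high and pages_low thresholds. If the amount of free memory falls below pages_low, kswapd starts scanning and tries to free 32 pages at a time, repeating this until free memory rises above pages_high. For each page, kswapd does one of the following:

- If the page is unmodified, it is put back on the free list.

- If the page is modified and can be backed by the file system, its contents are written to disk.

- If the page is modified but has no disk backing, its contents are written to the swap device.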
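The threshold behaviour described above can be illustrated with a toy model. This is only a sketch of the description in the text, not the kernel's reclaim algorithm: the function name `kswapd_rounds` is hypothetical, the page counts are invented, and it assumes every pass frees exactly 32 pages.

```shell
#!/bin/sh
# kswapd_rounds FREE LOW HIGH
# Toy model: count how many 32-page reclaim passes are needed once
# FREE drops below LOW, stopping when FREE rises above HIGH.
kswapd_rounds() (
    free=$1 low=$2 high=$3 batch=32 rounds=0
    if [ "$free" -lt "$low" ]; then
        while [ "$free" -le "$high" ]; do
            free=$(( free + batch ))
            rounds=$(( rounds + 1 ))
        done
    fi
    echo "$rounds"
)

# Hypothetical page counts: 100 free, pages_low=128, pages_high=256.
kswapd_rounds 100 128 256
```

When free memory is already at or above the low threshold, the model reports zero passes, which corresponds to kswapd staying idle.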

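Check the huge page size:

    [root@DB-Server ~]$ grep Hugepagesize /proc/meminfo
    Hugepagesize:     2048 kB

To check whether standard huge pages (traditional huge pages, HugePages) are configured and in use:

    [oracle@DB-Server ~]$ cat /proc/sys/vm/nr_hugepages
    0
    [oracle@DB-Server ~]$ grep -i HugePages_Total /proc/meminfo
    HugePages_Total:     0

If HugePages_Total is 0, standard huge pages are not configured or not in use. If nr_hugepages is 0, standard huge pages are not configured.

Some of the kernel parameters for standard huge pages are shown below, first on RHEL 5.7 and then on RHEL 6.6:

    [oracle@DB-Server ~]$ more /etc/issue
    Red Hat Enterprise Linux Server release 5.7 (Tikanga)
    Kernel \r on an \m

    [oracle@DB-Server ~]$ grep Huge /proc/meminfo
    HugePages_Total:     0
    HugePages_Free:      0
    HugePages_Rsvd:      0
    Hugepagesize:     2048 kB

    [root@mylnx02 ~]# more /etc/issue
    Red Hat Enterprise Linux Server release 6.6 (Santiago)
    Kernel \r on an \m

    [root@mylnx02 ~]# grep Huge /proc/meminfo
    AnonHugePages:     18432 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB

- AnonHugePages: the number of anonymous HugePages. This counter relates to transparent HugePages and was removed in Oracle Linux 6.5.

- HugePages_Total: the number of pages allocated. Multiplied by Hugepagesize, it gives the amount of memory allocated.

- HugePages_Free: the number of HugePages that have never been used (that is, HugePages in the pool that are not yet allocated). Even if the Oracle SGA has already been given this memory, pages that have not actually been written to still show up as Free. This is easy to misread.

- HugePages_Rsvd: the number of pages that have been reserved but not yet used. Right after Oracle starts, most of the memory should be both Reserved and Free. As the Oracle SGA is used, both Reserved and Free keep decreasing.

- HugePages_Surp: short for "surplus"; the number of HugePages in the pool above the value in /proc/sys/vm/nr_hugepages. The maximum number of surplus HugePages is controlled by /proc/sys/vm/nr_overcommit_hugepages. A value of 0 is very common.

- Hugepagesize: the page size.

HugePages_Free - HugePages_Rsvd is memory that is not being used. If there is no other Oracle instance, this memory may never be used, which means it is wasted. HugePages_Total - HugePages_Free + HugePages_Rsvd is the number of pages the current instances need.

How do you size the standard huge page pool? Usually by changing the kernel parameter nr_hugepages, that is, by setting vm.nr_hugepages in the /etc/sysctl.conf configuration file:

    # echo "vm.nr_hugepages=512" >> /etc/sysctl.conf

The following are the basic steps for configuring standard huge pages for an Oracle database on a 64-bit Linux server. Adjust them to your actual environment. For details on configuring standard huge pages, see the official document HugePages on Oracle Linux 64-bit (Doc ID 361468.1).

Step 1: Add memlock limits to /etc/security/limits.conf. The value should be slightly smaller than the physical memory size (in KB). For example, with 64GB of physical memory it can be set as follows:

    * soft memlock 60397977
    * hard memlock 60397977

There is no harm if this value exceeds what the SGA requires. If you use the Oracle Linux oracle-validated package, or Exadata DB compute nodes, this parameter is configured automatically. Now look at a real test environment with 16GB of memory:

    [root@mylnx02 ~]# free -m
                 total       used       free     shared    buffers     cached
    Mem:         16077       9520       6556          0         37        766
    -/+ buffers/cache:       8716       7361
    Swap:        14015          0      14015

Here we edit /etc/security/limits.conf and set memlock to 16384000, slightly below 16077*1024:

    vi /etc/security/limits.conf
    * soft memlock 16384000
    * hard memlock 16384000

Step 2: Log in again as the account that owns the Oracle installation and verify memlock. In this test environment the account is oracle:

    [oracle@mylnx02 ~]$ ulimit -l
    16384000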
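The 64GB figure in step 1 works out to 90% of physical memory expressed in KB. A minimal sketch of that arithmetic follows. The helper name `memlock_kb` and the 90% default are assumptions drawn from that single example, not an Oracle rule, so choose whatever margin suits your system.

```shell
#!/bin/sh
# memlock_kb TOTAL_KB [PERCENT]
# Suggest a memlock limit (in KB) as a percentage of physical memory.
# PERCENT defaults to 90, which matches the 64GB example (an assumption).
memlock_kb() {
    echo $(( $1 * ${2:-90} / 100 ))
}

# 64GB expressed in KB.
memlock_kb $(( 64 * 1024 * 1024 ))

# On a live system, MemTotal in /proc/meminfo is already in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if [ -n "$total_kb" ]; then
        memlock_kb "$total_kb"
    fi
fi
```

The first call prints 60397977, the value used in the limits.conf example for a 64GB machine.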

Step 3: On Oracle 11g and later, instances are created with Automatic Memory Management (AMM) enabled by default, and AMM is incompatible with HugePages. AMM must be disabled before HugePages are configured, by setting the initialization parameters MEMORY_TARGET and MEMORY_MAX_TARGET to 0:

    [oracle@DB-Server ~]$ sqlplus / as sysdba

    SQL*Plus: Release 11.2.0.1.0 Production on Fri Oct 27 14:43:12 2017

    Copyright (c) 1982, 2009, Oracle.  All rights reserved.

    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    With the Partitioning, OLAP, Data Mining and Real Application Testing options

    SQL> show parameter memory_target;

    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- ------------------------------
    memory_target                        big integer 1552M
    SQL> show parameter memory_max_target;

    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- ------------------------------
    memory_max_target                    big integer 1552M
    SQL>
    SQL> alter system set memory_target=0 scope=both;
    SQL> alter system set memory_max_target=0 scope=spfile;

With AMM, all SGA memory is allocated under /dev/shm, so HugePages are not used when the SGA is allocated. This is why AMM and HugePages are incompatible. Note also that ASM instances use AMM by default, but because an ASM instance does not need a large SGA, using HugePages for ASM instances brings little benefit. If you want to use HugePages, you must first make sure the MEMORY_TARGET / MEMORY_MAX_TARGET parameters are not set.

Step 4: Make sure all of your Oracle database instances (including ASM instances) are started, then run hugepages_settings.sh (see MOS Document 401749.1; the script is shown below) to obtain the value for the kernel parameter vm.nr_hugepages.

    #!/bin/bash
    #
    # hugepages_settings.sh
    #
    # Linux bash script to compute values for the
    # recommended HugePages/HugeTLB configuration
    # on Oracle Linux
    #
    # Note: This script does calculation for all shared memory
    # segments available when the script is run, no matter it
    # is an Oracle RDBMS shared memory segment or not.
    #
    # This script is provided by Doc ID 401749.1 from My Oracle Support
    # http://support.oracle.com

    # Welcome text
    echo "
    This script is provided by Doc ID 401749.1 from My Oracle Support
    (http://support.oracle.com) where it is intended to compute values for
    the recommended HugePages/HugeTLB configuration for the current shared
    memory segments on Oracle Linux.
    Before proceeding with the execution please note following:
     * For ASM instance, it needs to configure ASMM instead of AMM.
     * The 'pga_aggregate_target' is outside the SGA and
       you should accommodate this while calculating SGA size.
     * In case you changes the DB SGA size,
       as the new SGA will not fit in the previous HugePages configuration,
       it had better disable the whole HugePages,
       start the DB with new SGA size and run the script again.
    And make sure that:
     * Oracle Database instance(s) are up and running
     * Oracle Database 11g Automatic Memory Management (AMM) is not setup
       (See Doc ID 749851.1)
     * The shared memory segments can be listed by command:
         # ipcs -m

    Press Enter to proceed..."

    read

    # Check for the kernel version (major.minor only)
    KERN=$(uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }')

    # Find out the HugePage size (in kB)
    HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')
    if [ -z "$HPG_SZ" ]; then
        echo "The hugepages may not be supported in the system where the script is being executed."
        exit 1
    fi

    # Initialize the counter
    NUM_PG=0

    # Cumulative number of pages required to handle the running shared memory segments
    for SEG_BYTES in $(ipcs -m | cut -c44-300 | awk '{print $1}' | grep "[0-9][0-9]*")
    do
        MIN_PG=$(echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q)
        if [ $MIN_PG -gt 0 ]; then
            NUM_PG=$(echo "$NUM_PG+$MIN_PG+1" | bc -q)
        fi
    done

    RES_BYTES=$(echo "$NUM_PG * $HPG_SZ * 1024" | bc -q)

    # An SGA less than 100MB does not make sense
    # Bail out if that is the case
    if [ $RES_BYTES -lt 100000000 ]; then
        echo "***********"
        echo "** ERROR **"
        echo "***********"
        echo "Sorry! There are not enough total of shared memory segments allocated for
    HugePages configuration. HugePages can only be used for shared memory segments
    that you can list by command:

        # ipcs -m

    of a size that can match an Oracle Database SGA. Please make sure that:
     * Oracle Database instance is up and running
     * Oracle Database 11g Automatic Memory Management (AMM) is not configured"
        exit 1
    fi

    # Finish with results
    case $KERN in
        '2.2') echo "Kernel version $KERN is not supported. Exiting." ;;
        '2.4') HUGETLB_POOL=$(echo "$NUM_PG*$HPG_SZ/1024" | bc -q);
               echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
        '2.6') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
        '3.8') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
        '3.10') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
        '4.1') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
    esac

    # End

Running the script in the test environment:

    [root@mylnx02 ~]# ./hugepages_settings.sh

    This script is provided by Doc ID 401749.1 from My Oracle Support
    (http://support.oracle.com) where it is intended to compute values for
    the recommended HugePages/HugeTLB configuration for the current shared
    memory segments on Oracle Linux.
    Before proceeding with the execution please note following:
     * For ASM instance, it needs to configure ASMM instead of AMM.
     * The 'pga_aggregate_target' is outside the SGA and
       you should accommodate this while calculating SGA size.
     * In case you changes the DB SGA size,
       as the new SGA will not fit in the previous HugePages configuration,
       it had better disable the whole HugePages,
       start the DB with new SGA size and run the script again.
    And make sure that:
     * Oracle Database instance(s) are up and running
     * Oracle Database 11g Automatic Memory Management (AMM) is not setup
       (See Doc ID 749851.1)
     * The shared memory segments can be listed by command:
         # ipcs -m

    Press Enter to proceed...

    Recommended setting: vm.nr_hugepages = 4098

Step 5: Set the vm.nr_hugepages parameter in /etc/sysctl.conf:

    [root@mylnx02 ~]# vi /etc/sysctl.conf
    vm.nr_hugepages=4098

Step 6: Shut down all database instances and reboot the server.

Step 7: Verify that the configuration is correct, as shown below:

    [oracle@mylnx02 ~]$ grep HugePages /proc/meminfo
    HugePages_Total:    4098
    HugePages_Free:     3439
    HugePages_Rsvd:     3438
    HugePages_Surp:        0

The values in the output will vary. To make sure that the configuration is valid, the HugePages_Free value should be smaller than HugePages_Total and there should be some HugePages_Rsvd. HugePages_Rsvd counts free pages that are reserved for use (requested for an SGA, but not touched/mapped yet).
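The rule of thumb above can be turned into a small check. The sketch below reads `/proc/meminfo`-style text on standard input, applies the "Free smaller than Total, some Rsvd" test, and also prints HugePages_Total - HugePages_Free + HugePages_Rsvd, the page count described earlier as what the running instances need. The function name `hugepages_check` and its output wording are illustrative, not part of any Oracle tool.

```shell
#!/bin/sh
# hugepages_check: read /proc/meminfo-style text on stdin and report
# "OK" when HugePages_Free < HugePages_Total and HugePages_Rsvd > 0,
# otherwise "CHECK". Also prints Total - Free + Rsvd as needed=N.
hugepages_check() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free  = $2 }
        /^HugePages_Rsvd:/  { rsvd  = $2 }
        END {
            status = (free < total && rsvd > 0) ? "OK" : "CHECK"
            printf "%s needed=%d\n", status, total - free + rsvd
        }'
}

# Values taken from the sample output above.
printf '%s\n' \
    "HugePages_Total:    4098" \
    "HugePages_Free:     3439" \
    "HugePages_Rsvd:     3438" \
    "HugePages_Surp:        0" | hugepages_check
```

On a live host the same function can be fed directly with `hugepages_check < /proc/meminfo`.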

The sum of Hugepages_Free and HugePages_Rsvd may be smaller than your total combined SGA as instances allocate pages dynamically and proactively as needed.

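With Oracle 11.2.0.3 and later, you can check the alert log to verify whether large pages are enabled for a database instance. When the instance starts, you should see something like the following in the alert log, ahead of the parameter list: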

    ****************** Large Pages Information *****************
    Total Shared Global Region in Large Pages = 28 GB (100%)
    Large Pages used by this instance: 14497 (28 GB)
    Large Pages unused system wide = 1015 (2030 MB) (alloc incr 64 MB)
    Large Pages configured system wide = 19680 (38 GB)
    Large Page size = 2048 KB

use_large_pages = {true/only/false/auto}

SQL> alter system set use_large_pages=true scope=spfile sid='*';

https://help.marklogic.com/Knowledgebase/Article/View/16/0/linux-huge-pages-and-transparent-huge-pages

https://support.oracle.com/epmos/faces/DocumentDisplay?parent=DOCUMENT&sourceId=361468.1&id=401749.1

http://www.oracle.com/technetwork/cn/articles/servers-storage-dev/hugepages-2099009-zhs.html

https://developers.redhat.com/blog/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/

https://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64

https://access.redhat.com/documentation/zh-CN/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Large_Memory_Optimization_Big_Pages_and_Huge_Pages-Configuring_Huge_Pages_in_Red_Hat_Enterprise_Linux_4_or_5.html

https://blogs.oracle.com/database4cn/linux-64hugepage

https://docs.oracle.com/cd/E11882_01/install.112/e41961/memry.htm#CBAFIFGJ