Virtualization and NUMA
Introduction to NUMA
NUMA (Non-Uniform Memory Access) is defined in contrast to UMA (Uniform Memory Access). Early computer architectures were all UMA, as shown in the figure.

All CPUs (processors) access memory uniformly through a shared bus, so every CPU reaches every memory unit at the same speed. In a multiprocessor system, tasks are dispatched to run concurrently on the different processors, so contention for the shared memory becomes frequent and efficiency drops.
As multiprocessor architectures became widespread and processor counts kept growing, the NUMA architecture emerged, as shown in the figure.

Processors and memory are partitioned into nodes; a processor accesses memory within its own node faster than memory on other nodes.
NUMA Tools
numastat
numastat shows how the memory consumed by one or more processes, or by the whole system, is distributed across the NUMA nodes.
numastat -c -z -m -n
numastat -czs libvirt kvm qemu
watch -n1 numastat
watch -n1 --differences=cumulative numastat
numastat with no command options or arguments displays per-node NUMA hit and miss system statistics from the kernel memory allocator. This default behavior is strictly compatible with the previous long-standing numastat perl script, written by Andi Kleen. The default statistics are per-node numbers (in units of pages of memory) in these categories:
- numa_hit is memory successfully allocated on this node as intended.
- numa_miss is memory allocated on this node despite the process preferring some different node. Each numa_miss has a numa_foreign on another node.
- numa_foreign is memory intended for this node, but actually allocated on some different node. Each numa_foreign has a numa_miss on another node.
- interleave_hit is interleaved memory successfully allocated on this node as intended.
- local_node is memory allocated on this node while a process was running on it.
- other_node is memory allocated on this node while a process was running on some other node.
Any supplied options or arguments significantly change both the content and the format of the display: options switch the display units to megabytes of memory and change other specific behaviors of numastat as described in its man page.
Memory usage information reflects the resident pages on the system.
numad
numad is a tool (and a daemon) that manages NUMA affinity automatically. It monitors the NUMA topology and resource usage in real time and rebalances dynamically; it can also provide NUMA placement advice before a program is started.
The kernel's automatic NUMA balancing (/proc/sys/kernel/numa_balancing) performs a similar dynamic adjustment of NUMA resources. Once numad is running, it overrides the kernel's automatic NUMA balancing.
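Whether the kernel's automatic balancing is currently active can be read from that sysctl file; a minimal sketch that tolerates kernels where the knob is absent:

```shell
# Report the kernel's automatic NUMA balancing state.
# The file only exists on NUMA-capable kernels, so check first.
numa_balancing_status() {
    f=/proc/sys/kernel/numa_balancing
    if [ -r "$f" ]; then
        echo "numa_balancing=$(cat "$f")"
    else
        echo "numa_balancing: not available on this kernel"
    fi
}
numa_balancing_status
```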
numactl
If numad adjusts NUMA resource allocation after the fact (once the guest is already running), numactl proactively assigns NUMA nodes when a program starts. Despite its name, numactl does more than set NUMA affinity: it can also set the memory policy for shared memory or hugetlbfs files, and the CPU and memory affinity of a process.
The emerging standard for easily binding processes to processors on Linux-based NUMA supercomputers is numactl. It can operate on a coarser-grained basis (i.e., CPU sockets rather than individual CPU cores) than taskset (only CPU cores) because it is aware of the processor topology and how the CPU cores map to CPU sockets. Using numactl is typically easier: after all, the common goal is to confine a process to a NUMA pool (or "cpu node") rather than to specific CPU cores. To that end, numactl also lets you bind a process's memory locality to prevent processes from having to jump across NUMA pools (called "memory nodes" in numactl parlance). The policy is set for a command and inherited by all of its children. In addition, it can set a persistent policy for shared memory segments or files.
Example uses:
- numactl --cpubind=0 --membind=0,1 myprog : run "myprog" on the CPUs of node 0, using memory on nodes 0 and 1.
- numactl --physcpubind=+0-4,8-12 myapplic arguments : run myapplic on cpus 0-4 and 8-12 of the current cpuset.
- numactl --interleave=all bigdatabase arguments : run big database with its memory interleaved on all CPUs.
- numactl --cpubind=0 --membind=0,1 process : run process on node 0 with memory allocated on nodes 0 and 1.
- numactl --preferred=1 numactl --show : set preferred node 1 and show the resulting state.
- numactl --interleave=all --shmkeyfile /tmp/shmkey : interleave all of the sysv shared memory region specified by /tmp/shmkey over all nodes.
- numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch : bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
- numactl --localalloc /dev/shm/file : reset the policy for the shared memory file to the default localalloc policy.
Virtualization and NUMA
Under NUMA, a CPU may end up using memory that belongs to another node, and when that happens a VM's memory performance suffers.
Querying a VM's NUMA settings
Use virsh to query a VM's numatune and vcpu information:
#!/bin/bash
# Poll a domain's state, numatune settings, and VCPU-to-pCPU mapping.
# Note: "$NF/8" below assumes 8 physical CPUs per NUMA node; adjust for your topology.
DOMAIN=$1
while true; do
    DOM_STATE=$(virsh list --all | awk -v d="$DOMAIN" '$0 ~ d {print $NF}')
    echo "${DOMAIN}: ${DOM_STATE}"
    virsh numatune "$DOMAIN"
    virsh vcpuinfo "$DOMAIN" | awk '/^VCPU:/ {printf "VCPU %d", $NF}
        /^CPU:/ {printf " on pCPU: %d (part of numa node: %d)\n", $NF, $NF/8}'
    sleep 2
done
virsh vcpuinfo shows the mapping between a VM's VCPUs and the physical CPUs.
# virsh vcpuinfo deepin
VCPU: 0
CPU: 10
State: running
CPU time: 36962.5s
CPU Affinity: yyyyyyyyyyyy
VCPU: 1
CPU: 5
State: running
CPU time: 37943.3s
CPU Affinity: yyyyyyyyyyyy
VCPU: 2
CPU: 1
State: running
CPU time: 50309.6s
CPU Affinity: yyyyyyyyyyyy
VCPU: 3
CPU: 11
State: running
CPU time: 38696.9s
CPU Affinity: yyyyyyyyyyyy
Each y marks a host logical CPU that the VCPU is allowed to run on.
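The VCPU-to-pCPU pairs in output like the above can be pulled out with a short awk filter; a sketch (the sample lines are abbreviated virsh vcpuinfo output):

```shell
# Extract "VCPU n -> pCPU m" pairs from virsh vcpuinfo output.
# /^CPU:/ matches only the "CPU:" line, not "CPU time:" or "CPU Affinity:".
parse_vcpuinfo() {
    awk '/^VCPU:/ {v=$NF} /^CPU:/ {print "VCPU " v " -> pCPU " $NF}'
}
parse_vcpuinfo <<'EOF'
VCPU:           0
CPU:            10
VCPU:           1
CPU:            5
EOF
```

This prints one mapping line per VCPU, e.g. "VCPU 0 -> pCPU 10".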
virsh emulatorpin shows which host logical CPUs the VM may use.
# virsh emulatorpin deepin
emulator: CPU Affinity
----------------------------------
*: 0-11
Note that consumer desktop CPUs are usually a single NUMA node; testing NUMA generally requires a server CPU.
Setting a VM's NUMA parameters with virsh
The virsh numatune command sets a VM's NUMA parameters, which live in the <numatune> element of the VM's XML. Run without options, virsh numatune domain only displays the VM's current settings. The following options are accepted:
- --mode can be set to strict, interleave, or preferred. A running domain cannot have its mode changed live unless the guest was started in strict mode.
- --nodeset contains a list of the host's NUMA nodes to be used for running the guest. Nodes are separated by commas, a dash - denotes a node range, and a caret ^ excludes a node.
Only one of the three following flags can be used per invocation:
- --config takes effect on the next boot of a persistent guest.
- --live sets the scheduler information of a running guest.
- --current affects the current state of the guest.
Example: set the NUMA mode of the running VM guest1 to strict and its nodeset to nodes 0 and 1:
virsh numatune guest1 --mode strict --nodeset 0,1 --live
Running this command modifies guest1's XML:
<numatune>
<memory mode='strict' nodeset='0,1'/>
</numatune>
Experiment: spread the running VMs evenly across the nodes with virsh
#!/bin/bash
# Round-robin the running VMs across the host's NUMA nodes.
VMS=$(virsh list | sed '1,2d' | awk '{print $2}')
NUM=$(lscpu | grep -i "NUMA node(s)" | cut -d : -f 2 | tr -d " ")
i=0
for vm in $VMS
do
    i=$((i + 1))
    N=$((i % NUM))   # target node for this VM
    CPUSET=$(lscpu | grep -i "NUMA node" | grep -i "node$N" | cut -d : -f 2 | tr -d " ")
    virsh numatune "$vm" --mode preferred --nodeset "$N" --live
done
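The placement logic in the script above is a plain round-robin over the node count; isolated as a function:

```shell
# Round-robin target node: the i-th VM goes to node (i mod num_nodes).
pick_node() {    # pick_node <vm_index> <num_nodes>
    echo $(( $1 % $2 ))
}
pick_node 1 2    # first VM  -> 1
pick_node 2 2    # second VM -> 0
```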
The command numastat -c qemu shows each qemu process's memory distribution across the NUMA nodes.
# numastat -c qemu
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
990 (qemu-system 26 4210 4235
6222 (qemu-syste 3706 599 4306
7491 (qemu-syste 1609 554 2164
11412 (qemu-syst 3472 525 3997
20626 (qemu-syst 2224 34 2258
27078 (qemu-syst 1596 572 2168
27780 (qemu-syst 3846 234 4080
27883 (qemu-syst 26 4303 4329
38295 (qemu-syst 3769 401 4170
41111 (qemu-syst 1892 0 1893
42006 (qemu-syst 4246 70 4316
--------------- ------ ------ -----
Total 26412 11503 37915
Server CPU topology
# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 79938 MB
node 0 free: 14577 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 80611 MB
node 1 free: 52045 MB
node distances:
node 0 1
0: 10 21
1: 21 10
# numastat
                          node0          node1
numa_hit           202778086242   166320549405  # allocations satisfied on this node as intended
numa_miss                     0       34572039  # allocations placed on this node although another node was preferred
numa_foreign           34572039              0  # allocations intended for this node but placed on another node
interleave_hit            34013          34184  # interleaved allocations satisfied on this node as intended
local_node         202697418366   166320386633  # allocations made on this node by processes running on it
other_node             80667876       34734811  # allocations made on this node by processes running on other nodes
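From these counters one can derive a miss ratio per node, which makes it easier to judge whether cross-node allocations matter; a small sketch:

```shell
# Fraction of a node's allocations that were numa_miss
# (i.e. the process preferred a different node).
miss_ratio() {    # miss_ratio <numa_hit> <numa_miss>
    awk -v hit="$1" -v miss="$2" 'BEGIN { printf "%.4f%%\n", 100 * miss / (hit + miss) }'
}
miss_ratio 166320549405 34572039    # node1 figures from the output above
```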
Linux enables automatic NUMA balancing by default. To disable it:
echo 0 > /proc/sys/kernel/numa_balancing
To enable it again: echo 1 > /proc/sys/kernel/numa_balancing
In addition, virsh emulatorpin can force a VM to be scheduled only on a subset of the physical CPUs.
root@node1:/opt/VM2/centos6# virsh emulatorpin centos7 0-31 --live
root@node1:/opt/VM2/centos6# virsh emulatorpin centos7
emulator: CPU Affinity
----------------------------------
*: 0-31
Pin VCPUs one-to-one to physical CPUs:
root@node1:/opt/VM2/centos6# virsh vcpupin centos7 0 1
root@node1:/opt/VM2/centos6# virsh vcpupin centos7 1 2
root@node1:/opt/VM2/centos6# virsh vcpupin centos7 2 3
root@node1:/opt/VM2/centos6# virsh vcpupin centos7 3 0
root@node1:/opt/VM2/centos6# virsh vcpuinfo centos7
VCPU: 0
CPU: 1
State: running
CPU time: 619.6s
CPU Affinity: -y--------------------------------------------------------------
VCPU: 1
CPU: 2
State: running
CPU time: 196.6s
CPU Affinity: --y-------------------------------------------------------------
VCPU: 2
CPU: 3
State: running
CPU time: 306.2s
CPU Affinity: ---y------------------------------------------------------------
VCPU: 3
CPU: 0
State: running
CPU time: 872.0s
CPU Affinity: y---------------------------------------------------------------
Setting a VM's NUMA parameters via XML
virsh numatune does not seem able to set the vcpu placement, so configure it directly in the XML.
The command lscpu | grep -i "NUMA node(s)" | cut -d : -f 2 | tr -d " " returns the number of NUMA nodes on the server, and ps aux | grep -i qemu | grep -i monitor.sock | grep -v grep | wc -l returns the number of VMs currently running.
Compute the node the VM should start on, then add the following to the VM's XML:
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
The command lscpu | grep -i "NUMA node" | grep -i node0 | cut -d : -f 2 | tr -d " " returns the CPUs belonging to a node; add the following to the XML:
<vcpu placement='static' cpuset='0-7,32-39'>1</vcpu>
With the settings above, the VM is bound to the chosen node. Note that memory mode strict means memory may only be allocated on that node; if the node runs short of memory, the VM will fail to start. Setting the vcpu placement to static pins the CPUs to the given cpuset.
If VMs will not all have a fixed number of cores in the future, for example if normal and high-performance VMs run side by side, NUMA probing becomes necessary: check how many VMs and how many threads each node is running, and start the new VM on the node running the fewest threads. This requires adding an automatic NUMA detection mechanism.
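As a sketch of such a detection step, given per-node VM (or thread) counts, the least-loaded node can be picked like this (the input pairs below are made up):

```shell
# Pick the node with the smallest count from "node count" lines on stdin.
least_loaded_node() {
    sort -k2 -n | head -n 1 | awk '{print $1}'
}
printf '0 3\n1 1\n' | least_loaded_node    # -> 1
```

In practice the counts would come from numastat or from counting qemu threads per node.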
A NUMA optimization plan for VMs

When many VMs already exist on a system and their NUMA load has become unbalanced, the following approach can rebalance them:

Open questions:
- How to obtain the per-node distribution of the qemu processes:
numastat -cs qemu
/proc/*/numa_maps
/proc/*/sched
/sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node*/numastat
- Given N nodes \([1,2,\dots,N]\) and M processes \([1,2,\dots,M]\), where process i occupies resources \([\text{cpu}_{i,j},\text{mem}_{i,j}]\) on node j and node j has resources \([\text{cpu}_j,\text{mem}_j]\): a new process k arrives, needing \([\text{cpu}_k,\text{mem}_k]\). To which node should it be assigned, and what is the optimality criterion? (Balance CPU load and memory load across nodes; CPU resources are natural numbers.)
- A process may have multiple threads: should CPU resources be accounted per process or per thread?
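One possible answer to the placement question, assuming "optimal" means minimizing the larger of the resulting CPU and memory utilization on the chosen node, is a greedy scan over the nodes (all numbers below are hypothetical):

```shell
# best_node <cpu_k> <mem_k>
# stdin lines: node cpu_used cpu_total mem_used mem_total
# Chooses the node where placing process k yields the smallest
# max(cpu utilization, memory utilization).
best_node() {
    awk -v cpu_k="$1" -v mem_k="$2" '
    {
        c = ($2 + cpu_k) / $3          # CPU utilization if k lands here
        m = ($4 + mem_k) / $5          # memory utilization if k lands here
        load = (c > m) ? c : m
        if (NR == 1 || load < bestload) { best = $1; bestload = load }
    }
    END { print best }'
}
printf '0 10 24 60 80\n1 4 24 20 80\n' | best_node 2 4    # -> 1
```

Other criteria (e.g. minimizing the variance of load across nodes) would change the scoring line but not the overall shape.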
Testing
NUMA mainly affects memory, so testing focuses on memory benchmarks:
- STREAM, STREAM2: http://www.cs.virginia.edu/stream/
- LMbench: http://www.bitmover.com/lmbench/
- memtester: http://pyropus.ca/software/memtester/
- unixbench
- mbw
- sysbench
