Virtualization and NUMA

Introduction to NUMA

NUMA (Non-Uniform Memory Access) is named in contrast to UMA (Uniform Memory Access). Early computer architectures were all UMA, as shown in the figure below.
[Figure: UMA architecture]

All CPUs (processors) access memory uniformly over a shared bus, so every CPU reaches every memory unit at the same speed. With multiple processors, tasks dispatched to the various processors run concurrently and contend for memory very frequently, and efficiency drops.
Hence, as multiprocessor architectures spread and processor counts kept growing, the NUMA architecture emerged, as shown below.
[Figure: NUMA architecture]

Processors and memory are partitioned into nodes; a processor accesses memory within its own node faster than memory on other nodes.

NUMA Tools

numastat

numastat shows how the memory consumption of one or more processes, or of the entire system, is distributed across the NUMA nodes.

       numastat -c -z -m -n
       numastat -czs libvirt kvm qemu
       watch -n1 numastat
       watch -n1 --differences=cumulative numastat

       numastat with no command options or arguments at all, displays per-
       node NUMA hit and miss system statistics from the kernel memory
       allocator.  This default numastat behavior is strictly compatible
       with the previous long-standing numastat perl script, written by Andi
       Kleen.  The default numastat statistics shows per-node numbers (in
       units of pages of memory) in these categories:

       numa_hit is memory successfully allocated on this node as intended.

       numa_miss is memory allocated on this node despite the process
       preferring some different node. Each numa_miss has a numa_foreign on
       another node.

       numa_foreign is memory intended for this node, but actually allocated
       on some different node.  Each numa_foreign has a numa_miss on another
       node.

       interleave_hit is interleaved memory successfully allocated on this
       node as intended.

       local_node is memory allocated on this node while a process was
       running on it.

       other_node is memory allocated on this node while a process was
       running on some other node.

       Any supplied options or arguments with the numastat command will
       significantly change both the content and the format of the display.
       Specified options will cause display units to change to megabytes of
       memory, and will change other specific behaviors of numastat as
       described below.

       Memory usage information reflects the resident pages on the system.

numad

numad is a tool (and also a daemon) that automatically manages NUMA affinity. It monitors the NUMA topology and resource usage in real time and adjusts placement dynamically. It can also provide NUMA placement advice before a program starts.
The kernel's auto NUMA balancing (/proc/sys/kernel/numa_balancing) performs similar dynamic NUMA tuning. Once started, numad overrides the kernel's auto NUMA balancing.

numactl

If numad tunes NUMA resource allocation after the fact (once a guest is already running), numactl proactively assigns a program's NUMA nodes at launch. Despite its name, numactl does more than set NUMA affinity: it can also set the memory policy for shared memory and hugepage filesystems, as well as a process's CPU and memory affinity.

The emerging standard for easily binding processes to processors on Linux-based NUMA supercomputers is numactl. It can operate on a coarser-grained basis (i.e., CPU sockets rather than individual CPU cores) than taskset (CPU cores only) because it is aware of the processor topology and how the CPU cores map to CPU sockets. Using numactl is typically easier; after all, the common goal is to confine a process to a NUMA pool (or "cpu node") rather than to specific CPU cores. To that end, numactl also lets you bind a process's memory locality to prevent processes from having to reach across NUMA pools (called "memory nodes" in numactl parlance). The policy is set for a command and inherited by all of its children. In addition, it can set a persistent policy for shared memory segments or files.

Example uses:

numactl --physcpubind=+0-4,8-12 myapplic arguments Run myapplic on cpus 0-4 and 8-12 of the current cpuset.

numactl --interleave=all bigdatabase arguments Run big database with its memory interleaved on all CPUs.

numactl --cpubind=0 --membind=0,1 process Run process on node 0 with memory allocated on node 0 and 1.

numactl --preferred=1 numactl --show Set preferred node 1 and show the resulting state.

numactl --interleave=all --shmkeyfile /tmp/shmkey Interleave all of the sysv shared memory region specified by /tmp/shmkey over all nodes.

numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.

numactl --localalloc /dev/shm/file Reset the policy for the shared memory file file to the default localalloc policy.

Virtualization and NUMA

NUMA introduces the case where a CPU uses memory belonging to another node, which degrades a guest's memory performance in that situation.

Querying a guest's NUMA settings

Use virsh to query a guest's numatune and vcpu information:

#!/bin/bash
# Periodically report a domain's state, numatune policy and vCPU placement.
DOMAIN=$1
while true ; do
    DOM_STATE=$(virsh list --all | awk '/'$DOMAIN'/ {print $NF}')
    echo "${DOMAIN}: $DOM_STATE"
    virsh numatune $DOMAIN
    # Map each vCPU to its current pCPU; int($NF/8) assumes 8 cores per node.
    virsh vcpuinfo $DOMAIN | awk '/^VCPU:/ {printf "VCPU %d", $NF}
    /^CPU:/ {printf "%s %d %s %d %s\n", " on pCPU:", $NF, " ( part of numa node:", int($NF/8), ")"}'
    sleep 2
done

virsh vcpuinfo shows the mapping between the guest's vCPUs and the physical CPUs.

# virsh vcpuinfo  deepin
VCPU:           0
CPU:            10
State:          running
CPU time:       36962.5s
CPU Affinity:   yyyyyyyyyyyy

VCPU:           1
CPU:            5
State:          running
CPU time:       37943.3s
CPU Affinity:   yyyyyyyyyyyy

VCPU:           2
CPU:            1
State:          running
CPU time:       50309.6s
CPU Affinity:   yyyyyyyyyyyy

VCPU:           3
CPU:            11
State:          running
CPU time:       38696.9s
CPU Affinity:   yyyyyyyyyyyy

Each y in the affinity string marks a host logical CPU that the vCPU is allowed to use.
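The mask can be turned into an explicit CPU list with a small helper (a sketch; mask_to_cpus is a hypothetical name, and each character position stands for one host logical CPU):

```shell
# Convert a virsh "CPU Affinity" string such as "-y-y" into a CPU list.
mask_to_cpus() {
    echo "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++)        # one character per host CPU
            if (substr($0, i, 1) == "y")
                out = out (out == "" ? "" : ",") (i - 1)
        print out
    }'
}

mask_to_cpus "yyyyyyyyyyyy"   # prints 0,1,2,3,4,5,6,7,8,9,10,11
```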

virsh emulatorpin shows which host logical CPUs the guest can use.

# virsh emulatorpin deepin
emulator: CPU Affinity
----------------------------------
       *: 0-11

Consumer desktop CPUs are generally single-node; testing NUMA requires a server CPU.

Setting a guest's NUMA parameters with virsh

Use the virsh numatune command to set a guest's NUMA parameters; they are stored in the <numatune> element of the guest XML. Run as virsh numatune domain with no further arguments, it only displays the guest's current settings. The following options can be added:

  • --mode - The mode can be set to strict, interleave, or preferred. Running domains cannot have their mode changed live unless the guest virtual machine was started in strict mode.
  • --nodeset contains a list of the NUMA nodes the host physical machine uses to run the guest virtual machine. The list contains nodes separated by commas, with a dash - for node ranges and a caret ^ for excluding a node.
  • Only one of the following three flags can be used per invocation:
    • --config takes effect on the next boot of a persistent guest virtual machine.
    • --live sets the scheduler information of a running guest virtual machine.
    • --current affects the current state of the guest virtual machine.

Example: set the NUMA mode of the running guest guest1 to strict, with its nodeset set to nodes 0 and 1:

virsh numatune guest1 --mode strict --nodeset 0,1 --live

Running this command modifies guest1's XML:

<numatune>
        <memory mode='strict' nodeset='0,1'/>
</numatune>

Experiment: use virsh to spread running guests evenly across the nodes

#!/bin/bash
# Distribute running guests across the host's NUMA nodes round-robin.

VMS=$(virsh list | sed '1,2d' | awk '{print $2}')
NUM=$(lscpu | grep -i "NUMA node(s)" | cut -d : -f 2 | tr -d " ")

i=0
for vm in $VMS
do
    i=$((i + 1))
    N=$((i % NUM))
    virsh numatune $vm --mode preferred --nodeset $N --live
done

The command numastat -c qemu shows the memory distribution of each qemu process across the NUMA nodes.

# numastat -c qemu

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
990 (qemu-system     26   4210  4235
6222 (qemu-syste   3706    599  4306
7491 (qemu-syste   1609    554  2164
11412 (qemu-syst   3472    525  3997
20626 (qemu-syst   2224     34  2258
27078 (qemu-syst   1596    572  2168
27780 (qemu-syst   3846    234  4080
27883 (qemu-syst     26   4303  4329
38295 (qemu-syst   3769    401  4170
41111 (qemu-syst   1892      0  1893
42006 (qemu-syst   4246     70  4316
---------------  ------ ------ -----
Total             26412  11503 37915
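As a sanity check, the per-node columns of the table above can be re-summed with awk (a sketch over the captured sample; with live data you would pipe `numastat -c qemu` in instead). Note the per-process numbers are rounded to whole MB, so the recomputed node1 sum is 11502, one off from the printed total:

```shell
# Sum the Node 0 / Node 1 columns of the captured `numastat -c qemu` rows.
totals=$(awk '/qemu/ { node0 += $3; node1 += $4 }
              END   { printf "node0=%dMB node1=%dMB", node0, node1 }' <<'EOF'
990 (qemu-system     26   4210  4235
6222 (qemu-syste   3706    599  4306
7491 (qemu-syste   1609    554  2164
11412 (qemu-syst   3472    525  3997
20626 (qemu-syst   2224     34  2258
27078 (qemu-syst   1596    572  2168
27780 (qemu-syst   3846    234  4080
27883 (qemu-syst     26   4303  4329
38295 (qemu-syst   3769    401  4170
41111 (qemu-syst   1892      0  1893
42006 (qemu-syst   4246     70  4316
EOF
)
echo "$totals"   # node0=26412MB node1=11502MB
```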

Server CPU topology

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 79938 MB
node 0 free: 14577 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 80611 MB
node 1 free: 52045 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
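The node distances at the bottom are the ACPI SLIT values: relative access costs normalized so that local access is 10. For this machine, remote access is nominally 2.1 times the cost of local access:

```shell
# Relative cost of remote vs. local memory access from the distance table above.
ratio=$(awk 'BEGIN { printf "%.1f", 21 / 10 }')
echo "remote/local = $ratio"   # remote/local = 2.1
```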

# numastat
                           node0           node1
numa_hit            202778086242    166320549405    # allocations satisfied on the intended node
numa_miss                      0        34572039    # allocations placed on this node although another node was preferred
numa_foreign            34572039               0    # allocations intended for this node but placed on another
interleave_hit             34013           34184    # interleaved allocations satisfied on this node
local_node          202697418366    166320386633    # allocations made while the process ran on this node
other_node              80667876        34734811    # allocations made while the process ran on another node
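From these counters you can estimate how often allocations land on the wrong node; for node1, numa_miss / (numa_hit + numa_miss) gives the miss rate (a quick check computed from the sample above):

```shell
# Miss rate on node1: allocations placed there although another node was preferred.
miss_pct=$(awk 'BEGIN { printf "%.4f", 100 * 34572039 / (166320549405 + 34572039) }')
echo "node1 miss rate: ${miss_pct}%"   # node1 miss rate: 0.0208%
```

A rate this low means the kernel almost always satisfies allocations on the intended node.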

Linux enables automatic NUMA balancing by default. To turn it off:

echo 0 > /proc/sys/kernel/numa_balancing

To turn it back on: echo 1 > /proc/sys/kernel/numa_balancing

In addition, virsh emulatorpin can force a guest to be scheduled only on a subset of the physical CPUs.

root@node1:/opt/VM2/centos6# virsh emulatorpin centos7 0-31 --live

root@node1:/opt/VM2/centos6# virsh emulatorpin centos7
emulator: CPU Affinity
----------------------------------
       *: 0-31

Pinning vCPUs one-to-one to physical CPUs

root@node1:/opt/VM2/centos6# virsh vcpupin centos7 0 1

root@node1:/opt/VM2/centos6# virsh vcpupin centos7 1 2

root@node1:/opt/VM2/centos6# virsh vcpupin centos7 2 3

root@node1:/opt/VM2/centos6# virsh vcpupin centos7 3 0

root@node1:/opt/VM2/centos6# virsh vcpuinfo centos7
VCPU:           0
CPU:            1
State:          running
CPU time:       619.6s
CPU Affinity:   -y--------------------------------------------------------------

VCPU:           1
CPU:            2
State:          running
CPU time:       196.6s
CPU Affinity:   --y-------------------------------------------------------------

VCPU:           2
CPU:            3
State:          running
CPU time:       306.2s
CPU Affinity:   ---y------------------------------------------------------------

VCPU:           3
CPU:            0
State:          running
CPU time:       872.0s
CPU Affinity:   y---------------------------------------------------------------
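The four vcpupin calls above follow a rotated pattern (vCPU i onto pCPU (i+1) mod 4); a small loop can generate them as a dry run (a sketch; DOMAIN and NVCPU are assumptions, and you would pipe the output to sh to actually apply it):

```shell
# Emit one-to-one vcpupin commands, rotating vCPU i onto pCPU (i+1) mod NVCPU.
DOMAIN=centos7
NVCPU=4
cmds=""
i=0
while [ "$i" -lt "$NVCPU" ]; do
    cmds="${cmds}virsh vcpupin $DOMAIN $i $(( (i + 1) % NVCPU ))
"
    i=$((i + 1))
done
printf '%s' "$cmds"
```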

Setting a guest's NUMA parameters via XML

Setting the vcpu placement via virsh does not seem possible, so configure it directly in the XML.

The command lscpu | grep -i "NUMA node(s)" | cut -d : -f 2 | tr -d " " returns the number of NUMA nodes on the server, and ps aux | grep -i qemu | grep -i monitor.sock | grep -v grep | wc -l counts the guests already running.
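With those two counts, a round-robin choice of the boot node is just the running-VM count modulo the node count (a sketch; the counts are passed as arguments so the logic runs without libvirt):

```shell
# Round-robin node selection: $1 = number of NUMA nodes, $2 = VMs already running.
next_node() {
    echo $(( $2 % $1 ))
}

next_node 2 0   # prints 0: the first VM goes to node 0
next_node 2 5   # prints 1: the sixth VM goes to node 1
```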

Compute the node the guest should start on, then add the following to the guest XML:

<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>

The command lscpu | grep -i "NUMA node" | grep -i node0 | cut -d : -f 2 | tr -d " " returns the CPUs that belong to a node; add them to the XML:

<vcpu placement='static' cpuset='0-7,32-39'>1</vcpu>

With the above, the guest is bound to the chosen node. Note that memory mode='strict' means memory can only be allocated on that node; if the node has insufficient free memory, the guest will fail to start. With vcpu placement='static', the vCPUs are pinned to the given cpuset.

If guests will not all have a fixed core count in the future, for example when ordinary and high-performance guests coexist, NUMA probing is needed: detect how many guests and threads each node is running and start the new guest on the node running the fewest threads. This calls for an automatic NUMA detection mechanism.
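The detection policy above (place the new guest on the node running the fewest threads) reduces to a sort over per-node thread counts. Sample counts stand in for live data here, which could be gathered by walking /proc/<pid>/task for each qemu PID (a sketch):

```shell
# Pick the node with the fewest qemu threads; the two counts are sample data.
least_loaded=$(printf 'node0 37\nnode1 12\n' | sort -k2 -n | head -n1 | awk '{print $1}')
echo "start the next VM on $least_loaded"   # start the next VM on node1
```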

A NUMA optimization scheme for guests

[Figure: NUMA optimization scheme]
For a system that already hosts many guests and has developed a NUMA load imbalance, the following rebalancing approach can be used:
[Figure: NUMA rebalancing scheme]

Questions:

  1. How to get a qemu process's distribution across nodes: numastat -cs qemu
  /proc/*/numa_maps
  /proc/*/sched
  /sys/devices/system/node/node*/meminfo
  /sys/devices/system/node/node*/numastat
  2. Given N nodes [1,2,3,...,N] and M processes [1,2,3,...,M], where process i occupies resources \([\text{cpu}_{i,j},\text{mem}_{i,j}]\) on node j and node j has capacity \([\text{cpu}_j,\text{mem}_j]\), a new process k arrives needing \([\text{cpu}_k,\text{mem}_k]\). Question: which node is the optimal placement, and by what criterion? (Balanced CPU load and balanced memory load across nodes.) CPU resources are natural numbers.
  3. Each process may have multiple threads; should CPU resources be accounted per process or per thread?
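One defensible answer to question 2, sketched below: normalize both dimensions and place process k on the node that minimizes the larger of the two post-placement utilizations (a min-max heuristic; the node capacities and the demand cpu_k=2, mem_k=4 are made-up sample values):

```shell
# Each input line: node  cpu_used  cpu_total  mem_used  mem_total (sample data).
best=$(awk -v cpu_k=2 -v mem_k=4 '
{
    cpu_u = ($2 + cpu_k) / $3            # CPU utilization after placement
    mem_u = ($4 + mem_k) / $5            # memory utilization after placement
    score = (cpu_u > mem_u) ? cpu_u : mem_u
    if (best == "" || score < bestscore) { best = $1; bestscore = score }
}
END { print best }' <<'EOF'
node0 10 24 60 80
node1 4 24 30 80
EOF
)
echo "place process k on $best"   # place process k on node1
```

Here node0 would end at max(0.50, 0.80) = 0.80 while node1 ends at max(0.25, 0.425) = 0.425, so node1 wins.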

Testing

NUMA's impact is mainly on memory, so testing focuses on memory benchmarks:

  1. STREAM, STREAM2: http://www.cs.virginia.edu/stream/
  2. LMbench: http://www.bitmover.com/lmbench/
  3. memtester: http://pyropus.ca/software/memtester/
  4. unixbench
  5. mbw
  6. sysbench

References

Non-uniform memory access
numastat
numactl
numa
The Effect of NUMA Tunings on CPU Performance
amd-epyc-9005-tg-architecture-overview
AMD Optimizes EPYC Memory with NUMA
Non-uniform memory access (NUMA)
optimizing-applications-for-numa
NUMA (Non-Uniform Memory Access): An Overview
What is NUMA?
KVM实战:原理、进阶与性能调优

posted @ 2025-01-01 20:16  main_c