netmap performance
Reposted from http://mnstory.net/2014/11/netmap-performance/
This test runs on Linux with Intel E1000E NICs.
Target device: three network ports. eth0 and eth1 are both Intel 82574L NICs, used for data forwarding; eth2 is the management port, so the box stays reachable remotely while eth0 and eth1 are forwarding traffic.
Traffic is generated with smartflow. (Topology diagram from the original post omitted.)
Compilation
First, enter the LINUX directory:
# cd netmap-src-dir/LINUX
We test the e1000e driver. Following the patch version-number naming rule, copy the matching e1000e driver patch into the patches directory.
For example, my kernel version is 3.10.0:
# uname -a
Linux host-001e67a1aaf9 3.10.0 #128 SMP x86_64 GNU/Linux
3.10.0 thus maps to 31000; from the files under final-patches, the matching patch file covers the version range 30900--99999.
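The mapping appears to be major*10000 + minor*100 + patch. A minimal sketch of that rule (the function name ver2num is mine, and the formula is inferred from the 3.10.0 -> 31000 example, not taken from netmap documentation):

```shell
# ver2num: map a kernel version string to netmap's patch numbering,
# e.g. 3.10.0 -> 3*10000 + 10*100 + 0 = 31000 (inferred rule, not official)
ver2num() {
    IFS=. read -r major minor patch <<EOF
$1
EOF
    echo $(( major * 10000 + minor * 100 + ${patch:-0} ))
}

ver2num 3.10.0   # 31000, inside the diff--e1000e--30900--99999 range
```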
# mkdir patches
Copy the e1000e patch into the patches directory:
# cp final-patches/diff--e1000e--30900--99999 patches/
When building, point KSRC at the kernel source tree of the target machine:
# make clean; make KSRC=/src/VMP4.0/src/linux
This produces netmap_lin.ko and e1000e/e1000e.ko; copy them to the target machine.
Build pkt-gen and the other examples:
# cd ../examples
# make
Copy the resulting binaries to the target machine as well.
Running the tests
On the first run we hit a problem: starting pkt-gen made the kernel panic. The dmesg output was:
[ 3.562540] {1}[Hardware Error]: APEI generic hardware error status
[ 3.563146] {1}[Hardware Error]: severity: 1, fatal
[ 3.563734] {1}[Hardware Error]: section: 0, severity: 1, fatal
[ 3.564326] {1}[Hardware Error]: flags: 0x01
[ 3.564911] {1}[Hardware Error]: primary
[ 3.565516] {1}[Hardware Error]: section_type: PCIe error
[ 3.566105] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 3.566700] {1}[Hardware Error]: version: 1.16
[ 3.567289] {1}[Hardware Error]: command: 0x4010, status: 0x0547
[ 3.567878] {1}[Hardware Error]: device_id: 0000:00:00.0
[ 3.568464] {1}[Hardware Error]: slot: 0
[ 3.569046] {1}[Hardware Error]: secondary_bus: 0x00
[ 3.569652] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x10d3
[ 3.570246] {1}[Hardware Error]: class_code: 000002
[ 3.570856] Kernel panic - not syncing: Fatal hardware error!
Here vendor_id: 0x8086 and device_id: 0x10d3 identify the vendor and model of the failing device; look the device id up with lspci:
# lspci -nn -vv | grep 10d3
03:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
It is exactly the NIC sending the packets. Later, dmesg also showed:
e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:1e:67:a1:aa:f9
e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
e1000e 0000:00:19.0 eth0: MAC: 10, PHY: 11, PBA No: 0100FF-0FF
ACPI Warning: SystemIO range 0x0000000000005000-0x000000000000501f conflicts with OpRegion 0x0000000000005000-0x000000000000500f (\_SB_.PCI0.SBUS.SMBI) (20130517/utaddress-254)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
ahci 0000:00:1f.2: version 3.0
e1000e 0000:03:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
e1000e 0000:03:00.0: irq 46 for MSI/MSI-X
DMAR:[fault reason 05] PTE Write access is not set
dmar: DRHD: handling fault status reg 2
dmar: DMAR:[DMA Write] Request device [03:00.0] fault addr 7285ac000
DMAR:[fault reason 05] PTE Write access is not set
irq 48: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O-------------- 3.10.0 #128
Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012
ffff8807053f5e84 ffff88081e803e28 ffffffff81653b0a ffff88081e803e58
ffffffff810dba9d 0000000000013080 ffff8807053f5e00 0000000000000030
0000000000000000 ffff88081e803ea8 ffffffff810dbf41 00000030669e0b7c
Call Trace:
[] dump_stack+0x19/0x1b
[] __report_bad_irq+0x3d/0xe0
[] note_interrupt+0x1b1/0x200
[] ? cpuidle_enter_state+0x5b/0xe0
[] handle_irq_event_percpu+0xa2/0x1e0
[] handle_irq_event+0x42/0x70
[] handle_edge_irq+0x6f/0x110
[] handle_irq+0x22/0x40
[] do_IRQ+0x5e/0x110
[] common_interrupt+0x6a/0x6a
[] ? cpuidle_enter_state+0x5b/0xe0
[] ? cpuidle_enter_state+0x57/0xe0
[] cpuidle_idle_call+0xbb/0x200
[] arch_cpu_idle+0xe/0x30
[] cpu_startup_entry+0x9a/0x220
[] rest_init+0x77/0x80
[] start_kernel+0x44e/0x45b
[] ? repair_env_string+0x5e/0x5e
[] x86_64_start_reservations+0x2a/0x2c
[] x86_64_start_kernel+0xfd/0x101
handlers:
[] e1000_msix_other [e1000e]
Disabling IRQ #48
The messages above suggest the problem is DMA-related, so check the kernel build options:
# cat .config | grep INTEL_IOMMU
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
The IOMMU is enabled by default, so disable it via the grub boot entry:
# cat /proc/cmdline
BOOT_IMAGE=/firmware/current/package/files/vm crashkernel=400M loglevel=3 elevator=deadline softlockup_panic=1 reboot=force nohz=off intel_iommu=off
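For reference, on a stock grub2 distribution the flag would typically be added like this (this box uses its own boot setup, and paths and commands vary by distro, so this is only an illustrative sketch):

```shell
# Append intel_iommu=off to the kernel command line in /etc/default/grub:
#   GRUB_CMDLINE_LINUX="... intel_iommu=off"
# then regenerate the grub config and reboot:
grub-mkconfig -o /boot/grub/grub.cfg    # or: update-grub (Debian/Ubuntu)
```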
Native Linux test
When testing native Linux forwarding, one problem was that ip_forward was off, so layer-3 traffic did not get through: packets arrived on eth0 but never left through eth1:
# pps.sh eth0 eth1
02:25:03 IFN        RXB       TXB      RXP     TXP
eth0           18661696       507   291589       5
eth1                  0         0        0       0
ip_forward was indeed disabled; enable it:
# cat /proc/sys/net/ipv4/ip_forward
0
# echo 1 > /proc/sys/net/ipv4/ip_forward
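The echo above only lasts until reboot; to persist it on a sysctl.conf-based distro (a standard recipe, not from the original post):

```shell
# runtime toggle, equivalent to echoing into /proc:
sysctl -w net.ipv4.ip_forward=1
# persist across reboots:
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p    # reload the file now
```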
After that, traffic flows:
# pps.sh eth0 eth1
02:27:51 IFN        RXB       TXB      RXP     TXP
eth0           27094848        95   423357       1
eth1                  0  27058240        0  422789
Netmap vale test
When initializing the environment, bind the NIC interrupts to CPU 1 so they are not disturbed too much on CPU 0. This is wrapped in a script for easy reuse:
lcmdrun()
{
cmd="$*"
echo "$cmd"
eval "$cmd"
}
#bind this NIC's interrupts to the given CPU
w.eth.irqaffinity()
{
eth="$1"
affinity="$2"
if [ "$eth" = "" ]; then
lerror "w.eth.irqaffinity \$eth [\$affinity]";
return;
fi
#merge the multi-line result into one line
lines=`cat /proc/interrupts | grep $eth | awk -F: '{print $2}'`
#lines=' 47
#48
#49'
SAVE_IFS=$IFS
IFS=$'\n'
#if we set IFS=$'\n' and pass lines as args, it become => trim ' 47' ' 48' ' 49'
#if we set IFS=$' ' and pass lines as args, it become =>
#trim '47
#' '48
#' 49
lines=$(trim $lines)
#trim return lines='47 48 49'
IFS=$' '
#use $lines, not "$lines": "$lines" is not split on spaces, but $lines is
for i in $lines; do
i=$(trim $i)
if [ "$affinity" == "" ]; then
#just echo
cmd="cat /proc/irq/$i/smp_affinity"
else
cmd="echo \"$affinity\" > /proc/irq/$i/smp_affinity"
fi
lcmdrun "$cmd"
done
IFS=$SAVE_IFS
}
w.nm.stop()
{
nic=$1
shift
ifs=( $@ )
ifs_nr=$#
#stop interface
for ((i=0; i < $ifs_nr; i++))
do
lcmdrun "ifconfig ${ifs[$i]} down"
done
#remove old modules
lcmdrun rmmod "${nic}.ko"
lcmdrun rmmod "netmap_lin.ko"
}
w.nm.start()
{
nic=$1
shift
ifs=( $@ )
if [ ! -f "${nic}.ko" ]; then
echo "can't find ${nic}.ko"
return 1
fi
if [ ! -f "netmap_lin.ko" ]; then
echo "can't find netmap_lin.ko"
return 1
fi
#clear dmesg
dmesg -c > /dev/null
#set ring slot, set before netmap attach when eth is down
for eth in "${ifs[@]}"; do
lcmdrun "ifconfig $eth down 2>/dev/null"
done
#insert new modules
lcmdrun insmod "netmap_lin.ko"
lcmdrun insmod "${nic}.ko"
for eth in "${ifs[@]}"; do
#"ethtool -G $eth rx 4096 tx 4096"
#start interface
lcmdrun "ifconfig $eth up"
#set irq affinity to cpu1
lcmdrun "w.eth.irqaffinity $eth 2"
#set promisc mode, important
lcmdrun "ifconfig $eth promisc"
#ethtool -K $eth tso off
#ethtool -K $eth gso off
done
#show log
lcmdrun dmesg
lcmdrun "lsmod | grep \"${nic}\""
}
w.nm.restart()
{
w.nm.stop $*
w.nm.start $*
}
w.nm.restart.e1000e()
{
w.nm.restart "e1000e" "eth0" "eth1"
}
Calling w.nm.restart.e1000e is enough to initialize the netmap environment. Once it is initialized, set up the virtual bridge:
# vale-ctl -h vale23:eth0
# vale-ctl -h vale23:eth1
# vale-ctl -l
bdg_ctl [98] bridge:0 port:0 vale23:eth0
bdg_ctl [98] bridge:0 port:1 eth0
bdg_ctl [98] bridge:0 port:2 vale23:eth1
bdg_ctl [98] bridge:0 port:3 eth1
|
At this point eth0 and eth1 are attached to the same bridge, so packets can be forwarded from eth0 to eth1 and vice versa.
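Without external test gear, the pkt-gen example built earlier can also drive a vale switch directly; a rough sketch (the port names gen/sink are arbitrary, and flag spelling follows the pkt-gen in the netmap examples tree, which may differ across versions):

```shell
# attach two extra virtual ports to the same vale switch: one side
# transmits 64-byte frames, the other counts what it receives
./pkt-gen -i vale23:gen -f tx -l 64 &
./pkt-gen -i vale23:sink -f rx
```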
Netmap bridge test
The test above forwards through vale. Netmap can also be used socket-style, polling a descriptor to receive and forward data yourself; it ships example code for this, bridge, which we use for the next test.
At first bridge could not send or receive at all; it turned out promiscuous mode had not been enabled on the ports. Taking eth0 as an example, check with ifconfig eth0 | grep PROMISC; if it is not set, enable it with ifconfig eth0 promisc.
The bridge test itself is simple, just run:
# ./bridge -i netmap:eth0 -i netmap:eth1
Of course, for better performance, also pin this process to CPU 1 (where the NIC interrupts are delivered) and give it real-time priority:
pidexpend()
{
ip=$1;
if [ "$ip" == "" ] ; then
return;
fi
if [ "$ip" -eq "$ip" ] 2>/dev/null; then
echo $ip;
return;
fi
#not digit
pidof $ip | awk '{print $1}'
}
w.cpu.affinity.show()
{
pid=$(pidexpend $1)
if [ "$pid" = "" ]; then
echo "w.cpu.affinity.show (pid|progname)"
return 1
fi
lcmdrun "taskset -p $pid"
lcmdrun "chrt -p $pid"
return 0
}
w.cpu.affinity()
{
#set eth affinity to cpu 1, hex 0x2
pid=$(pidexpend $1)
cpumask=2
if [ "$2" != "" ]; then
cpumask=$2
fi
realtime=90
if [ "$3" != "" ]; then
realtime=$3
fi
if [ "$pid" = "" ]; then
echo "w.cpu.affinity (pid|progname) [\$cpumask=$cpumask] [\$realtime=$realtime]"
return 1
fi
if [ "$cpumask" -eq "$cpumask" ] 2>/dev/null; then
lcmdrun "taskset -p $cpumask $pid"
fi
if [ "$realtime" -eq "$realtime" ] 2>/dev/null; then
lcmdrun "chrt -p -r $realtime $pid"
fi
return 0
}
Once bridge is running, call w.cpu.affinity bridge to set its CPU affinity and real-time priority; then the testing can begin.
Test results
- netmap bridge and netmap vale perform about the same.
- With 64-byte packets, netmap reaches 1.21 Mpps versus 0.83 Mpps for native Linux, a 45% improvement. The official figure for a 10 GbE NIC is 14.88 Mpps, which scales to about 1.488 Mpps on gigabit, 22% above my measurement; I do not know how they tested.
- As the frame size grows, the difference disappears because the wire itself becomes the bottleneck; except with 64-byte frames, CPU utilization never reaches 100%.
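The 14.88 Mpps / 1.488 Mpps figures are simply Ethernet line rate for minimum-size frames: each 64-byte frame occupies 64 + 8 (preamble/SFD) + 12 (inter-frame gap) = 84 bytes on the wire. A quick check of that arithmetic:

```shell
# bytes a minimum-size frame occupies on the wire
wire_bytes=$(( 64 + 8 + 12 ))                 # 84
# gigabit line rate in packets per second: 10^9 bit/s over 84*8 bits/frame
gbe_pps=$(( 1000000000 / (wire_bytes * 8) ))
echo "$gbe_pps"    # 1488095, i.e. ~1.488 Mpps; x10 gives 14.88 Mpps on 10 GbE
```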


