【linux】网卡overruns报错问题原因及解决方案
环境信息:
- cpu:40c
- 操作系统:ceontos6.7
- 部署服务:DataNode、NodeManager、Impala服务。
一、前言:
之前发生过某台节点网卡报错,影响结果 presto任务失败、HDFS读取变慢、Yarn任务执行变慢。
于是后续对net.if.total.errors这个指标统一加上了监控,过了一段时间后,在别的节点也收到了类似的报警。

于是想到还是之前的错误,于是让OP同学帮忙重新切换了网卡,切换网卡后一段时间确实没有收到告警了。但是过段时间(4-5小时)又收到了新告警。
于是将运行的NodeManager、Impala的服务停止到,只保留了DataNode服务。
结果网卡报错的情况依旧没有缓解。仍然收到大量的告警。通过观察监控,发现该节点的数据写入的平均时间较别的节点时间较长。
二、调研:
查看网卡信息
在使用ifconfig命令查看网卡信息时,对于收发包的统计里有dropped与overruns两个字段,看上去都是丢包,但它们有什么区别呢?
ifconfig -a
查看结果:

具体解释如下:
dropped,表示这个数据包已经进入到网卡的接收缓存fifo队列,并且开始被系统中断处理准备进行数据包拷贝(从网卡缓存fifo队列拷贝到系统内存),但由于此时的系统原因(比如内存不够等)导致这个数据包被丢掉,即这个数据包被Linux系统丢掉。overruns,表示这个数据包还没有被进入到网卡的接收缓存fifo队列就被丢掉,因此此时网卡的fifo是满的。为什么fifo会是满的?因为系统繁忙,来不及响应网卡中断,导致网卡里的数据包没有及时的拷贝到系统内存,fifo是满的就导致后面的数据包进不来,即这个数据包被网卡硬件丢掉。
所以,个人觉得遇到overruns非0,需要检测cpu负载与cpu中断情况。
查看cpu 软中断
下图选中的为优化前和优化后的对比。

通过查看Cpu idle,发现会有个别的cpu 被打满的现象 观察软中断也相较别的节点较高。于是决定对网卡软中断绑定优化。
三、多队列网卡
多队列网卡顾名思义就是由原来的单网卡单队列变成了现在的单网卡多队列。多队列网卡是一种技术,最初是用来解决网络IO QoS (quality of service)问题的,后来随着网络IO的带宽的不断提升,单核CPU不能完全处满足网卡的需求,体现最为明显的就是单核CPU处理不了网卡大量的数据包请求(软中断)而造成大量丢包,
其实当网卡收到数据包时会产生中断,通知内核有新数据包,然后内核调用中断处理程序进行响应,把数据包从网卡缓存拷贝到内存,因为网卡缓存大小有限,如果不及时拷出数据,后续数据包将会因为缓存溢出被丢弃,因此这一工作需要立即完成。
剩下的处理和操作数据包的工作就会交给软中断,高负载的网卡是软中断产生的大户,很容易形成瓶颈。但通过多队列网卡驱动的支持,将各个队列通过中断绑定到不同的CPU核上,以满足网卡的需求,这就是多队列网卡的应用。
网卡软中断不平衡:
- 集中在一个CPU核心上(mpstat 查看%soft集中,通常是cpu0)。
- 网卡的硬件中断队列不够, < CPU 核心数,无法一对一绑定,导致部分CPU核心%soft 较少,CPU使用不均衡。
所以如果网络流量大的时候,就需要将软中断均匀的分散到各个核上,避免单个CPU core成为瓶颈。
四、SMP IRQ affinity
Linux 2.4内核之后引入了将特定中断绑定到指定的CPU的技术,称为SMP IRQ affinity.
原理:
当一个硬件(如磁盘控制器或者以太网卡), 需要打断CPU的工作时, 它就触发一个中断. 该中断通知CPU发生了某些事情并且CPU应该放下当前的工作去处理这个事情. 为了防止多个设置发送相同的中断, Linux设计了一套中断请求系统, 使得计算机系统中的每个设备被分配了各自的中断号, 以确保它的中断请求的唯一性.
从2.4 内核开始, Linux改进了分配特定中断到指定的处理器(或处理器组)的功能. 这被称为SMP IRQ affinity, 它可以控制系统如何响应各种硬件事件. 允许你限制或者重新分配服务器的工作负载, 从而让服务器更有效的工作. 以网卡中断为例,在没有设置SMP IRQ affinity时, 所有网卡中断都关联到CPU0, 这导致了CPU0负载过高,而无法有效快速的处理网络数据包,导致了瓶颈。
通过SMP IRQ affinity, 把网卡多个中断分配到多个CPU上,可以分散CPU压力,提高数据处理速度。
使用前提:
- 需要多CPU的系统
- 需要大于等于2.4的Linux 内核
五、解决方法:
脚本内容:
#!/bin/bash
#
# Copyright (c) 2014, Intel Corporation
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of Intel Corporation nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Affinitize interrupts to cores
#
# typical usage is (as root):
# set_irq_affinity -x local eth1 <eth2> <eth3>
#
# to get help:
# set_irq_affinity
usage()
{
echo
echo "Usage: $0 [-x|-X] {all|local|remote|one|custom} [ethX] <[ethY]>"
echo " options: -x Configure XPS as well as smp_affinity"
echo " options: -X Disable XPS but set smp_affinity"
echo " options: {remote|one} can be followed by a specific node number"
echo " Ex: $0 local eth0"
echo " Ex: $0 remote 1 eth0"
echo " Ex: $0 custom eth0 eth1"
echo " Ex: $0 0-7,16-23 eth0"
echo
exit 1
}
usageX()
{
echo "options -x and -X cannot both be specified, pick one"
exit 1
}
if [ "$1" == "-x" ]; then
XPS_ENA=1
shift
fi
if [ "$1" == "-X" ]; then
if [ -n "$XPS_ENA" ]; then
usageX
fi
XPS_DIS=2
shift
fi
if [ "$1" == -x ]; then
usageX
fi
if [ -n "$XPS_ENA" ] && [ -n "$XPS_DIS" ]; then
usageX
fi
if [ -z "$XPS_ENA" ]; then
XPS_ENA=$XPS_DIS
fi
num='^[0-9]+$'
# Vars
AFF=$1
shift
case "$AFF" in
remote) [[ $1 =~ $num ]] && rnode=$1 && shift ;;
one) [[ $1 =~ $num ]] && cnt=$1 && shift ;;
all) ;;
local) ;;
custom) ;;
[0-9]*) ;;
-h|--help) usage ;;
"") usage ;;
*) IFACES=$AFF && AFF=all ;; # Backwards compat mode
esac
# append the interfaces listed to the string with spaces
while [ "$#" -ne "0" ] ; do
IFACES+=" $1"
shift
done
# for now the user must specify interfaces
if [ -z "$IFACES" ]; then
usage
exit 1
fi
# support functions
set_affinity()
{
VEC=$core
if [ $VEC -ge 32 ]
then
MASK_FILL=""
MASK_ZERO="00000000"
let "IDX = $VEC / 32"
for ((i=1; i<=$IDX;i++))
do
MASK_FILL="${MASK_FILL},${MASK_ZERO}"
done
let "VEC -= 32 * $IDX"
MASK_TMP=$((1<<$VEC))
MASK=$(printf "%X%s" $MASK_TMP $MASK_FILL)
else
MASK_TMP=$((1<<$VEC))
MASK=$(printf "%X" $MASK_TMP)
fi
printf "%s" $MASK > /proc/irq/$IRQ/smp_affinity
printf "%s %d %s -> /proc/irq/$IRQ/smp_affinity\n" $IFACE $core $MASK
case "$XPS_ENA" in
1)
printf "%s %d %s -> /sys/class/net/%s/queues/tx-%d/xps_cpus\n" $IFACE $core $MASK $IFACE $((n-1))
printf "%s" $MASK > /sys/class/net/$IFACE/queues/tx-$((n-1))/xps_cpus
;;
2)
MASK=0
printf "%s %d %s -> /sys/class/net/%s/queues/tx-%d/xps_cpus\n" $IFACE $core $MASK $IFACE $((n-1))
printf "%s" $MASK > /sys/class/net/$IFACE/queues/tx-$((n-1))/xps_cpus
;;
*)
esac
}
# Allow usage of , or -
#
parse_range () {
RANGE=${@//,/ }
RANGE=${RANGE//-/..}
LIST=""
for r in $RANGE; do
# eval lets us use vars in {#..#} range
[[ $r =~ '..' ]] && r="$(eval echo {$r})"
LIST+=" $r"
done
echo $LIST
}
# Affinitize interrupts
#
setaff()
{
CORES=$(parse_range $CORES)
ncores=$(echo $CORES | wc -w)
n=1
# this script only supports interrupt vectors in pairs,
# modification would be required to support a single Tx or Rx queue
# per interrupt vector
queues="${IFACE}-.*TxRx"
irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
done)
[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"
echo "IFACE CORE MASK -> FILE"
echo "======================="
for IRQ in $irqs; do
[ "$n" -gt "$ncores" ] && n=1
j=1
# much faster than calling cut for each
for i in $CORES; do
[ $((j++)) -ge $n ] && break
done
core=$i
set_affinity
((n++))
done
}
# now the actual useful bits of code
# these next 2 lines would allow script to auto-determine interfaces
#[ -z "$IFACES" ] && IFACES=$(ls /sys/class/net)
#[ -z "$IFACES" ] && echo "Error: No interfaces up" && exit 1
# echo IFACES is $IFACES
CORES=$(</sys/devices/system/cpu/online)
[ "$CORES" ] || CORES=$(grep ^proc /proc/cpuinfo | cut -f2 -d:)
# Core list for each node from sysfs
node_dir=/sys/devices/system/node
for i in $(ls -d $node_dir/node*); do
i=${i/*node/}
corelist[$i]=$(<$node_dir/node${i}/cpulist)
done
for IFACE in $IFACES; do
# echo $IFACE being modified
dev_dir=/sys/class/net/$IFACE/device
[ -e $dev_dir/numa_node ] && node=$(<$dev_dir/numa_node)
[ "$node" ] && [ "$node" -gt 0 ] || node=0
case "$AFF" in
local)
CORES=${corelist[$node]}
;;
remote)
[ "$rnode" ] || { [ $node -eq 0 ] && rnode=1 || rnode=0; }
CORES=${corelist[$rnode]}
;;
one)
[ -n "$cnt" ] || cnt=0
CORES=$cnt
;;
all)
CORES=$CORES
;;
custom)
echo -n "Input cores for $IFACE (ex. 0-7,15-23): "
read CORES
;;
[0-9]*)
CORES=$AFF
;;
*)
usage
exit 1
;;
esac
# call the worker function
setaff
done
# check for irqbalance running
IRQBALANCE_ON=`ps ax | grep -v grep | grep -q irqbalance; echo $?`
if [ "$IRQBALANCE_ON" == "0" ] ; then
echo " WARNING: irqbalance is running and will"
echo " likely override this script's affinitization."
echo " Please stop the irqbalance service and/or execute"
echo " 'killall irqbalance'"
fi
执行脚本
sh set_irq.sh local eth0

浙公网安备 33010602011771号