
Linux Buffer I/O error on device dm-4, logical block

2016-07-27 16:32  潇湘隐者

The logs of a Linux server (Oracle Linux Server release 5.7) contained a number of errors such as "Buffer I/O error on device dm-4, logical block 0", as shown below:

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 0

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 1

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 2

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 3

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 0

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 0

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 1

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 2

Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 3
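Before deciding whether messages like these matter, it helps to know how many there are and which devices they name. The following is a minimal Python sketch for that, assuming the message format is exactly the one in the excerpt above; the function name count_buffer_io_errors and the sample lines are illustrative only, and in practice the lines would come from the syslog file (commonly /var/log/messages).

```python
import re
from collections import Counter

# Matches the kernel message format seen in the excerpt above (an assumption:
# other kernels or syslog daemons may format the line differently).
BUFFER_IO_RE = re.compile(
    r"kernel: Buffer I/O error on device (?P<device>[^,\s]+), "
    r"logical block (?P<block>\d+)"
)


def count_buffer_io_errors(lines):
    """Return a Counter mapping device name -> number of matching error lines."""
    counts = Counter()
    for line in lines:
        match = BUFFER_IO_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


# Illustrative input; replace with the lines of the real log file.
sample_lines = [
    "Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 0",
    "Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 1",
    "Jul 3 02:33:25 localhost sshd[1234]: an unrelated message",
]

for device, total in count_buffer_io_errors(sample_lines).most_common():
    print(f"{device}: {total}")
```

If a single device accounts for all of the errors, that device is the one to examine first.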


The blog post "Buffer I/O Error in /var/log/messages" gives the following explanation in English:

 

A server using a LUN, which is presented by a storage array through fabric channels, may show buffer I/O errors while the server is booting or commands such as fdisk and vgscan are being run. The access can be a read or write attempt. These messages are sometimes harmless. When using PowerPath, these errors are suppressed. However, in the case where Linux native multipathing is used, there is no automatic provision for filtering these messages.

The errors can occur when using an active/passive storage array, such as the EMC CLARiiON series. These types of SANs contain two storage processors. Each LUN is assigned to only one of the processors at the time of LUN creation. The LUN can receive I/O only via that one processor. The other processor is passive; it acts as a backup, ready to receive I/O if the active controller fails, or if all paths to the LUN via the active controller fail.

Paths to the LUN going via the passive controller are passive paths and will generate I/O errors should I/O be sent over them. At bootup, the kernel's SCSI mid-layer scans all paths to find devices. Thus it will scan both active and passive paths and will generate buffer I/O errors for the passive paths.

This is a normal behavior for Linux native multipath, and the errors do not indicate an array issue. The errors can safely be filtered through the OS logging configuration or the user can avoid access to native devices (as opposed to using /dev/mapper devices). Alternatively, a qualified version of PowerPath may be installed, which will automatically filter these errors.

 

The official Red Hat document "Why do I see I/O errors on a RHEL system using devices from an active/passive storage array?" also covers this:

 

· Storage arrays in a SAN are generally implemented in a redundant manner such that the host can access logical units (LUNs) over one of many different paths. Typically, these operate in one of two different modes: active/active or active/passive. With active/active arrays, I/O can be sent down any path to a LUN and it will be handled by the controller. With active/passive arrays, one controller is considered the primary for each LUN, while the other controller waits and acts as a backup. Some arrays will accept I/O to a LUN on the backup (passive) controller, but this will not be optimized (worse performance). However, other active/passive arrays will not accept I/O to the backup controller for a LUN, so all commands sent to it will result in an I/O error.

· In RHEL, there are a number of commands and utilities that can send I/O to various devices, such as LVM, udev, fdisk, etc., not to mention applications such as databases, web servers, etc. If one of them issues I/O to a passive path on an array that does not accept it, this will cause an I/O error in the logs. The messages are harmless and do not indicate a problem, but they can fill the logs or cause undue concern. Therefore, some may want to try to avoid these errors by preventing applications from accessing passive paths. Typically, filtering in LVM will eliminate the majority of these errors. Also, to reduce the number of errors, avoid commands like 'fdisk -l' that scan all devices. Finally, configuring applications that scan or access multiple devices so that they only access the appropriate active path or the logical multipath device (/dev/mapper/mpath*, /dev/emcpower*, /dev/sddlma*, etc.) can reduce the number of errors.
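To apply this advice, one first has to know what a kernel name such as dm-4 actually refers to. The sketch below is one possible way to look that up in Python. It assumes the kernel exposes the files /sys/block/<dm-N>/dm/name and /sys/block/<dm-N>/dm/uuid, which newer kernels do but older ones may not; when the files are missing the functions simply return None. The function names dm_name and dm_kind are illustrative, and the classification relies on the usual UUID prefixes ("mpath-" for multipath maps, "LVM-" for logical volumes).

```python
from pathlib import Path


def _read_dm_attr(kernel_name, attr, sysfs_root):
    """Read /sys/block/<kernel_name>/dm/<attr>; return None if unavailable."""
    try:
        return (Path(sysfs_root) / kernel_name / "dm" / attr).read_text().strip()
    except OSError:
        return None


def dm_name(kernel_name, sysfs_root="/sys/block"):
    """Return the device-mapper name (as under /dev/mapper) for e.g. 'dm-4'."""
    return _read_dm_attr(kernel_name, "name", sysfs_root)


def dm_kind(kernel_name, sysfs_root="/sys/block"):
    """Classify a dm device as 'multipath', 'lvm' or 'other' from its UUID prefix."""
    uuid = _read_dm_attr(kernel_name, "uuid", sysfs_root)
    if uuid is None:
        return None
    if uuid.startswith("mpath-"):
        return "multipath"
    if uuid.startswith("LVM-"):
        return "lvm"
    return "other"


# Prints None for both values on systems without the sysfs files.
print(dm_name("dm-4"), dm_kind("dm-4"))
```

On systems where these sysfs files are absent, the same mapping can usually be read from the minor numbers of the entries under /dev/mapper.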

 

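So this error message appears to be harmless and can be ignored; it does not mean that anything is wrong with the storage. After checking a large volume of logs, I found that the error shows up only occasionally, and that it appears when the I/O load is high (when an RMAN backup and a PlateSpin replication are running at the same time).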
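A check like this one can be made less manual by grouping the errors by hour and comparing the busy hours against the backup and replication schedule. The following Python sketch does the grouping, assuming the traditional syslog timestamp format shown in the log excerpt (month, day, HH:MM:SS, with no year); the name errors_per_hour and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Traditional syslog prefix followed by the Buffer I/O error message.
# Assumption: timestamps look like "Jul 3 02:33:24" as in the excerpt.
TIMESTAMPED_ERROR_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+kernel: Buffer I/O error on device"
)


def errors_per_hour(lines):
    """Return a Counter keyed by (month, day, hour) for Buffer I/O error lines."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMPED_ERROR_RE.match(line)
        if match:
            key = (match.group("month"), int(match.group("day")), int(match.group("hour")))
            counts[key] += 1
    return counts


# Illustrative input; replace with the lines of the real log file.
hourly_sample = [
    "Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 0",
    "Jul 3 02:33:24 localhost kernel: Buffer I/O error on device dm-4, logical block 1",
    "Jul 3 14:05:10 localhost kernel: Buffer I/O error on device dm-4, logical block 0",
]

for (month, day, hour), total in sorted(errors_per_hour(hourly_sample).items()):
    print(f"{month} {day:2d} {hour:02d}:00  {total}")
```

Hours whose counts line up with the RMAN or PlateSpin windows would support the observation above; errors outside those windows would deserve a closer look.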

 

References:

http://blog.csdn.net/kinges/article/details/40425841

https://access.redhat.com/solutions/18746