I/O wait is a per-CPU performance metric showing time spent idle, when there are threads on the CPU dispatcher queue (in sleep state) that are blocked on disk I/O. This divides CPU idle time into time spent with nothing to do, and time spent blocked on disk I/O. A high rate of I/O wait per CPU shows that the disks may be a bottleneck, leaving the CPU idle while it waits on them.
I/O wait can be a very confusing metric. If another CPU-hungry process comes along, the I/O wait value can drop: the CPUs now have something to do, instead of being idle. However, the same disk I/O is still present and blocking threads, despite the drop in the I/O wait metric. The reverse has sometimes happened when system administrators have upgraded application software and the newer version is more efficient and uses fewer CPU cycles, revealing I/O wait. This can make the system administrator think that the upgrade has caused a disk issue and made performance worse, when in fact disk performance is the same, and CPU performance is improved.
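The effect can be illustrated with a toy accounting model. This is a minimal sketch, not how any particular kernel implements it: the function `account` and the tick-based inputs are hypothetical, and it assumes a single CPU where each clock tick is counted as busy if a thread ran, as I/O wait if the CPU was idle while disk I/O was outstanding, and as idle otherwise.

```python
def account(cpu_busy, io_outstanding):
    """Classify each clock tick for one CPU (hypothetical toy model).

    cpu_busy[i]       -- True if a thread was running on the CPU during tick i
    io_outstanding[i] -- True if some thread was blocked on disk I/O during tick i
    """
    counts = {"busy": 0, "iowait": 0, "idle": 0}
    for busy, io in zip(cpu_busy, io_outstanding):
        if busy:
            counts["busy"] += 1      # CPU had work, so the tick is never I/O wait
        elif io:
            counts["iowait"] += 1    # idle, with a thread blocked on disk I/O
        else:
            counts["idle"] += 1      # idle, with nothing to do
    return counts


# Disk I/O is outstanding for 6 of 10 ticks in both scenarios.
io = [True] * 6 + [False] * 4

# Scenario A: the application uses the CPU only on the last 2 ticks.
app_only = [False] * 8 + [True] * 2

# Scenario B: a CPU-hungry process keeps the CPU busy on every tick.
with_hog = [True] * 10

a = account(app_only, io)
b = account(with_hog, io)
print("without CPU hog:", a)
print("with CPU hog:   ", b)
```

In both scenarios the same six ticks of disk I/O block the application, yet the I/O wait count falls from six to zero once the CPU-hungry process is present.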
There are also some subtle issues with how I/O wait was calculated on Solaris. For the Solaris 10 release, the I/O wait metric was deprecated and hardwired to zero for tools that still needed to display it (for compatibility).
A more reliable metric may be the time that application threads are blocked on disk I/O. This captures the pain endured by application threads caused by disk I/O, regardless of what other work the CPUs may be doing. This metric can be measured using static or dynamic tracing.
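As a rough sketch of how such a metric might be derived once trace data has been collected, the following sums blocked time per thread. The event format (timestamp in milliseconds, thread ID, and a "block" or "wake" marker) and the function name `blocked_time_per_thread` are assumptions made for illustration; they are not the output of any specific tracing tool.

```python
from collections import defaultdict


def blocked_time_per_thread(events):
    """Sum the time each thread spent blocked on disk I/O.

    events: iterable of (timestamp_ms, thread_id, kind) tuples sorted by time,
            where kind is "block" (the thread starts waiting on disk I/O) or
            "wake" (the I/O completed). This format is hypothetical.
    """
    started = {}               # thread_id -> timestamp when blocking began
    totals = defaultdict(int)  # thread_id -> total blocked milliseconds
    for ts, tid, kind in events:
        if kind == "block":
            started[tid] = ts
        elif kind == "wake" and tid in started:
            totals[tid] += ts - started.pop(tid)
    return dict(totals)


# Made-up events for two application threads.
events = [
    (0, 101, "block"),
    (5, 102, "block"),
    (12, 101, "wake"),
    (20, 102, "wake"),
    (30, 101, "block"),
    (34, 101, "wake"),
]
result = blocked_time_per_thread(events)
print(result)
```

Because the totals are attributed to threads rather than CPUs, they do not change when unrelated CPU load comes or goes.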
I/O wait is still a popular metric on Linux systems, and despite its confusing nature, it is used successfully to identify a type of disk bottleneck: disks busy, CPUs idle. One way to interpret it is to treat any I/O wait as a sign of a system bottleneck, and then tune the system to minimize it, even if the I/O is still occurring concurrently with CPU utilization. Concurrent I/O is more likely to be non-blocking I/O, and less likely to cause a direct issue. Nonconcurrent I/O, as identified by I/O wait, is more likely to be application blocking I/O, and a bottleneck.
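On Linux, the per-CPU counters behind this metric are exposed in /proc/stat, where each `cpuN` line lists cumulative times in the order user, nice, system, idle, iowait, irq, softirq, steal. The sketch below computes an I/O wait percentage from two samples of such a line. The helper names and the sample values are made up for illustration, and it assumes a kernel that reports at least the first five fields.

```python
def parse_cpu_line(line):
    """Split a /proc/stat 'cpuN' line into its name and first eight counters."""
    fields = line.split()
    # user, nice, system, idle, iowait, irq, softirq, steal
    # (guest time is already included in user, so later fields are skipped)
    return fields[0], [int(v) for v in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of the interval that the CPU was counted as I/O wait."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


# Two made-up samples of the same CPU, taken some interval apart.
sample1 = "cpu0 1000 0 500 8000 300 0 20 0 0 0"
sample2 = "cpu0 1030 0 510 8010 350 0 20 0 0 0"

name, before = parse_cpu_line(sample1)
_, after = parse_cpu_line(sample2)
pct = iowait_percent(before, after)
print(f"{name}: {pct:.1f}% I/O wait")
```

On a live system, the two samples would come from reading /proc/stat twice with a sleep in between; the result is subject to the interpretation caveats described above.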
Excerpted from Systems Performance: Enterprise and the Cloud