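Softlockup & Hardlockup Detection Mechanisms
Preface
Linux has several built-in anomaly detection mechanisms, and softlockup and hardlockup are two typical ones. Softlockup detects the case where the kernel goes a long time without scheduling any other task. Hardlockup goes one step further and detects the case where the kernel goes a long time without responding to interrupts. The two are defined as follows: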
A 'softlockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run.
A 'hardlockup' is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds, without letting other interrupts have a chance to run.
The two mechanisms are similar, so they share one design. Their detection targets differ, though, so the implementations differ in places.
watchdog
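A watchdog is a common keep-alive technique: a task runs periodically and checks whether some value has been updated. The checking step is called watching the dog, and the act of updating the value is called touching the dog.
Softlockup and hardlockup detection works per CPU, so every CPU core has two dogs, one for softlockup and one for hardlockup.
The softlockup dog is watchdog_touch_ts, which records the timestamp of the last touch. The hardlockup dog is hrtimer_interrupts, which counts how many hrtimer (high-resolution timer) interrupts have occurred.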
static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
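The touch/watch split can be modeled in a few lines of user-space C. This is only a sketch: struct dog, dog_touch and dog_watch are hypothetical names that do not exist in the kernel, and time is assumed to be a monotonic count of seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of a watchdog; not kernel code. */
struct dog {
	uint64_t last_touch;  /* time of the most recent touch, in seconds */
	uint64_t timeout;     /* longest tolerated silence, in seconds */
};

/* "touch dog": the monitored party proves it is still alive. */
static void dog_touch(struct dog *d, uint64_t now)
{
	d->last_touch = now;
}

/* "watch dog": the checker reports true once the dog has gone unfed too long. */
static bool dog_watch(const struct dog *d, uint64_t now)
{
	assert(now >= d->last_touch);  /* time is assumed to be monotonic */
	return now - d->last_touch > d->timeout;
}
```

The watcher only works if it runs in a context that the monitored party cannot block, which is why the kernel places its two watchers in interrupt and NMI context.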
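Three kinds of code can execute in the kernel. From highest to lowest priority they are NMI handlers, normal interrupt handlers, and tasks. In essence, softlockup checks whether scheduling between tasks still happens while NMIs and normal interrupts are being serviced, and hardlockup checks whether normal interrupts are still serviced and their handlers run while NMIs are being serviced.
Note: as a non-maskable interrupt, an NMI is still delivered when normal interrupts are masked, which is what guarantees that the check itself can run.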
softlockup
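To meet this goal, softlockup needs a kernel thread that touches the dog (updates watchdog_touch_ts), and that thread must be kicked each time the softlockup check runs. It also needs a periodic timer that checks whether the distance between watchdog_touch_ts and now exceeds a threshold; if it does, a softlockup is reported. The default timeout softlockup_thresh is 20s (2 * watchdog_thresh). The check is implemented in is_softlockup (kernel/watchdog.c):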
static int is_softlockup(unsigned long touch_ts)
{
	unsigned long now = get_timestamp();
	if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh) {
		/* Warn about unreasonable delays. */
		if (time_after(now, touch_ts + get_softlockup_thresh()))
			return now - touch_ts;
	}
	return 0;
}
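The comparison goes through time_after rather than a plain `>` so that it stays correct when the unsigned timestamp wraps around. The sketch below reproduces that comparison in user space. time_after_sketch is a simplified stand-in for the kernel macro (without its type checks), and softlockup_duration is a hypothetical helper that takes now and the threshold as parameters instead of reading kernel state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-in for time_after(a, b): true if a is later than b.
 * The signed cast makes the test wraparound-safe; it relies on the usual
 * two's-complement conversion from unsigned long to long.
 */
static bool time_after_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Same shape as is_softlockup(): returns the stall length, or 0 if none. */
static unsigned long softlockup_duration(unsigned long now,
					 unsigned long touch_ts,
					 unsigned long thresh)
{
	if (time_after_sketch(now, touch_ts + thresh))
		return now - touch_ts;
	return 0;
}

/* A timestamp of 1 counts as "after" the largest unsigned long value. */
static void time_after_selfcheck(void)
{
	assert(time_after_sketch(1, (unsigned long)-1));
	assert(!time_after_sketch((unsigned long)-1, 1));
}
```

With a threshold of 20, a gap of exactly 20 is not yet a lockup, because time_after is a strict comparison.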
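For softlockup detection to be meaningful, the task that updates watchdog_touch_ts must have the highest task priority; otherwise a low-priority task might fail to update the timestamp in time even though scheduling is working normally. In older kernels the updating task was [watchdog/x]. After the STOP scheduling class (higher priority than real-time tasks) was introduced, the updating thread became [migration/x].
The migration thread is the highest-priority thread in the kernel and handles work such as CPU hotplug and stopping CPUs. It manages a work_queue: when work is queued, migration becomes RUNNABLE and waits to be scheduled, and as soon as scheduling happens it is guaranteed to get the CPU and update watchdog_touch_ts. That is what makes the softlockup check valid.
The check itself has to run at a higher priority still, in interrupt context. The kernel's hrtimer can raise an interrupt periodically, and its handler watchdog_timer_fn checks whether [migration/x] has been updating watchdog_touch_ts. The hrtimer period is softlockup_thresh / 5 (4s by default).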
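The overall flow of the softlockup check is as follows:
- The hrtimer fires periodically and runs the interrupt handler watchdog_timer_fn:
  - Insert the task softlockup_fn into the work_queue
  - Check whether watchdog_touch_ts is abnormal
  - Wait for the next trigger
- The migration thread:
  - Is woken by the work_queue
  - Checks the queue, takes out softlockup_fn and runs it
  - Updates watchdog_touch_ts
  - Finds the work_queue empty and goes back to sleep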
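If the migration thread sits in the queue for a long time without being scheduled (because a task on that core has been occupying the CPU for a long time), a softlockup has occurred and the state of the CPU needs to be dumped.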
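The interplay between the 4s hrtimer tick and the 20s threshold can be checked with a small user-space simulation. This is a sketch under simplifying assumptions: time advances in whole seconds, the migration thread runs immediately after the interrupt unless the CPU is hogged, and simulate_softlockup and its parameters are hypothetical names.

```c
#include <assert.h>

#define WATCHDOG_THRESH   10                      /* seconds, default */
#define SOFTLOCKUP_THRESH (2 * WATCHDOG_THRESH)   /* 20 s */
#define SAMPLE_PERIOD     (SOFTLOCKUP_THRESH / 5) /* 4 s hrtimer period */

/*
 * Simulate one CPU from t = 0 up to t = end (seconds).
 * The migration thread cannot be scheduled in [hog_start, hog_end).
 * Returns the time at which the hrtimer reports a softlockup, or -1.
 */
static long simulate_softlockup(long hog_start, long hog_end, long end)
{
	long touch_ts = 0;

	assert(hog_start <= hog_end);
	for (long now = SAMPLE_PERIOD; now <= end; now += SAMPLE_PERIOD) {
		/* hrtimer handler: check the timestamp left by earlier kicks */
		if (now - touch_ts > SOFTLOCKUP_THRESH)
			return now;
		/* the kicked migration thread runs only if the CPU is not hogged */
		if (!(now >= hog_start && now < hog_end))
			touch_ts = now;
	}
	return -1;
}
```

Under these assumptions, a hog that starts at t = 8 and never ends is reported at t = 28: the last touch happened at t = 4, and t = 28 is the first tick more than 20s after it.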
hardlockup
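The hardlockup detection mechanism resembles softlockup but has a different target: it detects normal interrupts going unserviced for a long time. The check is implemented in is_hardlockup in kernel/watchdog.c, which tests whether hrtimer_interrupts is still increasing; if it has not increased, a hardlockup is reported.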
/* watchdog detector functions */
bool is_hardlockup(void)
{
	unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
	if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
		return true;
	__this_cpu_write(hrtimer_interrupts_saved, hrint);
	return false;
}
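The counter comparison can be exercised outside the kernel with a single-CPU model. This is a sketch: struct hard_dog and its helpers are hypothetical names, and the per-CPU accessors are replaced by plain struct fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model of the hardlockup dog; not kernel code. */
struct hard_dog {
	unsigned long hrtimer_interrupts;        /* bumped by the hrtimer */
	unsigned long hrtimer_interrupts_saved;  /* value seen by the last NMI */
};

/* touch: what the hrtimer handler does on every tick */
static void hard_dog_touch(struct hard_dog *d)
{
	d->hrtimer_interrupts++;
}

/* watch: the same comparison as is_hardlockup(), on the model struct */
static bool hard_dog_watch(struct hard_dog *d)
{
	unsigned long hrint = d->hrtimer_interrupts;

	if (d->hrtimer_interrupts_saved == hrint)
		return true;   /* no tick since the previous NMI */
	d->hrtimer_interrupts_saved = hrint;
	return false;
}

/* NMIs with ticks in between are fine; an NMI with no tick is a lockup. */
static void hard_dog_demo(void)
{
	struct hard_dog d = { 0, 0 };

	hard_dog_touch(&d);
	assert(!hard_dog_watch(&d));
	hard_dog_touch(&d);
	hard_dog_touch(&d);
	assert(!hard_dog_watch(&d));
	assert(hard_dog_watch(&d));
}
```

The check does not care how many ticks happened between two NMIs, only that there was at least one.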
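The default hardlockup timeout watchdog_thresh is 10s, half of the softlockup timeout. Unlike softlockup, hrtimer_interrupts carries no timestamp, so how is the timeout judged?
Linux uses a periodic NMI. It is built on the cycles event of the perf subsystem: a perf counter can be given an overflow threshold, an NMI is raised once the event count reaches that threshold, and cycles map to time in a known way. See watchdog_nmi_enable in kernel/watchdog.c for details. Following the call chain leads to hardlockup_detector_event_create (in kernel/watchdog_hld.c), which calls hw_nmi_get_sample_period (in arch/x86/kernel/apic/hw_nmi.c). This architecture-specific function returns the number of cycles after which the counter overflows, chosen so that the NMI fires once every watchdog_thresh seconds.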
u64 hw_nmi_get_sample_period(int watchdog_thresh)
{
	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
}
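A worked example makes the unit conversion concrete. The sketch below repeats the same arithmetic in user space, with cpu_khz passed in as a parameter, and adds a hypothetical helper nmi_period_ms that converts the cycle budget back into wall-clock time for a core that actually runs at a frequency other than the nominal one. The 2.4 GHz and 3.6 GHz figures are made-up illustration values.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as hw_nmi_get_sample_period(), with cpu_khz passed in. */
static uint64_t nmi_sample_cycles(uint64_t cpu_khz, int watchdog_thresh)
{
	return cpu_khz * 1000 * (uint64_t)watchdog_thresh;
}

/*
 * Milliseconds until the cycle counter overflows when the core really
 * runs at actual_khz: cycles / (kHz) = milliseconds.
 */
static uint64_t nmi_period_ms(uint64_t cycles, uint64_t actual_khz)
{
	assert(actual_khz > 0);
	return cycles / actual_khz;
}
```

For a nominal 2.4 GHz CPU (cpu_khz = 2400000) and watchdog_thresh = 10, the budget is 24,000,000,000 cycles, which is 10s at the nominal frequency. If the core runs at 3.6 GHz instead, the same budget is used up in roughly 6.7s.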
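The periodic NMI runs a callback that does the watching (checks whether hrtimer_interrupts has increased), while the hrtimer does the periodic touching (increments hrtimer_interrupts).
hardlockup and softlockup meet in the hrtimer, so the hrtimer handler both watches watchdog_touch_ts for the softlockup check and touches hrtimer_interrupts to record another timer interrupt.
NOTE: updated 2024-03-15
The hardlockup timeout relies on the period of the cycles NMI. On CPUs with turbo mode, however, deriving time from a cycle count is inaccurate: the NMI period shrinks and false positives follow. Turbo mode here means the CPU adjusts its frequency and shuts down cores automatically. For example, when a single-threaded program runs on a quad-core processor, three cores are shut down and the frequency of the running core is raised, which improves performance while reducing power consumption. This causes two problems. First, the changing frequency makes the cycles-based NMI period inaccurate. Second, the clock of a stopped CPU is not updated. For this scenario the kernel provides the config option CONFIG_HARDLOCKUP_CHECK_TIMESTAMP. With it enabled, the NMI callback checks a timestamp and performs the hardlockup check only if 4/5 * watchdog_thresh has passed since the previous check, which guarantees at least one hrtimer_interrupts update in between. In addition, if ktime is based on jiffies (updated once per timer interrupt), jiffies does not advance on a stopped CPU, so the counter nmi_rearmed is used instead: once 10 consecutive NMIs have been skipped, the check runs anyway. The feature is implemented by the following code: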
#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
static DEFINE_PER_CPU(ktime_t, last_timestamp);
static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
void watchdog_update_hrtimer_threshold(u64 period)
{
	/*
	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
	 *
	 * So it runs effectively with 2.5 times the rate of the NMI
	 * watchdog. That means the hrtimer should fire 2-3 times before
	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
	 * might run way faster than expected and the NMI fires in a
	 * smaller period than the one deduced from the nominal CPU
	 * frequency. Depending on the Turbo-Mode factor this might be fast
	 * enough to get the NMI period smaller than the hrtimer watchdog
	 * period and trigger false positives.
	 *
	 * The sample threshold is used to check in the NMI handler whether
	 * the minimum time between two NMI samples has elapsed. That
	 * prevents false positives.
	 *
	 * Set this to 4/5 of the actual watchdog threshold period so the
	 * hrtimer is guaranteed to fire at least once within the real
	 * watchdog threshold.
	 */
	watchdog_hrtimer_sample_threshold = period * 2;
}
static bool watchdog_check_timestamp(void)
{
	ktime_t delta, now = ktime_get_mono_fast_ns();
	delta = now - __this_cpu_read(last_timestamp);
	if (delta < watchdog_hrtimer_sample_threshold) {
		/*
		 * If ktime is jiffies based, a stalled timer would prevent
		 * jiffies from being incremented and the filter would look
		 * at a stale timestamp and never trigger.
		 */
		if (__this_cpu_inc_return(nmi_rearmed) < 10)
			return false;
	}
	__this_cpu_write(nmi_rearmed, 0);
	__this_cpu_write(last_timestamp, now);
	return true;
}
#else
static inline bool watchdog_check_timestamp(void)
{
	return true;
}
#endif
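The filter's behavior is easier to see with the clock injected by hand. The following user-space sketch mirrors the logic of watchdog_check_timestamp on a plain struct. struct nmi_filter and nmi_filter_check are hypothetical names, and now is passed in as nanoseconds instead of being read from ktime_get_mono_fast_ns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-CPU model of the NMI timestamp filter; times in ns. */
struct nmi_filter {
	int64_t last_timestamp;
	unsigned int nmi_rearmed;
	int64_t sample_threshold;  /* 2 * hrtimer period = 4/5 * watchdog_thresh */
};

/* Returns true when the NMI handler should go on to the hardlockup check. */
static bool nmi_filter_check(struct nmi_filter *f, int64_t now)
{
	int64_t delta = now - f->last_timestamp;

	assert(f->sample_threshold > 0);
	if (delta < f->sample_threshold) {
		/* too early: skip, unless the clock itself looks stalled */
		if (++f->nmi_rearmed < 10)
			return false;
	}
	f->nmi_rearmed = 0;
	f->last_timestamp = now;
	return true;
}
```

With the default watchdog_thresh of 10s the threshold is 8s, so an NMI arriving 4s after the previous accepted one is skipped. If the clock stops advancing, the tenth consecutive skipped NMI is let through so the detector cannot be silenced by a stale timestamp.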
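watchdog configuration interfaces
Enable or disable the watchdog:
/proc/sys/kernel/soft_watchdog: enable or disable softlockup detection
/proc/sys/kernel/nmi_watchdog: enable or disable hardlockup detection
/proc/sys/kernel/watchdog: enable or disable both softlockup and hardlockup detection; reading it returns the logical OR of soft_watchdog and nmi_watchdog.
Select which cores run the watchdog: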
/proc/sys/kernel/watchdog_cpumask
Set the lockup timeout:
/proc/sys/kernel/watchdog_thresh: sets the NMI watchdog timeout; softlockup_thresh is 2 * watchdog_thresh
Set what happens on timeout:
/proc/sys/kernel/hardlockup_panic: whether to panic when a hardlockup occurs
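These files hold plain integers, so they can be read from user space with ordinary file I/O. The sketch below is one way to do that; read_sysctl_long and softlockup_thresh_from are hypothetical helper names, and the caller is assumed to pass a path such as /proc/sys/kernel/watchdog_thresh, which may be absent on kernels built without the lockup detector.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer from a procfs-style file.
 * Returns 0 on success, -1 if the file is missing or holds no integer.
 */
static int read_sysctl_long(const char *path, long *out)
{
	FILE *fp;
	int ok;

	assert(path != NULL && out != NULL);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	ok = (fscanf(fp, "%ld", out) == 1);
	fclose(fp);
	return ok ? 0 : -1;
}

/* softlockup threshold derived from watchdog_thresh (2 * watchdog_thresh) */
static long softlockup_thresh_from(long watchdog_thresh)
{
	return 2 * watchdog_thresh;
}
```

A caller would pass the value read from watchdog_thresh to softlockup_thresh_from to obtain the effective softlockup timeout in seconds.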
Related source code
watchdog initialization
Execution flow of watchdog_enable (kernel/watchdog.c):
- Start the hrtimer
  - Set the hrtimer interrupt handler to watchdog_timer_fn
  - Set the timer period to 2 * watchdog_thresh / 5
- Create the cycles perf event: watchdog_nmi_enable -> hardlockup_detector_perf_enable -> hardlockup_detector_event_create
  - Set the NMI period to watchdog_thresh
  - Set the interrupt handler to watchdog_overflow_callback
static void watchdog_enable(unsigned int cpu)
{
	...
	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
	hrtimer->function = watchdog_timer_fn;
	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
		      HRTIMER_MODE_REL_PINNED_HARD);
	...
	if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
		watchdog_nmi_enable(cpu);
	...
}
hrtimer
Execution flow of watchdog_timer_fn (kernel/watchdog.c):
- Increment hrtimer_interrupts
- Insert softlockup_fn into the work_queue of the migration thread so that migration enters the run queue
- Check for softlockup
/* watchdog kicker functions */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
	...
	/* kick the hardlockup detector */
	watchdog_interrupt_count();
	...
	/* kick the softlockup detector */
	if (completion_done(this_cpu_ptr(&softlockup_completion))) {
		reinit_completion(this_cpu_ptr(&softlockup_completion));
		stop_one_cpu_nowait(smp_processor_id(),
				    softlockup_fn, NULL,
				    this_cpu_ptr(&softlockup_stop_work));
	}
	duration = is_softlockup(touch_ts);
	if (unlikely(duration)) {
		....
	}
	return HRTIMER_RESTART;
}
cycles NMI
The hardlockup check runs in watchdog_overflow_callback (kernel/watchdog_hld.c), the NMI callback that fires when the cycles counter overflows.
static void watchdog_overflow_callback(struct perf_event *event,
				       struct perf_sample_data *data,
				       struct pt_regs *regs)
{
	if (is_hardlockup()) {
		....
	}
	...
	return;
}
References
[1] lockup-watchdogs
