Why does a process in the RUNNING state call the schedule function
intro
Back when I read the kernel code, I only skimmed the main regular flow, that is, the normal paths under normal conditions, and rarely paid attention to exceptional or boundary scenarios. It is when you run into unusual problems that you start thinking, actively or passively, about these boundary cases.
One of the kernel's important jobs is to perform all kinds of scheduling (running/blocking processes), synchronization (resource access) and allocation (memory/CPU). With multi-core CPUs now the mainstream architecture, the kernel has to implement its own mutual exclusion and synchronization internally to handle these cases. Throughout all of this, how does the kernel keep itself consistent?
In particular, when a process blocks on a file operation (for example, reading input from a terminal), how is consistency guaranteed between the process going to sleep and the terminal waking it up? How are the boundary cases in between handled? Could it happen that the process decides there is no data and then goes to sleep, while input arrives between the check and the sleep? How does the kernel make sure that such input keeps the process from actually going to sleep (after it has already made the check)?
Calling schedule in the RUNNING state
For a process in the RUNNING state to end up calling schedule, the obvious cases include the following two simple ones:
- time slice
This is the simplest and most common case. Strictly speaking the process does not call schedule on its own; rather, the timer interrupt decides that the process has used up its time slice and takes the CPU away from it. The process stays in the RUNNING state throughout. CPU-bound processes typically hit this case.
- yield
Some processes voluntarily give up the CPU through the sched_yield system call. The process remains in the RUNNING state; it may simply be placed at the tail of the run queue, or be marked as having used up its current time slice (a minimal userspace sketch follows after this list).
The purpose of sched_yield() is to temporarily give up the processor to other processes. The process calling sched_yield() remains in the runnable state, and can normally expect to run again in short order. The 2.5 development series has made a subtle change in the way sched_yield() works, however. This call used to simply move the process to the end of the run queue; now it moves the process to the "expired" queue, effectively cancelling the rest of the process's time slice. So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again.
The corresponding implementation in the kernel (I am not sure it matches the semantics described above exactly, but it looks close):
///@file: linux\kernel\sched\fair.c
/*
* sched_yield() is very simple
*/
static void yield_task_fair(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
struct sched_entity *se = &curr->se;
/*
* Are we the only task in the tree?
*/
if (unlikely(rq->nr_running == 1))
return;
clear_buddies(cfs_rq, se);
update_rq_clock(rq);
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
/*
* Tell update_rq_clock() that we've just updated,
* so we don't do microscopic update in schedule()
* and double the fastpath cost.
*/
rq_clock_skip_update(rq);
se->deadline += calc_delta_fair(se->slice, se);
}
- others
What this post tries to analyze is neither of the two cases above, but some slightly subtler ones, in particular the issues the kernel has to watch out for in the face of concurrency and synchronization. Two come to mind: the case mentioned in many kernel write-ups, where schedule finds the process in TASK_INTERRUPTIBLE but a signal has already arrived; and a case that most material does not cover, where a TASK_INTERRUPTIBLE process is switched back to RUNNING directly by another CPU.
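Before moving on, here is the userspace side of the yield case mentioned above. This is only a minimal sketch (the loop and the printouts are purely illustrative); it shows that sched_yield() returns with the thread still runnable, matching the semantics quoted earlier.
///@file: yield_demo.c (illustrative userspace example, not kernel code)
#include <sched.h>
#include <stdio.h>

int main(void)
{
	for (int i = 0; i < 5; i++) {
		printf("iteration %d, yielding the CPU...\n", i);
		/* the calling thread stays runnable (RUNNING); the scheduler
		 * merely gets a chance to pick somebody else first */
		if (sched_yield() != 0)
			perror("sched_yield");
	}
	return 0;
}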
TASK_INTERRUPTIBLE with a pending signal inside schedule
The TASK_INTERRUPTIBLE-with-a-pending-signal case is easy to understand: TASK_INTERRUPTIBLE means the process can be woken by a signal, so if a signal is pending the process can clearly be scheduled. Put differently, TASK_INTERRUPTIBLE marks a process that is waiting for (among other things) a signal; if a signal has already arrived by the time we reach schedule, the process can (change state and) keep running.
From what you said above:
... if it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE ...
By leaving it in the run queue, it'll give the process an opportunity to handle its pending signals. A process ought to handle pending signals before it sleeps.
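From userspace this is the familiar EINTR behaviour: a blocking call that sleeps in TASK_INTERRUPTIBLE returns early when a signal arrives. Below is a minimal illustration (an assumption-laden sketch: it expects stdin to be a terminal with no input ready, and deliberately installs the handler without SA_RESTART so the read is not transparently restarted):
///@file: eintr_demo.c (illustrative userspace example, not kernel code)
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig)
{
	(void)sig;	/* nothing to do; the point is to interrupt read() */
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* note: SA_RESTART is NOT set */
	sigaction(SIGALRM, &sa, NULL);
	alarm(2);			/* SIGALRM in 2 seconds */

	char buf[64];
	ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));	/* interruptible sleep */
	if (n < 0 && errno == EINTR)
		printf("read() interrupted by a signal: %s\n", strerror(errno));
	else
		printf("read() returned %zd\n", n);
	return 0;
}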
The corresponding handling code is shown below (note the memory barrier, smp_mb__after_spinlock(), placed before the check):
///@file: linux\kernel\sched\core.c
static void __sched notrace __schedule(unsigned int sched_mode)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
///...
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up():
*
* __set_current_state(@state) signal_wake_up()
* schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
* wake_up_state(p, state)
* LOCK rq->lock LOCK p->pi_state
* smp_mb__after_spinlock() smp_mb__after_spinlock()
* if (signal_pending_state()) if (p->state & @state)
*
* Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr; this
* barrier matches a full barrier in the proximity of the membarrier
* system call exit.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
///...
/*
* We must load prev->state once (task_struct::state is volatile), such
* that we form a control dependency vs deactivate_task() below.
*/
prev_state = READ_ONCE(prev->__state);
if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
}
///...
}
The corresponding signal_wake_up wakeup path likewise ends up acquiring the same per-CPU runqueue lock that __schedule takes:
///@file: linux\kernel\sched\core.c
static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;
if (ttwu_queue_wakelist(p, cpu, wake_flags))
return;
rq_lock(rq, &rf);
update_rq_clock(rq);
ttwu_do_activate(rq, p, wake_flags, &rf);
rq_unlock(rq, &rf);
}
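That symmetric store/barrier/load pairing can be imitated in userspace with C11 atomics. The sketch below is only an analogue (task_state and sigpending are invented stand-ins, and the real kernel relies on the rq/pi locks plus smp_mb__after_spinlock() rather than plain fences), but it shows why the race cannot end with both sides missing each other: each side stores first and loads second, so at least one of them must observe the other's store.
///@file: state_vs_signal.c (illustrative userspace analogue; build with gcc -O2 -pthread)
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define TASK_RUNNING       0
#define TASK_INTERRUPTIBLE 1

static atomic_int  task_state;	/* stands in for p->__state     */
static atomic_bool sigpending;	/* stands in for TIF_SIGPENDING */

/* sleeper side: __set_current_state(TASK_INTERRUPTIBLE) ... __schedule() */
static void *sleeper(void *ret)
{
	atomic_store(&task_state, TASK_INTERRUPTIBLE);	/* STORE state  */
	atomic_thread_fence(memory_order_seq_cst);	/* full barrier */
	*(bool *)ret = atomic_load(&sigpending);	/* LOAD pending */
	return NULL;
}

/* waker side: signal_wake_up() */
static void *waker(void *ret)
{
	atomic_store(&sigpending, true);		/* STORE pending */
	atomic_thread_fence(memory_order_seq_cst);	/* full barrier  */
	*(bool *)ret = (atomic_load(&task_state) == TASK_INTERRUPTIBLE);	/* LOAD state */
	return NULL;
}

int main(void)
{
	bool sleeper_sees_signal = false, waker_sees_sleeper = false;
	pthread_t a, b;

	pthread_create(&a, NULL, sleeper, &sleeper_sees_signal);
	pthread_create(&b, NULL, waker, &waker_sees_sleeper);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* The ordering rules out the case where neither side observes the
	 * other, i.e. the wakeup cannot be lost. */
	printf("sleeper sees signal: %d, waker sees sleeper: %d\n",
	       sleeper_sees_signal, waker_sees_sleeper);
	return 0;
}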
poll/select
In userspace, waiting was historically done through poll/select, and this kind of wait is usually a wait for I/O resources.
Note the ordering in the code below: the state is first set to TASK_INTERRUPTIBLE, and only then is the wakeup flag (pwq->triggered) checked.
If pwq->triggered is still 0 when if (!pwq->triggered) is evaluated, no wakeup has happened yet; any wakeup that comes afterwards will overwrite the TASK_INTERRUPTIBLE state that was just set and change it to RUNNING, so the wakeup cannot be lost.
Conversely, if if (!pwq->triggered) were executed first and set_current_state(state) afterwards, a wakeup performed by another CPU between the two statements would be lost.
///@file: linux\fs\select.c
static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
ktime_t *expires, unsigned long slack)
{
int rc = -EINTR;
set_current_state(state);
if (!pwq->triggered)
rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
__set_current_state(TASK_RUNNING);
/*
* Prepare for the next iteration.
*
* The following smp_store_mb() serves two purposes. First, it's
* the counterpart rmb of the wmb in pollwake() such that data
* written before wake up is always visible after wake up.
* Second, the full barrier guarantees that triggered clearing
* doesn't pass event check of the next iteration. Note that
* this problem doesn't exist for the first iteration as
* add_wait_queue() has full barrier semantics.
*/
smp_store_mb(pwq->triggered, 0);
return rc;
}
In this flow the kernel registers its poll callbacks with the individual file descriptors without holding any lock, then sets its own state to TASK_INTERRUPTIBLE and calls schedule. If, after the registration but before the call to schedule, some file becomes ready and wakes the process, the process will be in the RUNNING state. When it then enters schedule, the state is seen to be RUNNING, so the task is not removed from the runqueue; it continues to receive CPU time and to be scheduled.
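The same "announce the sleep, then check, then actually sleep" shape can be reproduced in userspace with a futex, because FUTEX_WAIT re-checks the futex word atomically inside the kernel and returns immediately if it has already changed. The sketch below is only an analogue of poll_schedule_timeout()/pollwake(); triggered, wait_for_event and the waker thread are invented names for illustration (Linux only; build with gcc -O2 -pthread):
///@file: futex_wait_demo.c (illustrative userspace analogue, not kernel code)
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int triggered;	/* stands in for pwq->triggered */

static void wait_for_event(void)
{
	/* FUTEX_WAIT re-checks the value inside the kernel: if triggered has
	 * already become non-zero, it returns at once instead of sleeping,
	 * so a wakeup racing with this check is never lost. */
	while (atomic_load(&triggered) == 0)
		syscall(SYS_futex, &triggered, FUTEX_WAIT, 0, NULL, NULL, 0);
}

static void *waker(void *arg)
{
	(void)arg;
	usleep(1000);				/* pretend an fd became ready */
	atomic_store(&triggered, 1);		/* pollwake() analogue        */
	syscall(SYS_futex, &triggered, FUTEX_WAKE, 1, NULL, NULL, 0);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, waker, NULL);
	wait_for_event();
	printf("woken up, triggered=%d\n", atomic_load(&triggered));
	pthread_join(t, NULL);
	return 0;
}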
The early select implementation
In somewhat older kernel versions (for example 2.6.12 [https://github.com/torvalds/linux/blob/v2.6.12/fs/select.c]), the implementation of the select system call was fairly simple and direct:
Inside the loop, before any poll callback is invoked, the current process is unconditionally put into the signal-wakeable sleep state via set_current_state(TASK_INTERRUPTIBLE). Note: at this point the task is still on the CPU's run queue.
///@file:
int do_select(int n, fd_set_bits *fds, long *timeout)
{
struct poll_wqueues table;
poll_table *wait;
int retval, i;
long __timeout = *timeout;
spin_lock(&current->files->file_lock);
retval = max_select_fd(n, fds);
spin_unlock(&current->files->file_lock);
if (retval < 0)
return retval;
n = retval;
poll_initwait(&table);
wait = &table.pt;
if (!__timeout)
wait = NULL;
retval = 0;
for (;;) {
unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
set_current_state(TASK_INTERRUPTIBLE);
inp = fds->in; outp = fds->out; exp = fds->ex;
rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
unsigned long in, out, ex, all_bits, bit = 1, mask, j;
unsigned long res_in = 0, res_out = 0, res_ex = 0;
struct file_operations *f_op = NULL;
struct file *file = NULL;
in = *inp++; out = *outp++; ex = *exp++;
all_bits = in | out | ex;
if (all_bits == 0) {
i += __NFDBITS;
continue;
}
for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
if (i >= n)
break;
if (!(bit & all_bits))
continue;
file = fget(i);
if (file) {
f_op = file->f_op;
mask = DEFAULT_POLLMASK;
if (f_op && f_op->poll)
mask = (*f_op->poll)(file, retval ? NULL : wait);
fput(file);
if ((mask & POLLIN_SET) && (in & bit)) {
res_in |= bit;
retval++;
}
if ((mask & POLLOUT_SET) && (out & bit)) {
res_out |= bit;
retval++;
}
if ((mask & POLLEX_SET) && (ex & bit)) {
res_ex |= bit;
retval++;
}
}
cond_resched();
}
if (res_in)
*rinp = res_in;
if (res_out)
*routp = res_out;
if (res_ex)
*rexp = res_ex;
}
wait = NULL;
if (retval || !__timeout || signal_pending(current))
break;
if(table.error) {
retval = table.error;
break;
}
__timeout = schedule_timeout(__timeout);
}
__set_current_state(TASK_RUNNING);
poll_freewait(&table);
/*
* Up-to-date the caller timeout.
*/
*timeout = __timeout;
return retval;
}
Whether the process really needs to be removed from the run queue (and thereby stop receiving CPU time) is then decided by the schedule function, mainly based on the process state. Suppose that right after f_op->poll runs inside select, some file wakes the process up. When schedule then executes, it finds the process in the RUNNING state, so the process simply returns from schedule and continues running (provided, of course, its time slice has not been used up).
///@file: https://github.com/torvalds/linux/blob/9ee1c939d1cb936b1f98e8d81aeffab57bae46ab/kernel/sched.c#L2607
asmlinkage void __sched schedule(void)
{
long *switch_count;
task_t *prev, *next;
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
unsigned long long now;
unsigned long run_time;
int cpu, idx;
/*
* Test if we are atomic. Since do_exit() needs to call into
* schedule() atomically, we ignore that path for now.
* Otherwise, whine if we are scheduling when we should not be.
*/
if (likely(!current->exit_state)) {
if (unlikely(in_atomic())) {
printk(KERN_ERR "scheduling while atomic: "
"%s/0x%08x/%d\n",
current->comm, preempt_count(), current->pid);
dump_stack();
}
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));
need_resched:
preempt_disable();
prev = current;
release_kernel_lock(prev);
need_resched_nonpreemptible:
rq = this_rq();
/*
* The idle thread is not allowed to schedule!
* Remove this check after it has been exercised a bit.
*/
if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
}
schedstat_inc(rq, sched_cnt);
now = sched_clock();
if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
run_time = now - prev->timestamp;
if (unlikely((long long)(now - prev->timestamp) < 0))
run_time = 0;
} else
run_time = NS_MAX_SLEEP_AVG;
/*
* Tasks charged proportionately less run_time at high sleep_avg to
* delay them losing their interactive status
*/
run_time /= (CURRENT_BONUS(prev) ? : 1);
spin_lock_irq(&rq->lock);
if (unlikely(prev->flags & PF_DEAD))
prev->state = EXIT_DEAD;
switch_count = &prev->nivcsw;
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
switch_count = &prev->nvcsw;
if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
if (prev->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
}
cpu = smp_processor_id();
if (unlikely(!rq->nr_running)) {
go_idle:
idle_balance(cpu, rq);
if (!rq->nr_running) {
next = rq->idle;
rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu, rq);
/*
* wake_sleeping_dependent() might have released
* the runqueue, so break out if we got new
* tasks meanwhile:
*/
if (!rq->nr_running)
goto switch_tasks;
}
} else {
if (dependent_sleeper(cpu, rq)) {
next = rq->idle;
goto switch_tasks;
}
/*
* dependent_sleeper() releases and reacquires the runqueue
* lock, hence go into the idle loop if the rq went
* empty meanwhile:
*/
if (unlikely(!rq->nr_running))
goto go_idle;
}
array = rq->active;
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
}
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
if (!rt_task(next) && next->activated > 0) {
unsigned long long delta = now - next->timestamp;
if (unlikely((long long)(now - next->timestamp) < 0))
delta = 0;
if (next->activated == 1)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
array = next->array;
dequeue_task(next, array);
recalc_task_prio(next, next->timestamp + delta);
enqueue_task(next, array);
}
next->activated = 0;
switch_tasks:
if (next == rq->idle)
schedstat_inc(rq, sched_goidle);
prefetch(next);
clear_tsk_need_resched(prev);
rcu_qsctr_inc(task_cpu(prev));
update_cpu_clock(prev, rq, now);
prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0)
prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
++*switch_count;
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next);
barrier();
finish_task_switch(prev);
} else
spin_unlock_irq(&rq->lock);
prev = current;
if (unlikely(reacquire_kernel_lock(prev) < 0))
goto need_resched_nonpreemptible;
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}
Note that in the sched.h header the TASK_RUNNING macro has the value 0,
///@file: https://github.com/torvalds/linux/blob/v2.6.12/include/linux/sched.h
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
and note how that corresponds to the condition if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) in the schedule function: a non-zero prev->state means precisely "not TASK_RUNNING".
///@file: https://github.com/torvalds/linux/blob/v2.6.12/kernel/sched.c
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
switch_count = &prev->nvcsw;
if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
if (prev->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
}
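Spelled out with the macro values quoted above, the decision reduces to the following toy rendering (not kernel code; signal_pending() is replaced with a plain flag, and the PREEMPT_ACTIVE half of the condition is omitted):
///@file: state_check_demo.c (illustrative toy rendering, not kernel code)
#include <stdio.h>

#define TASK_RUNNING       0
#define TASK_INTERRUPTIBLE 1

int main(void)
{
	long prev_state = TASK_INTERRUPTIBLE;	/* set before entering schedule()    */
	int  signal_is_pending = 1;		/* toy stand-in for signal_pending() */

	if (prev_state /* non-zero means "not TASK_RUNNING" */) {
		if ((prev_state & TASK_INTERRUPTIBLE) && signal_is_pending)
			prev_state = TASK_RUNNING;	/* keep the task on the runqueue */
		else
			printf("deactivate_task(): drop the task from the runqueue\n");
	}
	printf("state after the check: %ld\n", prev_state);
	return 0;
}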
Why the code was changed
An obvious benefit is that if some file wakes the process up after the poll calls but before schedule, entering schedule (and the unnecessary CPU work that implies) can be avoided.
However, the commit log for this code shows that the change was made not for performance, but to lift the restriction that f_op->poll was not allowed to sleep. I still do not fully understand everything the commit message below describes, but I am keeping a copy here for reference:
poll: allow f_op->poll to sleep
f_op->poll is the only vfs operation which is not allowed to sleep. It's because poll and select implementation used task state to synchronize against wake ups, which doesn't have to be the case anymore as wait/wake interface can now use custom wake up functions. The non-sleep restriction can be a bit tricky because ->poll is not called from an atomic context and the result of accidentally sleeping in ->poll only shows up as temporary busy looping when the timing is right or rather wrong.

This patch converts poll/select to use custom wake up function and use separate triggered variable to synchronize against wake up events. The only added overhead is an extra function call during wake up and negligible.

This patch removes the one non-sleep exception from vfs locking rules and is beneficial to userland filesystem implementations like FUSE, 9p or peculiar fs like spufs as it's very difficult for those to implement non-sleeping poll method.

While at it, make the following cosmetic changes to make poll.h and select.c checkpatch friendly.
outro
I remember having thought about this kind of question, "how does the kernel avoid losing wakeups", before; I just did not have the ability to figure it out back then, and spent most of my time on the low-hanging fruit instead. Having the time and energy now to revisit the problem feels a bit like staying true to the original aspiration :-)?