Dissecting Linux Dirty Page Writeback Control

Background

In a production environment, we hit an issue where processes blocked while writing to a FUSE filesystem, caused by dirty page writeback. This article walks through the dirty page writeback logic around that problem, analyzes the root cause, and presents a fix in the final section.
The analysis is based on Linux kernel 5.10 code.

Problem description

Symptom: a process blocks for a long time while writing data to a FUSE filesystem.
The process stack:

[<0>] __switch_to+0xdc/0x148
[<0>] balance_dirty_pages+0x688/0x6fc
[<0>] balance_dirty_pages_ratelimited+0xf0/0x23c
[<0>] generic_perform_write+0x14c/0x1b4
[<0>] __generic_file_write_iter+0x118/0x148
[<0>] generic_file_write_iter+0x5c/0xc4
[<0>] fuse_file_write_iter+0xc0/0x2f0 [fuse]
[<0>] new_sync_write+0x80/0xd8
[<0>] vfs_write+0x114/0x138
[<0>] ksys_write+0x78/0xe4
[<0>] __arm64_sys_write+0x20/0x2c
[<0>] el0_svc_common.constprop.0+0x13c/0x1ec
[<0>] do_el0_svc+0x8c/0x98
[<0>] el0_svc+0x20/0x30
[<0>] el0_sync_handler+0xb4/0x134
[<0>] el0_sync+0x1a0/0x1c0

Judging from the call addresses in the stack, the process is blocked in io_schedule_timeout(pause) inside balance_dirty_pages.
The pause value is fairly large, around 80~100 ms, and the call is hit so frequently that the process appears permanently blocked.

This matches the issue recorded in the article 浅析 fuse kernel mmap write 过程及性能问题 (a brief analysis of the fuse kernel mmap write path and its performance problems).
That author's fix is to drop the strictlimit capability; this article closes with an alternative solution.

  • Below is the analysis of the writeback logic; to read only the problem analysis and the fix, jump straight to the last section.

Basic concepts

Page cache

Reading and writing a disk is far slower than reading and writing memory (DDR4 is roughly 500x faster than a mechanical hard disk and 200x faster than an SSD), so to avoid touching the disk on every file read or write, the Linux kernel uses the page cache to cache file data. Almost all file I/O goes through this cache, unless the file is opened with the O_DIRECT flag.

Dirty page writeback

When a file is written, the data actually lands in the page cache; a page holding newly written data is a dirty page. Dirty pages must be written back to disk at some point, and that process is dirty page writeback.

The FUSE filesystem

Only a brief introduction here.
Architecture:

FUSE consists of a kernel module and a user-space daemon (the "fuse daemon" below). When loaded, the kernel module registers a fuse filesystem driver with the Linux virtual filesystem (VFS). It also registers a character device, /dev/fuse, which serves as the bridge between the fuse daemon and the kernel: the daemon reads fuse requests from /dev/fuse, handles them, and writes the replies back to /dev/fuse.

Basic request flow:
When an application on a mounted FUSE filesystem issues system calls, the VFS routes the operations to the fuse driver. The driver builds a fuse request structure and puts it on a request queue. The fuse daemon dequeues requests from the kernel by reading /dev/fuse and submits the operations to the underlying filesystem (e.g. ext4 or F2FS, or network backends such as rbd or NFS). Once a request has been handled, the daemon writes the reply back to /dev/fuse, and the fuse driver marks the request as completed.

See also:
FUSE文件系统

Dirty page writeback tunables

Tunable                    Purpose                                                                      Default
dirty_background_ratio     Dirty pages as % of dirtyable memory; above it, wake the writeback threads   10
dirty_background_bytes     Dirty pages as absolute bytes; above it, wake the writeback threads          0
dirty_ratio                Dirty pages as % of dirtyable memory; above it, pause the dirtying process   20
dirty_bytes                Dirty pages as absolute bytes; above it, pause the dirtying process          0
dirty_expire_centisecs     Age after which dirty data must be written back (unit: 1/100 s)              3000
dirty_writeback_centisecs  Wakeup interval of the writeback thread (unit: 1/100 s)                      500

Notes:

  • dirtyable memory = free pages - reserved pages + NR_INACTIVE_FILE + NR_ACTIVE_FILE
  • each bdi has its own writeback thread, implemented as a Linux worker (workqueue)
  • by default, every 5 s all bdi writeback threads are woken to write back dirty data

Dirty page throttling principles

The dirty page throttling algorithm in today's kernel was proposed by Fengguang Wu:
IO-less Dirty Throttling


The basic idea:

  1. An app writing data into the page cache is the dirty page producer; the writeback thread (the flusher in the diagram) writing dirty pages back to disk is the consumer.
  2. On a write, the app enters the kernel via a system call; if the system-wide dirty page percentage has reached dirty_background_ratio, the writeback thread is started.
  3. At this point the producer is still faster than the consumer, so dirty page growth slows but the total keeps rising.
  4. Once the dirty percentage exceeds dirty_ratio (as drawn in the diagram; in reality the trigger is freerun_ratio = (dirty_background_ratio + dirty_ratio) / 2), the producer is throttled by putting the process to sleep for a while.
  5. When the producer rate equals the consumer rate, dirty pages stop growing and hold at some level (nr_dirty in the diagram). To steer that level toward the ideal value (setpoint = (freerun_ratio + dirty_ratio) / 2), pos_ratio is introduced as a control parameter to adjust the water mark.

With the principles in mind, the calculations below are easy to follow, and the actual implementation is built on this same framework:

/*
N is the number of processes concurrently writing to the bdi device,
estimated as follows:
When started N dd, throttle each dd at
ratelimit_0 (any non-zero initial value is OK)
After 200ms, we measure
dirty_rate = (new dirtied pages by apps in 200ms) / 200ms
to estimate N:
N = dirty_rate / ratelimit_0
*/

/*
bdi_bandwidth is the write bandwidth of the bdi device.
Set the per-app throttle rate = bdi device bandwidth / N
*/
balanced_ratelimit = bdi_bandwidth / N;

/* balanced_ratelimit contains discrete noise; after outlier removal and smoothing we get dirty_ratelimit */
if (dirty_ratelimit > both balanced_ratelimit, task_ratelimit)
  dirty_ratelimit => max(balanced_ratelimit, task_ratelimit)
if (dirty_ratelimit < both balanced_ratelimit, task_ratelimit)
  dirty_ratelimit => min(balanced_ratelimit, task_ratelimit)

/*
pos_ratio is the water-mark control parameter:
= 1.0 if at setpoint 
> 1.0 if below setpoint
< 1.0 if above setpoint
*/
task_ratelimit = dirty_ratelimit * pos_ratio

/* The ultimate means of dirty page throttling is blocking the app; compute the process sleep duration: */
period = pages_dirtied / task_ratelimit;
io_schedule_timeout(period);

The crux of the implementation is computing
N, bdi_bandwidth, dirty_ratelimit, and pos_ratio.
Let's see how the code actually does it:

数据结构

backing_dev_info

/*
Holds the dirty page writeback state for one backing (block) device.
The bdi operations are implemented in backing-dev.c:
bdi_alloc: allocate and initialize a bdi
bdi_register: register the bdi, inserting it into bdi_tree and bdi_list
bdi_put: decrement the bdi refcount; release the bdi when it hits 0
bdi_dev_name: get the bdi name
min_ratio_store: set min ratio
max_ratio_store: set max ratio
*/
struct backing_dev_info {
	u64 id;
	// rb_node and bdi_list link the bdi into a red-black tree and a list
	struct rb_node rb_node; /* keyed by ->id */
	struct list_head bdi_list;
	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
	unsigned long io_pages;	/* max allowed IO size */

	struct kref refcnt;	/* Reference counter for the structure */
	// currently BDI_CAP_WRITEBACK, BDI_CAP_WRITEBACK_ACCT, BDI_CAP_STRICTLIMIT
	unsigned int capabilities; /* Device capabilities */
	// the min_ratio and max_ratio settings
	unsigned int min_ratio;
	unsigned int max_ratio, max_prop_frac;

	/*
	 * Sum of avg_write_bw of wbs with dirty inodes.  > 0 if there are
	 * any dirty wbs, which is depended upon by bdi_has_dirty().
	 */
	// sum of avg_write_bw over all wbs under this bdi
	atomic_long_t tot_write_bandwidth;

	/*
	 * With cgroup writeback enabled, one bdi can have multiple cgwbs, all
	 * linked into wb_list.
	 * With it disabled, bdi and wb are one-to-one.
	 */
	struct bdi_writeback wb;  /* the root writeback info for this bdi */
	struct list_head wb_list; /* list of all wbs */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
#endif
	wait_queue_head_t wb_waitq;

	struct device *dev;
	char dev_name[64];  // bdi device name
	struct device *owner;

	struct timer_list laptop_mode_wb_timer;

#ifdef CONFIG_DEBUG_FS
	struct dentry *debug_dir;
#endif

	ANDROID_KABI_RESERVE(1);
	ANDROID_KABI_RESERVE(2);
};

bdi_writeback

/*
 * Each wb (bdi_writeback) can perform writeback operations, is measured
 * and throttled, independently.  Without cgroup writeback, each bdi
 * (bdi_writeback) is served by its embedded bdi->wb.
 *
 * On the default hierarchy, blkcg implicitly enables memcg.  This allows
 * using memcg's page ownership for attributing writeback IOs, and every
 * memcg - blkcg combination can be served by its own wb by assigning a
 * dedicated wb to each memcg, which enables isolation across different
 * cgroups and propagation of IO back pressure down from the IO layer upto
 * the tasks which are generating the dirty pages to be written back.
 *
 * A cgroup wb is indexed on its bdi by the ID of the associated memcg,
 * refcounted with the number of inodes attached to it, and pins the memcg
 * and the corresponding blkcg.  As the corresponding blkcg for a memcg may
 * change as blkcg is disabled and enabled higher up in the hierarchy, a wb
 * is tested for blkcg after lookup and removed from index on mismatch so
 * that a new wb for the combination can be created.
 */
/*
 * bdi_writeback: per-bdi writeback state.
 * cgroup writeback disabled: bdi and wb are one-to-one. **The rest of this
 * article assumes this one-to-one case.**
 * cgroup writeback enabled: one bdi can have multiple wbs.
 * The cgroup writeback feature supports per-cgroup writeback throttling; it is
 * fairly involved and not discussed here.
 * See: https://www.alibabacloud.com/help/zh/alinux/user-guide/enable-the-cgroup-writeback-feature
 * cgroup writeback needs memcg and blkcg to cooperate.
 * memcg/blkcg rules:
 * 1. one-to-one: processes A and B in the same memcg can only map to the same blkcg
 * 2. many-to-one: processes A and B in different memcgs may map to the same blkcg or to different ones
 * 3. each memcg-blkcg combination corresponds to one cgroup wb
*/
struct bdi_writeback {
	struct backing_dev_info *bdi;	/* our parent bdi */

	unsigned long state;		/* Always use atomic bitops on this */
	unsigned long last_old_flush;	/* last old data flush */

	struct list_head b_dirty;	/* dirty inodes */
	struct list_head b_io;		/* parked for writeback */
	struct list_head b_more_io;	/* parked for more writeback */
	struct list_head b_dirty_time;	/* time stamps are dirty */
	spinlock_t list_lock;		/* protects the b_* lists */

	struct percpu_counter stat[NR_WB_STAT_ITEMS];

	unsigned long congested;	/* WB_[a]sync_congested flags */

	// timestamp of the last bandwidth update
	unsigned long bw_time_stamp;	/* last time write bw is updated */
	// dirtied count at the last bandwidth update
	unsigned long dirtied_stamp;
	// written count at the last bandwidth update
	unsigned long written_stamp;	/* pages written at bw_time_stamp */
	// estimated writeback bandwidth
	unsigned long write_bandwidth;	/* the estimated write bandwidth */
	// smoothed writeback bandwidth
	unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */

	/*
	 * The base dirty throttle rate, re-calculated on every 200ms.
	 * All the bdi tasks' dirty rate will be curbed under it.
	 * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
	 * in small steps and is much more smooth/stable than the latter.
	 */
	// smoothed throttle rate, in pages/s
	unsigned long dirty_ratelimit;
	// raw throttle rate
	unsigned long balanced_dirty_ratelimit;

	/*
	 * percpu counter, incremented for each page whose writeback completes.
	 * Note: increments accumulate in the per-cpu counters first and are folded
	 * into the global count only past the per-cpu batch threshold, so reading
	 * the global count carries some error.
	 * completions is not necessarily monotonic: to stay fresh it decays with
	 * the configured period.
	*/
	struct fprop_local_percpu completions;
	// whether dirty pages exceed the threshold; if so, balance_dirty_pages is called more often
	int dirty_exceeded;
	enum wb_reason start_all_reason;

	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
	struct list_head work_list;
	// dwork: the worker; writeback is performed here
	struct delayed_work dwork;	/* work item used for writeback */

	unsigned long dirty_sleep;	/* last wait */

	struct list_head bdi_node;	/* anchored at bdi->wb_list */

#ifdef CONFIG_CGROUP_WRITEBACK
	struct percpu_ref refcnt;	/* used only for !root wb's */
	struct fprop_local_percpu memcg_completions;
	struct cgroup_subsys_state *memcg_css; /* the associated memcg */
	struct cgroup_subsys_state *blkcg_css; /* and blkcg */
	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */

	union {
		struct work_struct release_work;
		struct rcu_head rcu;
	};
#endif

	ANDROID_KABI_RESERVE(1);
	ANDROID_KABI_RESERVE(2);
};

dirty_throttle_control

struct dirty_throttle_control {
#ifdef CONFIG_CGROUP_WRITEBACK
	struct wb_domain	*dom;
	struct dirty_throttle_control *gdtc;	/* only set in memcg dtc's */
#endif
	struct bdi_writeback	*wb;
	struct fprop_local_percpu *wb_completions;

	// global parameters
	unsigned long		avail;		/* dirtyable */
	unsigned long		dirty;		/* file_dirty + write + nfs */
	unsigned long		thresh;		/* dirty threshold */
	unsigned long		bg_thresh;	/* dirty background threshold */

	// wb parameters
	unsigned long		wb_dirty;	/* per-wb counterparts */
	unsigned long		wb_thresh;
	unsigned long		wb_bg_thresh;

	unsigned long		pos_ratio;
};

wb_domain

/*
 * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
 * and are measured against each other in.  There always is one global
 * domain, global_wb_domain, that every wb in the system is a member of.
 * This allows measuring the relative bandwidth of each wb to distribute
 * dirtyable memory accordingly.
 */
/*
 * The domain a wb belongs to. There is a single global_wb_domain and every wb
 * is in it (assuming cgroup writeback is disabled; cgwbs are not considered).
 * completions: total pages written back by all wbs, used to compute one wb's
 * share of the total over a period, which in turn feeds wb_thresh. Note this
 * completions count is not necessarily monotonic: it decays exponentially per
 * period to stay fresh.
*/
struct wb_domain {
	spinlock_t lock;

	/*
	 * Scale the writeback cache size proportional to the relative
	 * writeout speed.
	 *
	 * We do this by keeping a floating proportion between BDIs, based
	 * on page writeback completions [end_page_writeback()]. Those
	 * devices that write out pages fastest will get the larger share,
	 * while the slower will get a smaller share.
	 *
	 * We use page writeout completions because we are interested in
	 * getting rid of dirty pages. Having them written out is the
	 * primary goal.
	 *
	 * We introduce a concept of time, a period over which we measure
	 * these events, because demand can/will vary over time. The length
	 * of this period itself is measured in page writeback completions.
	 */
	struct fprop_global completions;
	struct timer_list period_timer;	/* timer for aging of completions */
	unsigned long period_time;

	/*
	 * The dirtyable memory and dirty threshold could be suddenly
	 * knocked down by a large amount (eg. on the startup of KVM in a
	 * swapless system). This may throw the system into deep dirty
	 * exceeded state and throttle heavy/light dirtiers alike. To
	 * retain good responsiveness, maintain global_dirty_limit for
	 * tracking slowly down to the knocked down dirty threshold.
	 *
	 * Both fields are protected by ->lock.
	 */
	unsigned long dirty_limit_tstamp;
	unsigned long dirty_limit;
};

Key function walkthrough

When a process writes a file, the data first goes into the page cache; then balance_dirty_pages_ratelimited is called to check the dirty page count and react once a threshold is crossed.
The call stack:

[<0>] balance_dirty_pages+0x688/0x6fc
[<0>] balance_dirty_pages_ratelimited+0xf0/0x23c
[<0>] generic_perform_write+0x14c/0x1b4
[<0>] __generic_file_write_iter+0x118/0x148
[<0>] generic_file_write_iter+0x5c/0xc4
[<0>] fuse_file_write_iter+0xc0/0x2f0 [fuse]
[<0>] new_sync_write+0x80/0xd8
[<0>] vfs_write+0x114/0x138
[<0>] ksys_write+0x78/0xe4

balance_dirty_pages_ratelimited

/**
 * balance_dirty_pages_ratelimited - balance dirty memory state
 * @mapping: address_space which was dirtied
 *
 * Processes which are dirtying memory should call in here once for each page
 * which was newly dirtied.  The function will periodically check the system's
 * dirty state and will initiate writeback if needed.
 *
 * On really big machines, get_writeback_state is expensive, so try to avoid
 * calling it too often (ratelimiting).  But once we're over the dirty memory
 * limit we decrease the ratelimiting by a lot, to prevent individual processes
 * from overshooting the limit by (ratelimit_pages) each.
 */
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	struct inode *inode = mapping->host;
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	struct bdi_writeback *wb = NULL;
	int ratelimit;
	int *p;

	if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
		return;

	if (inode_cgwb_enabled(inode))
		wb = wb_get_create_current(bdi, GFP_KERNEL);
	if (!wb)
		wb = &bdi->wb;

	// ratelimit decides whether to call balance_dirty_pages to balance dirty pages
	// set at fork to 32 (on 4K-page systems): p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
	ratelimit = current->nr_dirtied_pause;
	// dirty_exceeded means dirty pages already exceed the threshold; shrink the value, to 8
	if (wb->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

	preempt_disable();
	/*
	 * This prevents one CPU to accumulate too many dirtied pages without
	 * calling into balance_dirty_pages(), which can happen when there are
	 * 1000+ tasks, all of them start dirtying pages at exactly the same
	 * time, hence all honoured too large initial task->nr_dirtied_pause.
	 */
	// bdp_ratelimits: per-cpu dirtied page count; once it exceeds ratelimit_pages, force ratelimit to 0
	p =  this_cpu_ptr(&bdp_ratelimits);
	if (unlikely(current->nr_dirtied >= ratelimit))
		*p = 0;
	else if (unlikely(*p >= ratelimit_pages)) {
		*p = 0;
		ratelimit = 0;
	}
	/*
	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
	 * the dirty throttling and livelock other long-run dirtiers.
	 */
	// pick up dirty pages leaked by short-lived tasks
	p = this_cpu_ptr(&dirty_throttle_leaks);
	if (*p > 0 && current->nr_dirtied < ratelimit) {
		unsigned long nr_pages_dirtied;
		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
		*p -= nr_pages_dirtied;
		current->nr_dirtied += nr_pages_dirtied;
	}
	preempt_enable();

	// this process has dirtied >= ratelimit pages; call balance_dirty_pages to balance them
	if (unlikely(current->nr_dirtied >= ratelimit))
		balance_dirty_pages(wb, current->nr_dirtied);

	wb_put(wb);
}

balance_dirty_pages

/*
 * balance_dirty_pages() must be called by processes which are generating dirty
 * data.  It looks at the number of dirty pages in the machine and will force
 * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
 * If we're over `background_thresh' then the writeback threads are woken to
 * perform some writeout.
 */
static void balance_dirty_pages(struct bdi_writeback *wb,
				unsigned long pages_dirtied)
{
	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
	struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
	struct dirty_throttle_control * const gdtc = &gdtc_stor; // global domain
	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? // memcg domain, not considered here
						     &mdtc_stor : NULL;
	struct dirty_throttle_control *sdtc; // selected from gdtc or mdtc
	unsigned long nr_reclaimable;	/* = file_dirty */
	long period;
	long pause;
	long max_pause;
	long min_pause;
	int nr_dirtied_pause;
	bool dirty_exceeded = false;
	unsigned long task_ratelimit;
	unsigned long dirty_ratelimit;
	struct backing_dev_info *bdi = wb->bdi;
	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
	unsigned long start_time = jiffies;

	for (;;) {
		unsigned long now = jiffies;
		unsigned long dirty, thresh, bg_thresh;
		unsigned long m_dirty = 0;	/* stop bogus uninit warnings */
		unsigned long m_thresh = 0;
		unsigned long m_bg_thresh = 0;

		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
		// memory available for dirty pages = free pages - reserved + NR_INACTIVE_FILE + NR_ACTIVE_FILE
		gdtc->avail = global_dirtyable_memory();
		// all dirty pages in the system; global_node_page_state(NR_WRITEBACK) is pages currently under writeback
		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);

		// compute bg_thresh and thresh from the dirty_background_ratio and dirty_ratio tunables
		// bg_thresh = dirty_background_ratio * gdtc->avail, default 10%, in pages;
		// thresh = dirty_ratio * gdtc->avail, default 20%;
		domain_dirty_limits(gdtc);

		// if the bdi has the strictlimit capability (e.g. fuse), dirty and thresh
		// must be computed from the per-wb numbers
		// strictlimit is literal: a bdi with this cap restricts dirty page growth more strictly than others
		if (unlikely(strictlimit)) {
			// compute the wb's dirty pages, thresh and bg_thresh,
			// i.e. wb_dirty, wb_thresh, wb_bg_thresh
			wb_dirty_limits(gdtc);

			dirty = gdtc->wb_dirty;
			thresh = gdtc->wb_thresh;
			bg_thresh = gdtc->wb_bg_thresh;
		} else {
			dirty = gdtc->dirty;
			thresh = gdtc->thresh;
			bg_thresh = gdtc->bg_thresh;
		}

		// memcg/blkio related limits
		if (mdtc) {
			unsigned long filepages, headroom, writeback;

			/*
			 * If @wb belongs to !root memcg, repeat the same
			 * basic calculations for the memcg domain.
			 */
			mem_cgroup_wb_stats(wb, &filepages, &headroom,
					    &mdtc->dirty, &writeback);
			mdtc->dirty += writeback;
			mdtc_calc_avail(mdtc, filepages, headroom);

			domain_dirty_limits(mdtc);

			if (unlikely(strictlimit)) {
				wb_dirty_limits(mdtc);
				m_dirty = mdtc->wb_dirty;
				m_thresh = mdtc->wb_thresh;
				m_bg_thresh = mdtc->wb_bg_thresh;
			} else {
				m_dirty = mdtc->dirty;
				m_thresh = mdtc->thresh;
				m_bg_thresh = mdtc->bg_thresh;
			}
		}

		trace_android_vh_mm_dirty_limits(gdtc, strictlimit, dirty, bg_thresh,
				nr_reclaimable, pages_dirtied);

		/*
		 * Throttle it only when the background writeback cannot
		 * catch-up. This avoids (excessively) small writeouts
		 * when the wb limits are ramping up in case of !strictlimit.
		 *
		 * In strictlimit case make decision based on the wb counters
		 * and limits. Small writeouts when the wb limits are ramping
		 * up are the price we consciously pay for strictlimit-ing.
		 *
		 * If memcg domain is in effect, @dirty should be under
		 * both global and memcg freerun ceilings.
		 */
		// dirty_freerun_ceiling = (thresh + bg_thresh) / 2
		// freerun means: do not throttle the process, only wake the writeback thread
		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
		    (!mdtc ||
		     m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
			unsigned long intv;
			unsigned long m_intv;

free_running:
			intv = dirty_poll_interval(dirty, thresh);
			m_intv = ULONG_MAX;

			current->dirty_paused_when = now;
			current->nr_dirtied = 0;
			if (mdtc)
				m_intv = dirty_poll_interval(m_dirty, m_thresh);
			// update nr_dirtied_pause, used as the ratelimit value in balance_dirty_pages_ratelimited
			current->nr_dirtied_pause = min(intv, m_intv);
			break;
		}

		// dirty > dirty_freerun_ceiling: start the background writeback task first
		if (unlikely(!writeback_in_progress(wb)))
			wb_start_background_writeback(wb);

		mem_cgroup_flush_foreign(wb);

		/*
		 * Calculate global domain's pos_ratio and select the
		 * global dtc by default.
		 */
		if (!strictlimit) {
			// compute the wb's dirty pages, thresh and bg_thresh,
			// i.e. wb_dirty, wb_thresh, wb_bg_thresh
			wb_dirty_limits(gdtc);
			// if the wb's dirty pages are below the wb freerun ceiling, goto free_running: no throttling
			if ((current->flags & PF_LOCAL_THROTTLE) &&
			    gdtc->wb_dirty <
			    dirty_freerun_ceiling(gdtc->wb_thresh,
						  gdtc->wb_bg_thresh))
				/*
				 * LOCAL_THROTTLE tasks must not be throttled
				 * when below the per-wb freerun ceiling.
				 */
				goto free_running;
		}

		// from here on, consider throttling the process

		// both the global dirty count and the wb dirty count exceed their thresholds: mark dirty_exceeded
		dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
			((gdtc->dirty > gdtc->thresh) || strictlimit);

		// compute pos_ratio
		wb_position_ratio(gdtc);
		sdtc = gdtc;

		if (mdtc) {
			/*
			 * If memcg domain is in effect, calculate its
			 * pos_ratio.  @wb should satisfy constraints from
			 * both global and memcg domains.  Choose the one
			 * w/ lower pos_ratio.
			 */
			if (!strictlimit) {
				wb_dirty_limits(mdtc);

				if ((current->flags & PF_LOCAL_THROTTLE) &&
				    mdtc->wb_dirty <
				    dirty_freerun_ceiling(mdtc->wb_thresh,
							  mdtc->wb_bg_thresh))
					/*
					 * LOCAL_THROTTLE tasks must not be
					 * throttled when below the per-wb
					 * freerun ceiling.
					 */
					goto free_running;
			}
			dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
				((mdtc->dirty > mdtc->thresh) || strictlimit);

			wb_position_ratio(mdtc);
			if (mdtc->pos_ratio < gdtc->pos_ratio)
				sdtc = mdtc;
		}

		if (dirty_exceeded && !wb->dirty_exceeded)
			wb->dirty_exceeded = 1;

		// update bandwidth and dirty_ratelimit once every 200ms
		// bandwidth is the rate at which the wb writes back to disk
		if (time_is_before_jiffies(wb->bw_time_stamp +
					   BANDWIDTH_INTERVAL)) {
			spin_lock(&wb->list_lock);
			// wb_update_dirty_ratelimit: update dirty_ratelimit
			// wb_update_write_bandwidth: update bandwidth
			__wb_update_bandwidth(gdtc, mdtc, start_time, true);
			spin_unlock(&wb->list_lock);
		}

		/* throttle according to the chosen dtc */
		// balanced_ratelimit = write_bandwidth / N
		// dirty_ratelimit: throttle on the wb's dirty page growth rate, roughly
		//   disk write bandwidth / N, where N is the number of processes writing to this wb;
		// pos_ratio: control ratio used to keep the dirty level near setpoint;
		// setpoint: the target level to bring the dirty page count to
		// = 1.0 if at setpoint
		// > 1.0 if below setpoint
		// < 1.0 if above setpoint
		dirty_ratelimit = wb->dirty_ratelimit;
		task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
							RATELIMIT_CALC_SHIFT;
		max_pause = wb_max_pause(wb, sdtc->wb_dirty);
		min_pause = wb_min_pause(wb, max_pause,
					 task_ratelimit, dirty_ratelimit,
					 &nr_dirtied_pause);

		if (unlikely(task_ratelimit == 0)) {
			period = max_pause;
			pause = max_pause;
			goto pause;
		}
		period = HZ * pages_dirtied / task_ratelimit;
		pause = period;
		// dirty_paused_when is the time of the last balance_dirty_pages call:
		// the shorter the time since that call, the longer the pause;
		// the longer the time since that call, the shorter the pause.
		if (current->dirty_paused_when)
			pause -= now - current->dirty_paused_when;
		/*
		 * For less than 1s think time (ext3/4 may block the dirtier
		 * for up to 800ms from time to time on 1-HDD; so does xfs,
		 * however at much less frequency), try to compensate it in
		 * future periods by updating the virtual time; otherwise just
		 * do a reset, as it may be a light dirtier.
		 */
		if (pause < min_pause) {
			trace_balance_dirty_pages(wb,
						  sdtc->thresh,
						  sdtc->bg_thresh,
						  sdtc->dirty,
						  sdtc->wb_thresh,
						  sdtc->wb_dirty,
						  dirty_ratelimit,
						  task_ratelimit,
						  pages_dirtied,
						  period,
						  min(pause, 0L),
						  start_time);
			if (pause < -HZ) {
				// too long since the last balance_dirty_pages call: do not pause
				current->dirty_paused_when = now;
				// at this point background writeback has already been started;
				// clear current->nr_dirtied (the later clears follow the same logic)
				current->nr_dirtied = 0;
			} else if (period) {
				// pause -= now - current->dirty_paused_when;
				// => pause = period - now + current->dirty_paused_when
				// => current->dirty_paused_when + period = now + pause;
				// so:
				// current->dirty_paused_when += period
				// is equivalent to current->dirty_paused_when = now + pause;
				// defer the pause to the next call
				current->dirty_paused_when += period;
				current->nr_dirtied = 0;
			} else if (current->nr_dirtied_pause <= pages_dirtied)
				// raise the threshold for calling balance_dirty_pages
				current->nr_dirtied_pause += pages_dirtied;
			break;
		}
		if (unlikely(pause > max_pause)) {
			/* for occasional dropped task_ratelimit */
			now += min(pause - max_pause, max_pause);
			pause = max_pause;
		}

pause:
		trace_balance_dirty_pages(wb,
					  sdtc->thresh,
					  sdtc->bg_thresh,
					  sdtc->dirty,
					  sdtc->wb_thresh,
					  sdtc->wb_dirty,
					  dirty_ratelimit,
					  task_ratelimit,
					  pages_dirtied,
					  period,
					  pause,
					  start_time);
		__set_current_state(TASK_KILLABLE);
		wb->dirty_sleep = now;
		io_schedule_timeout(pause);

		current->dirty_paused_when = now + pause;
		current->nr_dirtied = 0;
		current->nr_dirtied_pause = nr_dirtied_pause;

		/*
		 * This is typically equal to (dirty < thresh) and can also
		 * keep "1000+ dd on a slow USB stick" under control.
		 */
		// break out of the for loop
		if (task_ratelimit)
			break;

		/*
		 * In the case of an unresponding NFS server and the NFS dirty
		 * pages exceeds dirty_thresh, give the other good wb's a pipe
		 * to go through, so that tasks on them still remain responsive.
		 *
		 * In theory 1 page is enough to keep the consumer-producer
		 * pipe going: the flusher cleans 1 page => the task dirties 1
		 * more page. However wb_dirty has accounting errors.  So use
		 * the larger and more IO friendly wb_stat_error.
		 */
		if (sdtc->wb_dirty <= wb_stat_error()) // wb_stat_error() = 256
			break;

		if (fatal_signal_pending(current))
			break;
	} // end for

	if (!dirty_exceeded && wb->dirty_exceeded)
		wb->dirty_exceeded = 0;

	if (writeback_in_progress(wb))
		return;

	/*
	 * In laptop mode, we wait until hitting the higher threshold before
	 * starting background writeout, and then write out all the way down
	 * to the lower threshold.  So slow writers cause minimal disk activity.
	 *
	 * In normal mode, we start background writeout at the lower
	 * background_thresh, to keep the amount of dirty memory low.
	 */
	if (laptop_mode)
		return;

	if (nr_reclaimable > gdtc->bg_thresh)
		wb_start_background_writeback(wb);
}

// compute wb_dirty, wb_thresh, wb_bg_thresh
static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
{
	struct bdi_writeback *wb = dtc->wb;
	unsigned long wb_reclaimable;

	/*
	 * wb_thresh is not treated as some limiting factor as
	 * dirty_thresh, due to reasons
	 * - in JBOD setup, wb_thresh can fluctuate a lot
	 * - in a system with HDD and USB key, the USB key may somehow
	 *   go into state (wb_dirty >> wb_thresh) either because
	 *   wb_dirty starts high, or because wb_thresh drops low.
	 *   In this case we don't want to hard throttle the USB key
	 *   dirtiers for 100 seconds until wb_dirty drops under
	 *   wb_thresh. Instead the auxiliary wb control line in
	 *   wb_position_ratio() will let the dirtier task progress
	 *   at some rate <= (write_bw / 2) for bringing down wb_dirty.
	 */
	// wb_thresh = thresh * pages this wb has written back / total pages written back in the global domain
	dtc->wb_thresh = __wb_calc_thresh(dtc);
	dtc->wb_bg_thresh = dtc->thresh ?
		div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

	/*
	 * In order to avoid the stacked BDI deadlock we need
	 * to ensure we accurately count the 'dirty' pages when
	 * the threshold is low.
	 *
	 * Otherwise it would be possible to get thresh+n pages
	 * reported dirty, even though there are thresh-m pages
	 * actually dirty; with m+n sitting in the percpu
	 * deltas.
	 */
	// when wb_thresh is small, wb_dirty must be accurate
	if (dtc->wb_thresh < 2 * wb_stat_error()) { // wb_stat_error() = 256
		// wb_stat reads percpu_counter.count;
		// wb_stat_sum reads percpu_counter.count plus the per-cpu counters not yet folded into it.
		// percpu_counter keeps a variable per cpu; updates normally go to the per-cpu variable
		// and are folded into the global count once they exceed WB_STAT_BATCH,
		// so wb_stat_sum is more accurate, but also more expensive;
		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
		dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
	} else {
		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
		dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
	}
}

// compute wb_thresh
static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
{
	struct wb_domain *dom = dtc_dom(dtc);
	unsigned long thresh = dtc->thresh;
	u64 wb_thresh;
	unsigned long numerator, denominator;
	unsigned long wb_min_ratio, wb_max_ratio;

	/*
	 * Calculate this BDI's share of the thresh ratio.
	 */
	// dom->completions: total pages written back in the domain; with cgroup
	// writeback disabled, dom is the global domain
	// dtc->wb_completions: pages this wb has written back; all wbs belong to the global domain
	fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
			      &numerator, &denominator);

	// deduct the sum of all bdi devices' min_ratio
	wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
	// numerator/denominator: this wb's share of the total pages written back
	wb_thresh *= numerator;
	wb_thresh = div64_ul(wb_thresh, denominator);

	wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);

	wb_thresh += (thresh * wb_min_ratio) / 100;
	// do not exceed the bdi device's max_ratio
	if (wb_thresh > (thresh * wb_max_ratio) / 100)
		wb_thresh = thresh * wb_max_ratio / 100;

	return wb_thresh;
}

Computing write_bandwidth

static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
				  struct dirty_throttle_control *mdtc,
				  unsigned long start_time,
				  bool update_ratelimit)
{
	struct bdi_writeback *wb = gdtc->wb;
	unsigned long now = jiffies;
	unsigned long elapsed = now - wb->bw_time_stamp;
	unsigned long dirtied;
	unsigned long written;

	lockdep_assert_held(&wb->list_lock);

	/*
	 * rate-limit, only update once every 200ms.
	 */
	if (elapsed < BANDWIDTH_INTERVAL)
		return;

	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);

	/*
	 * Skip quiet periods when disk bandwidth is under-utilized.
	 * (at least 1s idle time between two flusher runs)
	 */
	// wb->bw_time_stamp: time of the last __wb_update_bandwidth call
	// start_time: when balance_dirty_pages was entered, or when the flusher started running
	// if more than 1s has passed since the last bandwidth update, the writeback
	// done in that window is not counted toward the bandwidth
	if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
		goto snapshot;

	if (update_ratelimit) {
		domain_update_bandwidth(gdtc, now);
		wb_update_dirty_ratelimit(gdtc, dirtied, elapsed);

		/*
		 * @mdtc is always NULL if !CGROUP_WRITEBACK but the
		 * compiler has no way to figure that out.  Help it.
		 */
		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {
			domain_update_bandwidth(mdtc, now);
			// update ratelimit
			wb_update_dirty_ratelimit(mdtc, dirtied, elapsed);
		}
	}
	// update bandwidth
	wb_update_write_bandwidth(wb, elapsed, written);

snapshot:
	// record the current dirtied and written page counts, and this call's timestamp
	wb->dirtied_stamp = dirtied;
	wb->written_stamp = written;
	wb->bw_time_stamp = now;
}

// compute the writeback bandwidth, in pages/s
static void wb_update_write_bandwidth(struct bdi_writeback *wb,
				      unsigned long elapsed,
				      unsigned long written)
{
	const unsigned long period = roundup_pow_of_two(3 * HZ);
	unsigned long avg = wb->avg_write_bandwidth;
	unsigned long old = wb->write_bandwidth;
	u64 bw;

	/*
	 * bw = written * HZ / elapsed
	 *
	 *                   bw * elapsed + write_bandwidth * (period - elapsed)
	 * write_bandwidth = ---------------------------------------------------
	 *                                          period
	 *
	 * @written may have decreased due to account_page_redirty().
	 * Avoid underflowing @bw calculation.
	 */
	// bw is the bandwidth over the elapsed window
	// write_bandwidth is bw weighted against the previous bandwidth over a 3 s period
	bw = written - min(written, wb->written_stamp);
	bw *= HZ;
	if (unlikely(elapsed > period)) {
		bw = div64_ul(bw, elapsed);
		avg = bw;
		goto out;
	}
	bw += (u64)wb->write_bandwidth * (period - elapsed);
	bw >>= ilog2(period);

	/*
	 * one more level of smoothing, for filtering out sudden spikes
	 */
	// smoothing: filter out sudden spikes
	if (avg > old && old >= (unsigned long)bw)
		avg -= (avg - old) >> 3;

	if (avg < old && old <= (unsigned long)bw)
		avg += (old - avg) >> 3;

out:
	/* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */
	avg = max(avg, 1LU);
	if (wb_has_dirty_io(wb)) {
		long delta = avg - wb->avg_write_bandwidth;
		WARN_ON_ONCE(atomic_long_add_return(delta,
					&wb->bdi->tot_write_bandwidth) <= 0);
	}
	wb->write_bandwidth = bw;
	wb->avg_write_bandwidth = avg;
}

Computing pos_ratio

pos_ratio adjusts the dirty page level, steering it toward the ideal value setpoint, where setpoint = (freerun + limit) / 2.
Since task_ratelimit = pos_ratio * dirty_ratelimit, we have:
if (dirty < setpoint) scale up pos_ratio
if (dirty > setpoint) scale down pos_ratio
if (dirty == setpoint) pos_ratio = 1
Concretely, two functions compute global_pos_ratio for the global domain and wb_pos_ratio for the wb.
The final pos_ratio = global_pos_ratio * wb_pos_ratio (setting the strictlimit case aside for now).

  1. global
                           setpoint-dirty    3
        f(dirty) := 1.0 + (----------------)
                            limit-setpoint

A cubic function; its curve:

/*
 * (o) global control line
 *
 *     ^ global_pos_ratio
 *     |
 *     |            |<===== global dirty control scope ======>|
 * 2.0 .............*
 *     |            .*
 *     |            . *
 *     |            .   *
 *     |            .     *
 *     |            .        *
 *     |            .            *
 * 1.0 ................................*
 *     |            .                  .     *
 *     |            .                  .          *
 *     |            .                  .              *
 *     |            .                  .                 *
 *     |            .                  .                    *
 *   0 +------------.------------------.----------------------*------------->
 *           freerun^          setpoint^                 limit^   dirty pages
 *
*/

This is implemented in pos_ratio_polynom, with the properties:
(1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
(2) f(setpoint) = 1.0 => the balance point
(3) f(limit) = 0 => the hard limit
(4) df/dx <= 0 => negative feedback control
(5) the closer to setpoint, the smaller |df/dx| (and the reverse)
=> fast response on large errors; small oscillation near setpoint
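A direct Python transcription of the cubic (fixed point with `RATELIMIT_CALC_SHIFT = 10`, so 1.0 is 1024; note that Python's `//` and `>>` floor toward negative infinity while the kernel's division truncates, so results can differ by one unit away from the anchor points):

```python
RATELIMIT_CALC_SHIFT = 10  # 1.0 is represented as 1 << 10 == 1024

def pos_ratio_polynom(setpoint, dirty, limit):
    """Cubic control line: f(freerun) ~ 2.0, f(setpoint) = 1.0, f(limit) ~ 0."""
    # (setpoint - dirty) / (limit - setpoint) in fixed point; | 1 avoids div-by-zero
    x = ((setpoint - dirty) << RATELIMIT_CALC_SHIFT) // ((limit - setpoint) | 1)
    pos_ratio = x
    pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT
    pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT  # now x**3 in fixed point
    pos_ratio += 1 << RATELIMIT_CALC_SHIFT             # 1.0 + (...)^3
    return max(0, min(pos_ratio, 2 << RATELIMIT_CALC_SHIFT))
```

With setpoint=1000 and limit=2000 (so freerun=0): f(setpoint) is exactly 1024 (1.0), f(limit) clamps to nearly 0, and f(freerun) approaches 2048 (2.0), matching properties (1)–(3) above.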

  2. wb
        f(wb_dirty) := 1.0 + k * (wb_dirty - wb_setpoint)
                        x_intercept - wb_dirty
                    := --------------------------
                        x_intercept - wb_setpoint

A linear function; its curve:

/*
 * (o) wb control line
 *
 *     ^ wb_pos_ratio
 *     |
 *     |            *
 *     |              *
 *     |                *
 *     |                  *
 *     |                    * |<=========== span ============>|
 * 1.0 .......................*
 *     |                      . *
 *     |                      .   *
 *     |                      .     *
 *     |                      .       *
 *     |                      .         *
 *     |                      .           *
 *     |                      .             *
 *     |                      .               *
 *     |                      .                 *
 *     |                      .                   *
 *     |                      .                     *
 * 1/4 ...............................................* * * * * * * * * * * *
 *     |                      .                         .
 *     |                      .                           .
 *     |                      .                             .
 *   0 +----------------------.-------------------------------.------------->
 *                wb_setpoint^                    x_intercept^
 */

x_intercept is the control line's x-axis intercept:
x_intercept = wb_setpoint + span


        wb_thresh                    thresh - wb_thresh
 span = --------- * (8 * write_bw) + ------------------ * wb_thresh
         thresh                           thresh

When the global domain has only one wb, span = 8 * write_bw;
in the JBOD case, span = wb_thresh.
JBOD ("just a bunch of disks") chains several physical disks into one large logical disk.
The smaller the span, the steeper the slope and the faster wb_dirty is driven back toward the setpoint; the writeback bandwidth write_bw is the dominant factor in span.

wb_pos_ratio >= 1/4
Why this floor?
Without the strictlimit capability, the wb control need not be harsh: as long as the global domain has headroom,
a single wb whose dirty level exceeds its threshold can still be brought down gradually.
If pos_ratio dropped very low, or even to 0, tasks would be blocked frequently and for long stretches, which badly hurts user experience.
And since the final pos_ratio = global_pos_ratio * wb_pos_ratio, a runaway global level is still caught by the global factor.
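The linear wb control line and its 1/4 floor can be sketched with plain floats (the kernel works in fixed point and multiplies this factor into the global pos_ratio; the standalone function and float return are this sketch's own simplifications):

```python
def wb_control_line(wb_dirty, wb_setpoint, span):
    """Linear wb control line, floored at 1/4 (non-strictlimit path)."""
    x_intercept = wb_setpoint + span
    if wb_dirty < x_intercept - span // 4:
        # f(wb_dirty) = (x_intercept - wb_dirty) / (x_intercept - wb_setpoint)
        return (x_intercept - wb_dirty) / (x_intercept - wb_setpoint)
    return 0.25  # never throttle the wb factor below 1/4
```

At wb_dirty == wb_setpoint the factor is 1.0; halfway across the span it is 0.5; beyond three quarters of the span it stays pinned at 0.25, so the wb factor alone can never stall a task completely.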

  • Code implementation
/*
 * Dirty position control.
 *
 * (o) global/bdi setpoints
 *
 * We want the dirty pages be balanced around the global/wb setpoints.
 * When the number of dirty pages is higher/lower than the setpoint, the
 * dirty position control ratio (and hence task dirty ratelimit) will be
 * decreased/increased to bring the dirty pages back to the setpoint.
 *
 *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
 *
 *     if (dirty < setpoint) scale up   pos_ratio
 *     if (dirty > setpoint) scale down pos_ratio
 *
 *     if (wb_dirty < wb_setpoint) scale up   pos_ratio
 *     if (wb_dirty > wb_setpoint) scale down pos_ratio
 *
 *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
 *
 * (o) global control line
 *
 *     ^ pos_ratio
 *     |
 *     |            |<===== global dirty control scope ======>|
 * 2.0 .............*
 *     |            .*
 *     |            . *
 *     |            .   *
 *     |            .     *
 *     |            .        *
 *     |            .            *
 * 1.0 ................................*
 *     |            .                  .     *
 *     |            .                  .          *
 *     |            .                  .              *
 *     |            .                  .                 *
 *     |            .                  .                    *
 *   0 +------------.------------------.----------------------*------------->
 *           freerun^          setpoint^                 limit^   dirty pages
 *
 * (o) wb control line
 *
 *     ^ pos_ratio
 *     |
 *     |            *
 *     |              *
 *     |                *
 *     |                  *
 *     |                    * |<=========== span ============>|
 * 1.0 .......................*
 *     |                      . *
 *     |                      .   *
 *     |                      .     *
 *     |                      .       *
 *     |                      .         *
 *     |                      .           *
 *     |                      .             *
 *     |                      .               *
 *     |                      .                 *
 *     |                      .                   *
 *     |                      .                     *
 * 1/4 ...............................................* * * * * * * * * * * *
 *     |                      .                         .
 *     |                      .                           .
 *     |                      .                             .
 *   0 +----------------------.-------------------------------.------------->
 *                wb_setpoint^                    x_intercept^
 *
 * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can
 * be smoothly throttled down to normal if it starts high in situations like
 * - start writing to a slow SD card and a fast disk at the same time. The SD
 *   card's wb_dirty may rush to many times higher than wb_setpoint. // i.e. avoid a hard cutoff that would keep the task continuously paused
 * - the wb dirty thresh drops quickly due to change of JBOD workload
 */
static void wb_position_ratio(struct dirty_throttle_control *dtc)
{
	struct bdi_writeback *wb = dtc->wb;
	unsigned long write_bw = wb->avg_write_bandwidth;
	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
	unsigned long wb_thresh = dtc->wb_thresh;
	unsigned long x_intercept;
	unsigned long setpoint;		/* dirty pages' target balance point */
	unsigned long wb_setpoint;
	unsigned long span;
	long long pos_ratio;		/* for scaling up/down the rate limit */
	long x;

	dtc->pos_ratio = 0;

	if (unlikely(dtc->dirty >= limit))
		return;

	/*
	 * global setpoint
	 *
	 * See comment for pos_ratio_polynom().
	 */
	setpoint = (freerun + limit) / 2;
	pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit);

	/*
	 * The strictlimit feature is a tool preventing mistrusted filesystems
	 * from growing a large number of dirty pages before throttling. For
	 * such filesystems balance_dirty_pages always checks wb counters
	 * against wb limits. Even if global "nr_dirty" is under "freerun".
	 * This is especially important for fuse which sets bdi->max_ratio to
	 * 1% by default. Without strictlimit feature, fuse writeback may
	 * consume arbitrary amount of RAM because it is accounted in
	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
	 *
	 * Here, in wb_position_ratio(), we calculate pos_ratio based on
	 * two values: wb_dirty and wb_thresh. Let's consider an example:
	 * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
	 * limits are set by default to 10% and 20% (background and throttle).
	 * Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
	 * wb_calc_thresh(wb, bg_thresh) is about ~4K pages. wb_setpoint is
	 * about ~6K pages (as the average of background and throttle wb
	 * limits). The 3rd order polynomial will provide positive feedback if
	 * wb_dirty is under wb_setpoint and vice versa.
	 *
	 * Note, that we cannot use global counters in these calculations
	 * because we want to throttle process writing to a strictlimit wb
	 * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
	 * in the example above).
	 */
	// fuse consumes extra memory while handling writeback; it is accounted to NR_WRITEBACK_TEMP, not nr_dirty
	// gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); NR_WRITEBACK_TEMP is excluded
	// so fuse is limited more strictly: once dtc->wb_dirty >= wb_thresh, dtc->pos_ratio is set to 0 and the task is force-paused,
	// even when the global dirty count is still below freerun.
	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
		long long wb_pos_ratio;

		if (dtc->wb_dirty < 8) {
			dtc->pos_ratio = min_t(long long, pos_ratio * 2,
					   2 << RATELIMIT_CALC_SHIFT);
			return;
		}

		if (dtc->wb_dirty >= wb_thresh)
			return;

		wb_setpoint = dirty_freerun_ceiling(wb_thresh,
						    dtc->wb_bg_thresh);

		if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
			return;

		wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty,
						 wb_thresh);

		/*
		 * Typically, for strictlimit case, wb_setpoint << setpoint
		 * and pos_ratio >> wb_pos_ratio. In the other words global
		 * state ("dirty") is not limiting factor and we have to
		 * make decision based on wb counters. But there is an
		 * important case when global pos_ratio should get precedence:
		 * global limits are exceeded (e.g. due to activities on other
		 * wb's) while given strictlimit wb is below limit.
		 *
		 * "pos_ratio * wb_pos_ratio" would work for the case above,
		 * but it would look too non-natural for the case of all
		 * activity in the system coming from a single strictlimit wb
		 * with bdi->max_ratio == 100%.
		 *
		 * Note that min() below somewhat changes the dynamics of the
		 * control system. Normally, pos_ratio value can be well over 3
		 * (when globally we are at freerun and wb is well below wb
		 * setpoint). Now the maximum pos_ratio in the same situation
		 * is 2. We might want to tweak this if we observe the control
		 * system is too slow to adapt.
		 */
		dtc->pos_ratio = min(pos_ratio, wb_pos_ratio);
		return;
	}

	/*
	 * We have computed basic pos_ratio above based on global situation. If
	 * the wb is over/under its share of dirty pages, we want to scale
	 * pos_ratio further down/up. That is done by the following mechanism.
	 */

	/*
	 * wb setpoint
	 *
	 *        f(wb_dirty) := 1.0 + k * (wb_dirty - wb_setpoint)
	 *
	 *                        x_intercept - wb_dirty
	 *                     := --------------------------
	 *                        x_intercept - wb_setpoint
	 *
	 * The main wb control line is a linear function that subjects to
	 *
	 * (1) f(wb_setpoint) = 1.0
	 * (2) k = - 1 / (8 * write_bw)  (in single wb case)
	 *     or equally: x_intercept = wb_setpoint + 8 * write_bw
	 *
	 * For single wb case, the dirty pages are observed to fluctuate
	 * regularly within range
	 *        [wb_setpoint - write_bw/2, wb_setpoint + write_bw/2]
	 * for various filesystems, where (2) can yield in a reasonable 12.5%
	 * fluctuation range for pos_ratio.
	 *
	 * For JBOD case, wb_thresh (not wb_dirty!) could fluctuate up to its
	 * own size, so move the slope over accordingly and choose a slope that
	 * yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh.
	 */
	if (unlikely(wb_thresh > dtc->thresh))
		wb_thresh = dtc->thresh;
	/*
	 * It's very possible that wb_thresh is close to 0 not because the
	 * device is slow, but that it has remained inactive for long time.
	 * Honour such devices a reasonable good (hopefully IO efficient)
	 * threshold, so that the occasional writes won't be blocked and active
	 * writes can rampup the threshold quickly.
	 */
	// as noted here, wb_thresh may well be 0 because writes have been inactive, not because the wb itself is slow
	wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
	/*
	 * scale global setpoint to wb's:
	 *	wb_setpoint = setpoint * wb_thresh / thresh // scale wb_setpoint by the ratio wb_thresh / thresh
	 */
	x = div_u64((u64)wb_thresh << 16, dtc->thresh | 1);
	wb_setpoint = setpoint * (u64)x >> 16;
	/*
	 * Use span=(8*write_bw) in single wb case as indicated by
	 * (thresh - wb_thresh ~= 0) and transit to wb_thresh in JBOD case.
	 *
	 *        wb_thresh                    thresh - wb_thresh
	 * span = --------- * (8 * write_bw) + ------------------ * wb_thresh
	 *         thresh                           thresh
	 */
	// build span: with a single wb in the system, wb_thresh == thresh => span = 8 * write_bw
	span = (dtc->thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16;
	x_intercept = wb_setpoint + span; // x_intercept: where the control line crosses the x axis

	if (dtc->wb_dirty < x_intercept - span / 4) {
		pos_ratio = div64_u64(pos_ratio * (x_intercept - dtc->wb_dirty),
				      (x_intercept - wb_setpoint) | 1);
	} else
		pos_ratio /= 4;

	/*
	 * wb reserve area, safeguard against dirty pool underrun and disk idle
	 * It may push the desired control point of global dirty pages higher
	 * than setpoint.
	 */
	/*
	The comment above explains that the wb reserve area exists to guard against the dirty
	pool running dry and the disk going idle. To that end, the desired control point for
	global dirty pages may be pushed above the setpoint, keeping enough dirty pages buffered
	to sustain the disk's write stream and overall throughput.
	*/
	// for wb_dirty < wb_thresh / 2, scale pos_ratio up by a factor in (1, 8] to keep some dirty pages around, which helps performance;
	x_intercept = wb_thresh / 2;
	if (dtc->wb_dirty < x_intercept) {
		if (dtc->wb_dirty > x_intercept / 8)
			// pos_ratio = pos_ratio * (1, 8)
			pos_ratio = div_u64(pos_ratio * x_intercept,
					    dtc->wb_dirty);
		else
			pos_ratio *= 8;
	}

	dtc->pos_ratio = pos_ratio;
}

Computing dirty_ratelimit

Core formulas:
dirty_rate = N * task_ratelimit => N = dirty_rate / task_ratelimit
balanced_dirty_ratelimit = write_bw / N
=> balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
then filter out singular points and smooth
=> dirty_ratelimit
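A worked example of the chain above (all numbers are assumed purely for illustration):

```python
# Suppose 4 writer tasks, each currently throttled to task_ratelimit pages/s,
# while the device writes back write_bw pages/s.
task_ratelimit = 2000            # pages/s per task (assumed)
dirty_rate = 4 * task_ratelimit  # measured: N * task_ratelimit = 8000 pages/s
write_bw = 4000                  # pages/s (assumed)

# N = dirty_rate / task_ratelimit = 4
# balanced_dirty_ratelimit = write_bw / N = task_ratelimit * write_bw / dirty_rate
balanced_dirty_ratelimit = task_ratelimit * write_bw // dirty_rate
assert balanced_dirty_ratelimit == write_bw // 4  # 1000 pages/s, i.e. write_bw / N
```

So each of the 4 tasks ends up with a fair 1/4 share of the device's writeback bandwidth.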

/*
 * Maintain wb->dirty_ratelimit, the base dirty throttle rate.
 *
 * Normal wb tasks will be curbed at or below it in long term.
 * Obviously it should be around (write_bw / N) when there are N dd tasks.
 */
static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
				      unsigned long dirtied,
				      unsigned long elapsed)
{
	struct bdi_writeback *wb = dtc->wb;
	unsigned long dirty = dtc->dirty;
	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
	unsigned long setpoint = (freerun + limit) / 2;
	unsigned long write_bw = wb->avg_write_bandwidth;
	unsigned long dirty_ratelimit = wb->dirty_ratelimit;
	unsigned long dirty_rate;
	unsigned long task_ratelimit;
	unsigned long balanced_dirty_ratelimit;
	unsigned long step;
	unsigned long x;
	unsigned long shift;

	/*
	 * The dirty rate will match the writeout rate in long term, except
	 * when dirty pages are truncated by userspace or re-dirtied by FS.
	 */
	dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;

	/*
	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
	 */
	task_ratelimit = (u64)dirty_ratelimit *
					dtc->pos_ratio >> RATELIMIT_CALC_SHIFT;
	task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */

	/*
	 * A linear estimation of the "balanced" throttle rate. The theory is,
	 * if there are N dd tasks, each throttled at task_ratelimit, the wb's
	 * dirty_rate will be measured to be (N * task_ratelimit). So the below
	 * formula will yield the balanced rate limit (write_bw / N).
	 *
	 * Note that the expanded form is not a pure rate feedback:
	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate)		     (1) // rate_ is dirty_ratelimit
	 * but also takes pos_ratio into account:
	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
	 *
	 * (1) is not realistic because pos_ratio also takes part in balancing
	 * the dirty rate.  Consider the state
	 *	pos_ratio = 0.5						     (3)
	 *	rate = 2 * (write_bw / N)				     (4)
	 * If (1) is used, it will stuck in that state! Because each dd will
	 * be throttled at
	 *	task_ratelimit = pos_ratio * rate = (write_bw / N)	     (5)
	 * yielding
	 *	dirty_rate = N * task_ratelimit = write_bw		     (6)
	 * put (6) into (1) we get
	 *	rate_(i+1) = rate_(i)					     (7)
	 *
	 * 
	 * So we end up using (2) to always keep
	 *	rate_(i+1) ~= (write_bw / N)				     (8)
	 * regardless of the value of pos_ratio. As long as (8) is satisfied,
	 * pos_ratio is able to drive itself to 1.0, which is not only where
	 * the dirty count meet the setpoint, but also where the slope of
	 * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
	 */
	// *** core formulas ***:
	// dirty_rate = N * task_ratelimit => N = dirty_rate / task_ratelimit
	// balanced_dirty_ratelimit = write_bw / N
	// => balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
					   dirty_rate | 1);
	/*
	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
	 */
	if (unlikely(balanced_dirty_ratelimit > write_bw))
		balanced_dirty_ratelimit = write_bw;

	/*
	 * We could safely do this and return immediately:
	 *
	 *	wb->dirty_ratelimit = balanced_dirty_ratelimit;
	 *
	 * However to get a more stable dirty_ratelimit, the below elaborated
	 * code makes use of task_ratelimit to filter out singular points and
	 * limit the step size.
	 *
	 * The below code essentially only uses the relative value of
	 *
	 *	task_ratelimit - dirty_ratelimit
	 *	= (pos_ratio - 1) * dirty_ratelimit
	 *
	 * which reflects the direction and size of dirty position error.
	 */

	/*
	 * dirty_ratelimit will follow balanced_dirty_ratelimit iff
	 * task_ratelimit is on the same side of dirty_ratelimit, too.
	 * For example, when
	 * - dirty_ratelimit > balanced_dirty_ratelimit
	 * - dirty_ratelimit > task_ratelimit (dirty pages are above setpoint)
	 * lowering dirty_ratelimit will help meet both the position and rate
	 * control targets. Otherwise, don't update dirty_ratelimit if it will
	 * only help meet the rate target. After all, what the users ultimately
	 * feel and care are stable dirty rate and small position error.
	 *
	 * |task_ratelimit - dirty_ratelimit| is used to limit the step size
	 * and filter out the singular points of balanced_dirty_ratelimit. Which
	 * keeps jumping around randomly and can even leap far away at times
	 * due to the small 200ms estimation period of dirty_rate (we want to
	 * keep that period small to reduce time lags).
	 */
	step = 0;

	/*
	 * For strictlimit case, calculations above were based on wb counters
	 * and limits (starting from pos_ratio = wb_position_ratio() and up to
	 * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
	 * Hence, to calculate "step" properly, we have to use wb_dirty as
	 * "dirty" and wb_setpoint as "setpoint".
	 *
	 * We rampup dirty_ratelimit forcibly if wb_dirty is low because
	 * it's possible that wb_thresh is close to zero due to inactivity
	 * of backing device.
	 */
	// the strictlimit case computes its own setpoint;
	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
		dirty = dtc->wb_dirty;
		if (dtc->wb_dirty < 8)
			setpoint = dtc->wb_dirty + 1;
		else
			setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
	}

	// filter out balanced_dirty_ratelimit samples that stray too far.
	// when dirty < setpoint, the goal is to let dirty grow toward setpoint, so take the minimum x of
	// wb->balanced_dirty_ratelimit, balanced_dirty_ratelimit and task_ratelimit; only if x > dirty_ratelimit
	// is the update worthwhile, and step records how much to raise dirty_ratelimit by later.
	// if x <= dirty_ratelimit, skip the update and leave dirty_ratelimit alone.
	if (dirty < setpoint) {
		x = min3(wb->balanced_dirty_ratelimit,
			 balanced_dirty_ratelimit, task_ratelimit);
		if (dirty_ratelimit < x)
			step = x - dirty_ratelimit;
	} else {
		x = max3(wb->balanced_dirty_ratelimit,
			 balanced_dirty_ratelimit, task_ratelimit);
		if (dirty_ratelimit > x)
			step = dirty_ratelimit - x;
	}

	/*
	 * Don't pursue 100% rate matching. It's impossible since the balanced
	 * rate itself is constantly fluctuating. So decrease the track speed
	 * when it gets close to the target. Helps eliminate pointless tremors.
	 */
	// bound the step size: drop tiny steps, shrink large ones
	shift = dirty_ratelimit / (2 * step + 1);
	if (shift < BITS_PER_LONG)
		step = DIV_ROUND_UP(step >> shift, 8);
	else
		step = 0;

	if (dirty_ratelimit < balanced_dirty_ratelimit)
		dirty_ratelimit += step;
	else
		dirty_ratelimit -= step;

	wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
	wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;

	trace_bdi_dirty_ratelimit(wb, dirty_rate, task_ratelimit);
}
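The step filtering at the end of this function condenses to the following sketch (names follow the kernel's locals; `BITS_PER_LONG = 64` is assumed):

```python
BITS_PER_LONG = 64  # assumed 64-bit build

def next_dirty_ratelimit(dirty_ratelimit, balanced_prev, balanced_now,
                         task_ratelimit, dirty, setpoint):
    """Sketch of the step logic at the end of wb_update_dirty_ratelimit."""
    step = 0
    if dirty < setpoint:
        # only raise the limit when all three estimates agree it is too low
        x = min(balanced_prev, balanced_now, task_ratelimit)
        if dirty_ratelimit < x:
            step = x - dirty_ratelimit
    else:
        # only lower it when all three estimates agree it is too high
        x = max(balanced_prev, balanced_now, task_ratelimit)
        if dirty_ratelimit > x:
            step = dirty_ratelimit - x
    # shrink the step as the limit approaches its target; drop tiny steps entirely
    shift = dirty_ratelimit // (2 * step + 1)
    step = ((step >> shift) + 7) // 8 if shift < BITS_PER_LONG else 0
    if dirty_ratelimit < balanced_now:
        return max(dirty_ratelimit + step, 1)
    return max(dirty_ratelimit - step, 1)
```

For example, with the current limit at 1000 pages/s and all three estimates at 2000, only an eighth of the 1000-page gap is applied per update, so the limit creeps up instead of jumping.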

strictlimit feature

strictlimit feature patch:
https://github.com/torvalds/linux/commit/5a53748568f79641eaf40e41081a2f4987f005c2

The gist of the commit message:
fuse is a mistrusted filesystem. The fuse daemon is implemented by users in userspace; if the daemon is buggy or misbehaves,
dirty pages cannot be written back in time, and may pin a large amount of memory that can never be reclaimed.
Two problems in particular:

  1. When fuse handles writeback it accounts pages to NR_WRITEBACK_TEMP (code: https://elixir.bootlin.com/linux/v5.10.225/source/fs/fuse/file.c),
    while balance_dirty_pages computes dirty = NR_FILE_DIRTY + NR_WRITEBACK, so nobody looks at NR_WRITEBACK_TEMP.
  2. balance_dirty_pages only throttles a task once the system-wide dirty count reaches freerun, which is too late: writing to the page cache is fast, and by then fuse already holds a lot of memory.
    If the device's IO is also slow, memory reclaim or umount then spends a long time writing it all back.
    The author's position: a mistrusted fs like fuse needs a strict limit so that throttling kicks in as early as possible.

One open question: why does fuse account its writeback pages to NR_WRITEBACK_TEMP separately instead of plain NR_WRITEBACK?

  • I found no official explanation. My take: writeback of fuse's dirty pages is largely at the mercy of the userspace fuse daemon, and users must own their daemon's behavior, so the accounting is kept separate.

The patch's main changes:

  • In balance_dirty_pages, dirty, thresh, bg_thresh and freerun are all taken from the wb.
    So even when global dirty < global freerun, the task is throttled as soon as wb dirty > wb freerun;

  • In wb_position_ratio, compute wb_pos_ratio:

// ...
static void wb_position_ratio(struct dirty_throttle_control *dtc)
{
  // ...
  dtc->pos_ratio = 0;
  if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
		long long wb_pos_ratio;

		if (dtc->wb_dirty < 8) {
			dtc->pos_ratio = min_t(long long, pos_ratio * 2,
					   2 << RATELIMIT_CALC_SHIFT);
			return;
		}
		// here, once wb dirty reaches wb thresh, dtc->pos_ratio = 0 and the task is throttled for max_pause
		if (dtc->wb_dirty >= wb_thresh)
			return;

		wb_setpoint = dirty_freerun_ceiling(wb_thresh,
						    dtc->wb_bg_thresh);

		if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
			return;
		// same cubic as the global case; wb_pos_ratio can get very low
		wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty,
						 wb_thresh);

		dtc->pos_ratio = min(pos_ratio, wb_pos_ratio);
		return;
	}
}
// ...
  • wb_update_dirty_ratelimit
// the strictlimit case computes its own setpoint;
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
	dirty = dtc->wb_dirty;
	if (dtc->wb_dirty < 8)
		setpoint = dtc->wb_dirty + 1;
	else
		setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
}

Resolving the fuse problem

Problem analysis

  1. The stack:
[<0>] __switch_to+0xdc/0x148
[<0>] balance_dirty_pages+0x688/0x6fc
[<0>] balance_dirty_pages_ratelimited+0xf0/0x23c
[<0>] generic_perform_write+0x14c/0x1b4
[<0>] __generic_file_write_iter+0x118/0x148
[<0>] generic_file_write_iter+0x5c/0xc4
[<0>] fuse_file_write_iter+0xc0/0x2f0 [fuse]
[<0>] new_sync_write+0x80/0xd8
[<0>] vfs_write+0x114/0x138
[<0>] ksys_write+0x78/0xe4
[<0>] __arm64_sys_write+0x20/0x2c
[<0>] el0_svc_common.constprop.0+0x13c/0x1ec
[<0>] do_el0_svc+0x8c/0x98
[<0>] el0_svc+0x20/0x30
[<0>] el0_sync_handler+0xb4/0x134
[<0>] el0_sync+0x1a0/0x1c0

Two conclusions follow:
(1) fuse is involved, page cache is enabled for it, and dirty page writeback was triggered
(2) the process is blocked in io_schedule_timeout, i.e. sleeping voluntarily

  2. ftrace
    balance_dirty_pages contains a tracepoint, so capture it with ftrace.
    pos_ratio turns out to be very small, occasionally 0, and pauses are frequent and long:
fio-7242    [000] .... 10129.625860: balance_dirty_pages: bdi 0:57: limit=786588 setpoint=687837 dirty=8 bdi_setpoint=0 bdi_dirty=8 dirty_ratelimit=5444 task_ratelimit=0 dirtied=8 dirtied_pause=12 paused=0 pause=30 period=30 think=916 cgroup_ino=1
fio-7242    [000] .... 10132.875744: balance_dirty_pages: bdi 0:57: limit=786588 setpoint=687832 dirty=60 bdi_setpoint=0 bdi_dirty=65 dirty_ratelimit=5444 task_ratelimit=0 dirtied=64 dirtied_pause=64 paused=0 pause=200 period=200 think=3200 cgroup_ino=1
fio-7242    [000] .... 10136.176212: balance_dirty_pages: bdi 0:57: limit=786588 setpoint=687818 dirty=67 bdi_setpoint=0 bdi_dirty=65 dirty_ratelimit=5444 task_ratelimit=0 dirtied=64 dirtied_pause=64 paused=0 pause=200 period=200 think=3096 cgroup_ino=1

The telltale signatures of this problem:
bdi_setpoint=0
task_ratelimit=0

Causes:
1.
Given the workload's behavior, the program keeps writing only small amounts of data, so the wb looks inactive and wb_thresh computes to 0,
=> in wb_position_ratio, dtc->wb_dirty >= wb_thresh
=> pos_ratio = 0
=> task_ratelimit = 0
=> pause = max_pause;
On top of that, the fuse daemon uses network storage, so when the network fluctuates the writeback bandwidth drops
=> current->nr_dirtied_pause is low
=> current->nr_dirtied >= ratelimit
=> balance_dirty_pages is entered frequently, throttling frequently
2.
The trace shows:
when bdi_dirty > 8, pause > 0. The strictlimit feature does special-case very low counts:
when bdi_dirty < 8, pos_ratio is set to 2 and the task is not blocked;
but "bdi_dirty < 8" is too small a carve-out.
Another question: why does bdi_dirty keep growing at all, when the writeback thread should already be running?
Because writeback to the backing store is too slow or too delayed, either a problem in the fuse daemon itself or latency further down its write path (e.g. over the network);

This also explains why the issue reproduces easily with fuse over a high-latency network.

Now revisit how wb_thresh is computed:

/**
 * __wb_calc_thresh - @wb's share of dirty throttling threshold
 * @dtc: dirty_throttle_context of interest
 *
 * Note that balance_dirty_pages() will only seriously take it as a hard limit
 * when sleeping max_pause per page is not enough to keep the dirty pages under
 * control. For example, when the device is completely stalled due to some error
 * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
 * In the other normal situations, it acts more gently by throttling the tasks
 * more (rather than completely block them) when the wb dirty pages go high.
 *
 * It allocates high/low dirty limits to fast/slow devices, in order to prevent
 * - starving fast devices
 * - piling up dirty pages (that will take long time to sync) on slow devices
 *
 * The wb's share of dirty limit will be adapting to its throughput and
 * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
 *
 * Return: @wb's dirty limit in pages. The term "dirty" in the context of
 * dirty balancing includes all PG_dirty and PG_writeback pages.
 */
static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
{
	struct wb_domain *dom = dtc_dom(dtc);
	unsigned long thresh = dtc->thresh;
	u64 wb_thresh;
	unsigned long numerator, denominator;
	unsigned long wb_min_ratio, wb_max_ratio;

	/*
	 * Calculate this BDI's share of the thresh ratio.
	 */
	fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
			      &numerator, &denominator);

	wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
	wb_thresh *= numerator;
	wb_thresh = div64_ul(wb_thresh, denominator);

	wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);

	wb_thresh += (thresh * wb_min_ratio) / 100;
	if (wb_thresh > (thresh * wb_max_ratio) / 100)
		wb_thresh = thresh * wb_max_ratio / 100;

	return wb_thresh;
}
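In Python form (a simplified model of `__wb_calc_thresh`; `fprop_fraction_percpu` is replaced by an explicit numerator/denominator pair, and the global `bdi_min_ratio` is taken as 0):

```python
def wb_calc_thresh(thresh, numerator, denominator,
                   wb_min_ratio=0, wb_max_ratio=100):
    """Sketch of __wb_calc_thresh: proportional share of thresh,
    floored by min_ratio and capped by max_ratio."""
    wb_thresh = thresh * numerator // denominator  # share by completion fraction
    wb_thresh += thresh * wb_min_ratio // 100      # min_ratio floor
    return min(wb_thresh, thresh * wb_max_ratio // 100)

# An inactive wb (no recent completions) gets wb_thresh == 0 ...
assert wb_calc_thresh(786432, 0, 1000) == 0
# ... unless min_ratio is raised, e.g. to 1%:
assert wb_calc_thresh(786432, 0, 1000, wb_min_ratio=1) == 7864
```

This makes the failure mode concrete: with zero recent completions the proportional share is zero, and only the min_ratio term keeps wb_thresh off the floor.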

If a wb is not actually slow but simply writes little data, it completes few pages per period, so the proportional wb_thresh comes out very low, even 0.
The comment above explicitly says wb_thresh is not a hard limit, yet the strictlimit capability uses it as one, which is broken for this case.
Meanwhile, the non-strictlimit path of the wb pos_ratio computation contains:

wb_position_ratio:
  /*
	 * It's very possible that wb_thresh is close to 0 not because the
	 * device is slow, but that it has remained inactive for long time.
	 * Honour such devices a reasonable good (hopefully IO efficient)
	 * threshold, so that the occasional writes won't be blocked and active
	 * writes can rampup the threshold quickly.
	 */
	// the comment notes wb_thresh may well be 0 because writes have been inactive, not because the wb is slow
    // so wb_thresh is bumped up here
	wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);

whereas the strictlimit path has only:

if (dtc->wb_dirty < 8) {
		dtc->pos_ratio = min_t(long long, pos_ratio * 2,
				  2 << RATELIMIT_CALC_SHIFT);
		return;
}
which is not enough.

Solutions

  1. Patch the kernel to drop strictlimit for fuse
    This effectively removes the strictlimit feature. It fixes the problem, but reintroduces the risks described in the strictlimit patch;
  2. Raise min_ratio
    min_ratio defaults to 0; set it to 1%, or higher depending on the case;
    then in __wb_calc_thresh:
    wb_thresh += (thresh * wb_min_ratio) / 100;
    keeps wb_thresh from getting too low.
  3. Change the wb_thresh calculation (__wb_calc_thresh) in the kernel
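Solution 2 can be applied at runtime through the per-bdi sysfs knob, with no kernel patch (the `0:57` device id below is taken from the trace earlier; substitute the bdi of your own fuse mount):

```shell
# check the current value (defaults to 0)
cat /sys/class/bdi/0:57/min_ratio
# reserve 1% of the dirty threshold for this bdi
echo 1 > /sys/class/bdi/0:57/min_ratio
```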
posted @ 2024-08-05 13:10  jimlinux