Limiting dirty writeback cache size for nbd volumes

This article describes the challenges of controlling the size of the dirty writeback cache, explains how a set of patches addresses these limitations, and shows how the dirty writeback cache size can be monitored.

Network block devices (nbd) are often used to provide remote block storage, and it is convenient to then create a filesystem on top of the nbd device. This article uses the BTRFS filesystem. This choice has implications for how the backing device info gets allocated, which are explained later.
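
A typical setup might look like the following sketch (the server name is a placeholder, and the exact nbd-client options depend on how the nbd server exports the device):

  > nbd-client <server> /dev/nbd1
  > mkfs.btrfs /dev/nbd1
  > mount /dev/nbd1 /mnt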

During testing the following problems have been identified:

  • it takes a long time to write back dirty blocks
  • under high memory pressure, the nbd network connections can get closed

While investigating the high memory pressure, it was discovered that with the default settings for dirty_ratio (20) and dirty_background_ratio (10), the dirty writeback cache can consume a considerable amount of memory. The current settings can be queried from the proc filesystem:
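
  > cat /proc/sys/vm/dirty_ratio
  20
  > cat /proc/sys/vm/dirty_background_ratio
  10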

The amount of dirty writeback cache currently in use can be queried with:

  > grep Dirty /proc/meminfo

On a machine with 64GB of main memory, up to 8GB of dirty writeback memory has been observed when testing with block devices. On a machine with 256GB of main memory, more than 20GB of dirty cache has been observed. To create the dirty memory, the following command was used:

> dd if=/dev/zero of=/dev/nbd1 bs=10G seek=0

To also evaluate how this affects the writeback cache when using the BTRFS filesystem, the fio program has been used with the following command:

  > fio --rw=randwrite --bs=1m --ioengine=libaio --iodepth=64    \
        --runtime=300 --numjobs=4 --time_based --group_reporting \
        --name=throughput-test-job --size=1gb

When using filesystems, up to 3GB of dirty writeback cache has been observed.

To make network block device volumes sustain memory pressure situations better, the amount of writeback memory needs to be limited. The size of the writeback cache can be limited by setting limits on the backing device info (BDI) of the device or filesystem. The backing device info currently supports two knobs:

The min_ratio knob assigns a minimum percentage of the writeback cache to a device. The max_ratio knob limits a particular device to use not more than the given percentage of the writeback cache. The two ratios are only applied once the dirty writeback cache size has reached $$ \frac{dirty\_ratio + dirty\_background\_ratio}{2} $$
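
For example, assuming an nbd device nbd1, the existing knobs can be set through the BDI directory of the block device in sysfs:

  > echo 10 > /sys/block/nbd1/bdi/min_ratio
  > echo 20 > /sys/block/nbd1/bdi/max_ratio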

The existing knobs are documented in Documentation/ABI/testing/sysfs-class-bdi. As dirty_ratio and dirty_background_ratio are global settings, changing them has wider implications. What is needed is something more fine-grained.

A new patch series has been accepted upstream. It implements four changes:

  1. Introduce the strictlimit knob. Currently the max_ratio knob exists to limit dirty memory. However, this knob is only applied once $$ \frac{dirty\_ratio + dirty\_background\_ratio}{2} $$ has been reached. With the BDI_CAP_STRICTLIMIT flag, the max_ratio can be applied without reaching that limit first. This change exposes the flag as a knob. It can also be useful for NFS, FUSE filesystems and USB devices.
  2. Use parts of 1000000 for internal calculations. The max_ratio is based on percentages. With current machine sizes, a percentage is very coarse (1% of 256GB of main memory is already 2.5GB). This change uses parts of 1000000 instead of percentages for the internal calculations.
  3. Introduce two new sysfs knobs: min_bytes and max_bytes. Currently all calculations are based on ratios, but for a user it is often more convenient to specify a limit in bytes. The new knobs do not store byte values; instead they translate the byte value to a corresponding ratio. As the internal values are now parts of 1000000, the ratio is closer to the specified value. However, the value should be seen more as an approximation, as it can fluctuate over time.
  4. Introduce two new sysfs knobs: min_ratio_fine and max_ratio_fine. The granularity of the existing sysfs BDI knobs min_ratio and max_ratio is a percentage value. The new sysfs BDI knobs min_ratio_fine and max_ratio_fine allow specifying the ratio in parts of 1000000 (an example of the new knobs follows this list).
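
As an illustration, the new knobs can be set in the same BDI sysfs directory as the existing ones (the device name and values are examples):

  > echo 100000000 > /sys/block/nbd1/bdi/max_bytes
  > echo 10000 > /sys/block/nbd1/bdi/max_ratio_fine

The first command limits the device to roughly 100MB of dirty writeback cache; the second expresses a 1% limit as 10000 parts of 1000000.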

The strictlimit knob exposes existing kernel functionality: the flag is already used by the FUSE filesystem, which always enables strict limiting.

There are two different ways to apply BDI settings: to the block device and to the BTRFS filesystem. For block devices, the settings can be set directly in /sys/block/<device>/bdi. The BTRFS filesystem creates its own BDI, which is different from the BDI of the block device. Each BTRFS BDI gets the name “btrfs-<number>”, where the number is incremented for each filesystem. Getting to the BDI of the filesystem is more complicated. The following process needs to be followed (it is assumed that the name of the device is known, which is the case for network block devices):

  > blkid -s UUID -o value /dev/<nbd-device>
  This returns the UUID of the BTRFS filesystem

  > echo 1 > /sys/fs/btrfs/<uuid>/bdi/strict_limit
  This sets strict_limit on the filesystem BDI, using the UUID from the previous command

It is important to understand that BTRFS does not apply the block device BDI settings; it only applies its own settings. If a device can be used as a block device and as a filesystem (at different times), it might make sense to specify BDI settings for both the block device and the BTRFS filesystem. The BDI settings are only stored in memory: after a filesystem has been mounted, or the device driver has been loaded (in case it is a module), the settings have to be set again.
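
Putting this together, a sketch for re-applying a filesystem limit after each mount could look like this (device, mount point and limit are example values; it assumes the /sys/fs/btrfs/<uuid>/bdi link shown above):

  > mount /dev/nbd1 /mnt
  > uuid=$(blkid -s UUID -o value /dev/nbd1)
  > echo 1 > /sys/fs/btrfs/$uuid/bdi/strict_limit
  > echo 1000000000 > /sys/fs/btrfs/$uuid/bdi/max_bytes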

Tests have shown that by setting the new sysfs knobs, the size of the dirty writeback cache can be limited accordingly. The following sysfs settings have been used for some of the testing:

  > echo 1 > /sys/class/bdi/btrfs-2/strict_limit
  > echo 1000000000 > /sys/block/nbd1/bdi/max_bytes

It should not be expected that the dirty writeback cache reaches max_ratio or max_bytes, as writeback starts before that limit is reached. By limiting the dirty writeback cache size, the time to write back dirty blocks to the storage device has also decreased considerably, and the write throughput no longer spikes and is more consistent.
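
The effect can be observed while one of the above workloads is running, for example by sampling the dirty counters once per second:

  > watch -n 1 'grep Dirty /proc/meminfo'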

A BDI object gets allocated for all block devices. In addition, some filesystems allocate their own BDI object; one example is the BTRFS filesystem.
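
All currently allocated BDI objects can be listed through sysfs; block devices appear under their major:minor numbers and each mounted BTRFS filesystem appears as btrfs-<number>:

  > ls /sys/class/bdi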

When the BTRFS filesystem gets mounted, the btrfs_mount_root() function calls the btrfs_fill_super() function.

static int btrfs_fill_super(struct super_block *sb,
			    struct btrfs_fs_devices *fs_devices,
			    void *data)
{
	struct inode *inode;
	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
	int err;

	sb->s_maxbytes = MAX_LFS_FILESIZE;
	sb->s_magic = BTRFS_SUPER_MAGIC;
	sb->s_op = &btrfs_super_ops;
	sb->s_d_op = &btrfs_dentry_operations;
	sb->s_export_op = &btrfs_export_ops;
#ifdef CONFIG_FS_VERITY
	sb->s_vop = &btrfs_verityops;
#endif
	sb->s_xattr = btrfs_xattr_handlers;
	sb->s_time_gran = 1;
#ifdef CONFIG_BTRFS_FS_POSIX_ACL
	sb->s_flags |= SB_POSIXACL;
#endif
	sb->s_flags |= SB_I_VERSION;
	sb->s_iflags |= SB_I_CGROUPWB;

	err = super_setup_bdi(sb);
	if (err) {
		btrfs_err(fs_info, "super_setup_bdi failed");
		return err;
	}

	err = open_ctree(sb, fs_devices, (char *)data);
	if (err) {
		btrfs_err(fs_info, "open_ctree failed");
		return err;
	}

	inode = btrfs_iget(sb, BTRFS_FIRST_FREE_OBJECTID, fs_info->fs_root);
	if (IS_ERR(inode)) {
		err = PTR_ERR(inode);
...

This function in turn calls super_setup_bdi(). The super_setup_bdi() function allocates a new BDI and registers it under a name consisting of the filesystem type name and a running number, so a new BDI directory “btrfs-<number>” shows up in sysfs.

/*
 * Setup private BDI for given superblock. It gets automatically cleaned up
 * in generic_shutdown_super().
 */
int super_setup_bdi(struct super_block *sb)
{
	static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);

	return super_setup_bdi_name(sb, "%.28s-%ld", sb->s_type->name,
				    atomic_long_inc_return(&bdi_seq));
}
EXPORT_SYMBOL(super_setup_bdi);

The BDI name is set up in the function super_setup_bdi_name() from the filesystem type name and a running number. At the end, super_setup_bdi_name() assigns the newly allocated BDI to the superblock.

/*
 * Setup private BDI for given superblock. It gets automatically cleaned up
 * in generic_shutdown_super().
 */
int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
{
	struct backing_dev_info *bdi;
	int err;
	va_list args;

	bdi = bdi_alloc(NUMA_NO_NODE);
	if (!bdi)
		return -ENOMEM;

	va_start(args, fmt);
	err = bdi_register_va(bdi, fmt, args);
	va_end(args);
	if (err) {
		bdi_put(bdi);
		return err;
	}
	WARN_ON(sb->s_bdi != &noop_backing_dev_info);
	sb->s_bdi = bdi;
	sb->s_iflags |= SB_I_PERSB_BDI;

	return 0;
}
EXPORT_SYMBOL(super_setup_bdi_name);

In the write code path of the BTRFS filesystem, the btrfs_buffered_write() function eventually gets called to perform the buffered writes. This function invokes btrfs_write_check(), which sets the backing_dev_info pointer of the task struct.

static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
			     size_t count)
{
	struct file *file = iocb->ki_filp;
	struct inode *inode = file_inode(file);
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
	loff_t pos = iocb->ki_pos;
	int ret;
	loff_t oldsize;
	loff_t start_pos;

	/*
	 * Quickly bail out on NOWAIT writes if we don't have the nodatacow or
	 * prealloc flags, as without those flags we always have to COW. We will
	 * later check if we can really COW into the target range (using
	 * can_nocow_extent() at btrfs_get_blocks_direct_write()).
	 */
	if ((iocb->ki_flags & IOCB_NOWAIT) &&
	    !(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
		return -EAGAIN;

	current->backing_dev_info = inode_to_bdi(inode);
	ret = file_remove_privs(file);
	if (ret)
		return ret;
...

The write first goes to the page cache. Eventually the function balance_dirty_pages() gets invoked; it throttles the writing task and makes sure the dirty pages get written back. Among other parameters, it gets passed a bdi_writeback struct, which contains a pointer to the above BDI object. To determine how many pages may be dirtied and how much to write out in one batch, balance_dirty_pages() uses the information of the BDI object.
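
The per-BDI view of the dirty thresholds and counters that balance_dirty_pages() works with can be inspected through debugfs (assuming debugfs is enabled and mounted; the BDI name is an example):

  > cat /sys/kernel/debug/bdi/btrfs-2/stats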

For block devices the BDI structure gets allocated when blk_alloc_disk() is called.

/**
 * blk_alloc_disk - allocate a gendisk structure
 * @node_id: numa node to allocate on
 *
 * Allocate and pre-initialize a gendisk structure for use with BIO based
 * drivers.
 *
 * Context: can sleep
 */
#define blk_alloc_disk(node_id)					\
({								\
	static struct lock_class_key __key;			\
								\
	__blk_alloc_disk(node_id, &__key);			\
})

This macro calls down to __blk_alloc_disk()

struct gendisk *__blk_alloc_disk(int node, struct lock_class_key *lkclass)
{
	struct request_queue *q;
	struct gendisk *disk;

	q = blk_alloc_queue(node, false);
	if (!q)
		return NULL;

	disk = __alloc_disk_node(q, node, lkclass);
	if (!disk) {
		blk_put_queue(q);
		return NULL;
	}
	set_bit(GD_OWNS_QUEUE, &disk->state);
	return disk;
}
EXPORT_SYMBOL(__blk_alloc_disk);

which in turn calls __alloc_disk_node() to invoke bdi_alloc().

struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
		struct lock_class_key *lkclass)
{
	struct gendisk *disk;

	disk = kzalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id);
	if (!disk)
		return NULL;

	if (bioset_init(&disk->bio_split, BIO_POOL_SIZE, 0, 0))
		goto out_free_disk;

	disk->bdi = bdi_alloc(node_id);
	if (!disk->bdi)
		goto out_free_bioset;

	/* bdev_alloc() might need the queue, set before the first call */
	disk->queue = q;
...

When a write has to wait for writeback to the storage device, this can be traced. The function balance_dirty_pages() has a tracepoint defined that can be used for this purpose.

		/*
		 * For less than 1s think time (ext3/4 may block the dirtier
		 * for up to 800ms from time to time on 1-HDD; so does xfs,
		 * however at much less frequency), try to compensate it in
		 * future periods by updating the virtual time; otherwise just
		 * do a reset, as it may be a light dirtier.
		 */
		if (pause < min_pause) {
			trace_balance_dirty_pages(wb,
						  sdtc->thresh,
						  sdtc->bg_thresh,
						  sdtc->dirty,
						  sdtc->wb_thresh,
						  sdtc->wb_dirty,
						  dirty_ratelimit,
						  task_ratelimit,
						  pages_dirtied,
						  period,
						  min(pause, 0L),
						  start_time);

It contains information about which BDI is used to do the writeback. The tracepoint is defined in include/trace/events/writeback.h and can be enabled as follows:

  > echo 1 > /sys/kernel/debug/tracing/tracing_on
  > echo 1 > /sys/kernel/debug/tracing/events/writeback/balance_dirty_pages/enable
  > cat /sys/kernel/debug/tracing/trace_pipe

Typical output for the tracepoint looks like this (the output has been reformatted to make it easier to read):

fio-3093904 [044] ..... 1545451.315895: balance_dirty_pages: bdi 259:0: limit=245 setpoint=214 dirty=621
                                                             bdi_setpoint=0 bdi_dirty=2 dirty_ratelimit=249060
                                                             task_ratelimit=0 dirtied=0 dirtied_pause=0 paused=0
                                                             pause=3 period=3 think=2 cgroup_ino=7947