Limiting dirty writeback cache size for nbd volumes
This article describes the limitations of controlling the size of the dirty writeback cache. It explains how a set of patches has addressed these limitations and how the dirty writeback cache size can be monitored.
Overview
Network block devices (nbd) are often used to provide remote block storage. It is then convenient to create a filesystem on top of the nbd device. This article uses the BTRFS filesystem. This choice affects how the backing device info gets allocated, which is explained later.
During testing the following problems have been identified:
- it takes a long time to write back dirty blocks
- under high memory pressure, the nbd network connections can be closed.
Investigation
While investigating the high memory pressure, it was discovered that with the default settings for dirty_ratio (20) and dirty_background_ratio (10), the dirty writeback cache can consume a considerable amount of memory. The dirty settings can be queried from the proc filesystem:
The amount of the dirty writeback cache can be queried with:
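As a minimal sketch of such a query, the following Python helpers read the two settings and the current amount of dirty and writeback memory. It assumes the standard locations /proc/sys/vm/ and /proc/meminfo; the function names are illustrative and not part of any existing tool.

```python
from pathlib import Path


def read_vm_setting(name, base="/proc/sys/vm"):
    """Return an integer vm setting such as dirty_ratio."""
    return int(Path(base, name).read_text().strip())


def parse_meminfo_kb(text, fields=("Dirty", "Writeback")):
    """Extract the selected fields (reported in kB) from /proc/meminfo content."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in fields:
            # Each line looks like "Dirty:   1024 kB"; keep the number only.
            values[key] = int(rest.split()[0])
    return values
```

On a Linux machine, `read_vm_setting("dirty_ratio")` returns the current setting, and `parse_meminfo_kb(Path("/proc/meminfo").read_text())` returns the current Dirty and Writeback values in kB.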
On a machine with 64GB of main memory, up to 8GB of dirty writeback memory has been observed when testing with block devices. On a machine with 256GB of main memory, more than 20GB of dirty cache has been observed. To create the dirty memory, the following command was used:
To also evaluate how this affects the writeback cache when using the BTRFS filesystem, the fio program has been used with the following command:
When using the filesystem, up to 3GB of dirty writeback cache has been observed.
Limiting the size of the writeback cache
To make network block device volumes cope better with memory pressure, the amount of writeback memory needs to be limited. The size of the writeback cache can be limited by setting limits on the backing device info (BDI) of the device or filesystem. The backing device info currently supports two knobs: min_ratio and max_ratio.
The min_ratio knob assigns a minimum percentage of the writeback cache to a device. The max_ratio knob limits a particular device to no more than the given percentage of the writeback cache. Both ratios are only applied once the dirty writeback cache size has reached $$ \frac{\text{dirty\_ratio} + \text{dirty\_background\_ratio}}{2} $$
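To make the threshold concrete, here is a small worked sketch in Python. It assumes, as a simplification, that the percentages apply to the total main memory; the function names are hypothetical.

```python
def ratio_activation_percent(dirty_ratio, dirty_background_ratio):
    """Percentage at which min_ratio and max_ratio start to be applied."""
    return (dirty_ratio + dirty_background_ratio) / 2


def ratio_activation_bytes(memory_bytes, dirty_ratio, dirty_background_ratio):
    """Same threshold in bytes, using integer arithmetic only."""
    return memory_bytes * (dirty_ratio + dirty_background_ratio) // 200


# Default settings from this article: dirty_ratio=20, dirty_background_ratio=10.
default_percent = ratio_activation_percent(20, 10)
bytes_on_64gib = ratio_activation_bytes(64 * 2**30, 20, 10)
```

With the default settings the threshold is 15%, which under the stated simplification is roughly 9.6GiB on a machine with 64GiB of memory. Below that point, max_ratio alone does not restrict a device.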
The existing knobs are documented. Because dirty_ratio and dirty_background_ratio are global settings, changing them affects the whole system. What is needed is a more fine-grained control.
Patch with more controls
A new patch has been accepted upstream. The patch implements four changes:
- Introduce the strictlimit knob. Currently the max_ratio knob exists to limit the dirty memory. However, this knob only applies once $$ \frac{\text{dirty\_ratio} + \text{dirty\_background\_ratio}}{2} $$ has been reached. With the BDI_CAP_STRICTLIMIT flag, max_ratio can be applied without reaching that limit. This change exposes that flag as a knob. The knob can also be useful for NFS, fuse filesystems and USB devices.
- Use parts per 1000000 for the internal calculations. The max_ratio is based on percentages. With current machine sizes, a single percentage point already covers a large amount of memory (1% of 256GB of main memory is about 2.5GB). This change uses parts per 1000000 instead of percentages for the internal calculations.
- Introduce two new sysfs knobs: min_bytes and max_bytes. Currently all calculations are based on ratios, but for a user it is often more convenient to specify a limit in bytes. The new knobs do not store byte values; instead they translate the byte value to a corresponding ratio. As the internal values are now parts per 1000000, the resulting ratio is close to the specified value. However, the value should be seen as an approximation, as it can fluctuate over time.
- Introduce two new sysfs knobs: min_ratio_fine and max_ratio_fine. The granularity of the existing sysfs BDI knobs min_ratio and max_ratio is one percent. The new sysfs BDI knobs min_ratio_fine and max_ratio_fine allow specifying the ratio in parts per 1 million.
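The effect of the finer granularity can be illustrated with a rough sketch. This is not the kernel's exact arithmetic: it assumes a byte limit is converted into a ratio against some dirty threshold, and both the threshold of 20GiB and the function names are assumptions made for illustration.

```python
PPM = 1_000_000  # parts per million, the new internal granularity


def bytes_to_ratio_ppm(limit_bytes, dirty_threshold_bytes):
    """Approximate a byte limit as a parts-per-million ratio of a threshold."""
    return limit_bytes * PPM // dirty_threshold_bytes


def ratio_ppm_to_bytes(ratio_ppm, dirty_threshold_bytes):
    """Translate a parts-per-million ratio back into bytes."""
    return ratio_ppm * dirty_threshold_bytes // PPM


# Hypothetical values: a 200MiB limit against a 20GiB dirty threshold.
threshold = 20 * 2**30
ratio = bytes_to_ratio_ppm(200 * 2**20, threshold)
effective = ratio_ppm_to_bytes(ratio, threshold)
```

In this example the 200MiB limit maps to 9765 parts per million, which translates back to a value just below 200MiB. With whole percentages the same limit would have to be rounded to either 0% or 1%, and because the threshold itself changes over time, the effective byte value is only an approximation.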
The strictlimit knob exposes existing kernel functionality. The underlying flag is already used by the fuse filesystem, which always enables strictlimit.
How to apply the BDI settings
There are two different ways to apply BDI settings: to the block device and to the BTRFS filesystem.
For block devices the settings can be directly set in /sys/block/
It is important to understand that BTRFS does not apply the block device BDI settings; it only applies its own settings. If a device can be used both as a raw block device and with a filesystem (at different times), it might make sense to specify BDI settings for both the block device and the BTRFS filesystem. The BDI settings are only stored in memory. After a filesystem has been mounted, or the device driver has been loaded (in case it is a module), the settings have to be set again.
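Because the settings are lost on every mount or module load, it can help to script them. The following is a minimal sketch of such a helper, not an existing tool. The directory passed in is a parameter, and the knob file names used in the usage note (max_bytes, strict_limit) are assumptions about how the knobs appear in sysfs; check them against the running kernel.

```python
from pathlib import Path


def set_bdi_knob(bdi_dir, knob, value):
    """Write one value to a BDI knob file and return what reads back."""
    path = Path(bdi_dir) / knob
    path.write_text(f"{value}\n")
    # The kernel may store an approximation, so report the value it kept.
    return path.read_text().strip()


def apply_bdi_settings(bdi_dir, settings):
    """Apply several knobs at once; settings maps knob name to value."""
    return {knob: set_bdi_knob(bdi_dir, knob, value)
            for knob, value in settings.items()}
```

A call such as `apply_bdi_settings("/sys/class/bdi/btrfs-1", {"strict_limit": 1, "max_bytes": 200 * 2**20})` would need root privileges; the BDI directory name here is hypothetical. Such a call has to be repeated after each mount.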
Using the new sysfs knobs
Tests have shown that the new sysfs knobs limit the size of the dirty writeback cache accordingly. The following values have been used for some of the testing:
Do not expect the dirty writeback cache to reach max_ratio or max_bytes, as writeback starts before that limit is reached. Limiting the dirty writeback cache size has also considerably decreased the time needed to write back dirty blocks to the storage device, and the write throughput no longer shows spikes and is more consistent.
Kernel internals
A BDI object gets allocated for every block device. In addition, some filesystems allocate their own BDI object. BTRFS is one of them.
BTRFS BDI code path
When the BTRFS filesystem gets mounted, the btrfs_mount_root() function calls the btrfs_fill_super() function.
This function in turn calls super_setup_bdi(). The super_setup_bdi() function allocates a new BDI, which sets up a new BDI directory with the name “bdi-<number>”.
The BDI name is set up in the function super_setup_bdi_name() and consists of the filesystem type name and a running number. At its end, super_setup_bdi_name() assigns the newly allocated BDI to the superblock.
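Since the name contains a running number, the BDI directory of a mounted filesystem is not known in advance. A small sketch for finding it, assuming the BDI directories are exposed under /sys/class/bdi and that the entries of interest share a common name prefix (both are assumptions to verify on the target system):

```python
from pathlib import Path


def list_bdis(bdi_root="/sys/class/bdi", prefix=""):
    """Return the sorted names of BDI entries that start with prefix."""
    root = Path(bdi_root)
    return sorted(p.name for p in root.iterdir() if p.name.startswith(prefix))
```

Calling `list_bdis(prefix="btrfs-")` before and after a mount shows which entry was added, which is the one the settings have to be applied to.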
In the write code path of the BTRFS filesystem, the btrfs_buffered_write() function is eventually called to perform the buffered writes. This function invokes btrfs_write_check(), which sets the backing_dev_info pointer of the task struct.
The write first goes to the page cache. Eventually the dirty buffers get written from the page cache with the function balance_dirty_pages(). When invoked, the function receives, among other parameters, a bdi_writeback struct, which contains a pointer to the above BDI object. To determine how much to write out in one batch, balance_dirty_pages() uses the information in the BDI object.
Block device code path
For block devices the BDI structure gets allocated when blk_alloc_disk() is called.
This function calls down to __blk_alloc_disk()
and alloc_disk_node() to invoke bdi_alloc().
Kernel tracing
When the writeback to the storage device has to wait, this can be traced. The function balance_dirty_pages() defines a tracepoint that can be used for this purpose.
It contains information about which BDI is used to do the write back. The tracepoint is defined here.
Typical output for the tracepoint looks like this (the output has been reformatted to make it easier to read):
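To post-process such output, a small parser can turn each trace line into a dictionary. The sketch below assumes the line contains a "bdi <name>:" prefix followed by key=value pairs with integer values; this format is an assumption, and the sample line in the usage note is synthetic, not a measurement.

```python
import re


def parse_balance_dirty_pages(line):
    """Parse 'bdi NAME: key=value ...' into a dict, or return None."""
    match = re.search(r"bdi (\S+):\s+(.*)", line)
    if match is None:
        return None
    fields = {"bdi": match.group(1)}
    # Collect every integer key=value pair that follows the BDI name.
    for key, value in re.findall(r"(\w+)=(-?\d+)", match.group(2)):
        fields[key] = int(value)
    return fields
```

For a synthetic line such as `"bdi 8:0: limit=1000 dirty=500 bdi_dirty=60 pause=-4"`, the parser returns the BDI name "8:0" together with the integer fields, which makes it easy to filter the trace for one device or filesystem.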