For readers who want an overview of the patches I have submitted upstream (be patient, loading git.kernel.org takes a while).
List of patch series
The following is a list of the more important patch series that I developed and that have been upstreamed. The list is in chronological order. After the title, each entry notes the Linux release in which the most recent version of the patch series was submitted and the subsystems that were part of the series. The subsystem names are abbreviated with the common short names from the kernel directory names:
- block = block layer
- btrfs
- fs = filesystem layer
- iomap
- io-uring
- mm = memory management
- nvme
- xfs
Add “smart scan” mode to KSM
This adds the “smart scan” mode to KSM. Smart scanning skips pages on the next scan that have not been de-duplicated in the previous scans.
Add pages_skipped metric
This adds the pages_skipped metric to KSM, which makes it possible to measure how effective the “smart scan” mode is.
Add pages_scanned metric to KSM
This adds the pages_scanned metric to /sys/kernel/mm/ksm. The metric is cumulative and makes it possible to analyze how much work KSM is doing per scan.
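As a rough illustration (not part of the patch series itself), the counters can be read directly from sysfs. The sketch below assumes both pages_scanned and pages_skipped live under /sys/kernel/mm/ksm; everything else is plain file I/O:

```c
#include <stdio.h>

/* Print one cumulative KSM counter from /sys/kernel/mm/ksm. */
static void print_ksm_counter(const char *name)
{
	char path[256];
	unsigned long long value;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%llu", &value) != 1)
		printf("%s: not available on this kernel\n", name);
	else
		printf("%s: %llu\n", name, value);
	if (f)
		fclose(f);
}

int main(void)
{
	/* The two metrics described above. */
	print_ksm_counter("pages_scanned");
	print_ksm_counter("pages_skipped");
	return 0;
}
```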
Add ksm stats to /proc/<pid>/smaps
So far KSM statistics have only been available at the global level, which made it difficult to analyze and investigate which VMAs benefit from KSM. This adds KSM statistics at the VMA level. The new KSM statistics are exposed in /proc/<pid>/smaps and /proc/<pid>/smaps_rollup.
Adding process control API to enable KSM per process
So far KSM could only be enabled per VMA. With this patch series KSM can be enabled for all compatible VMAs of a process. In addition, the setting is inherited when the process is forked.
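A minimal sketch of how a process can opt into KSM through the new process control API. The prctl constants are spelled here from memory (PR_SET_MEMORY_MERGE / PR_GET_MEMORY_MERGE, with fallback values for older headers), and the call may require CAP_SYS_RESOURCE:

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers; values as I recall them. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

int main(void)
{
	/* Mark all compatible VMAs of this process as mergeable by KSM.
	 * The setting is inherited by children created with fork(). */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0) != 0)
		perror("PR_SET_MEMORY_MERGE");

	/* Query the current per-process setting. */
	printf("KSM enabled for process: %d\n",
	       prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
	return 0;
}
```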
Adding tracepoints to KSM
This patch series adds tracepoints to the Kernel Samepage Merging (KSM) component. This makes it possible to trace the start and stop of a scan as well as the merging of individual pages.
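As a sketch, the tracepoints can be turned on through tracefs; the event group name "ksm" and the tracefs mount point below are assumptions based on the component name:

```c
#include <stdio.h>

int main(void)
{
	/* Enable all tracepoints in the (assumed) "ksm" event group. */
	FILE *f = fopen("/sys/kernel/tracing/events/ksm/enable", "w");

	if (!f) {
		perror("enable ksm tracepoints (needs root and a mounted tracefs)");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);

	/* Scan start/stop and page-merge events now show up in trace_pipe. */
	puts("KSM tracepoints enabled; read /sys/kernel/tracing/trace_pipe");
	return 0;
}
```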
Backing device information changes
This patch series provides new knobs to influence the size and management of the dirty page cache. In addition, it changes the internal calculation to use parts per million instead of percentage values.
P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14, P15, P16, P17, P18, P19, P20
Enable batched completions of passthrough I/O
The filesystem IO path can take advantage of allocating batches of requests, if the underlying submitter tells the block layer about it through the blk_plug. For passthrough IO, the exported API is the blk_mq_alloc_request() helper, and that one does not allow for request caching. This patch series closes that gap and enables batched completions for passthrough IO as well.
I developed and implemented the prototype of the feature and worked with Jens Axboe on upstreaming this patch series.
Support async buffered writes for io-uring on btrfs
This patch series adds support for async buffered writes when using both btrfs and io-uring. Currently io-uring supports buffered writes for btrfs only in the slow path, by processing them in the io workers. With this patch series it is now possible to handle buffered writes in the fast path. To use the fast path, the required pages must already be in the page cache, the required btrfs locks must be granted immediately, and no additional blocks need to be read from disk.
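For illustration, here is a plain buffered write submitted through liburing (no O_DIRECT). Whether it completes inline in the fast path or is punted to an io worker depends on the conditions described above; the file path is made up and the file must of course live on a btrfs (or xfs) filesystem to exercise those code paths:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static const char buf[] = "hello buffered io_uring write\n";
	/* Regular (buffered) open, i.e. no O_DIRECT: the write goes through
	 * the page cache and can use the async buffered write fast path. */
	int fd = open("/tmp/io_uring_buffered_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0) {
		perror("setup");
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, strlen(buf), 0);
	io_uring_submit(&ring);

	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("write completed: res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```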
This patch series makes use of the changes that have been introduced by a previous patch series: “io-uring/xfs: support async buffered writes”
Performance results
The new patch improves throughput by over two times (compared to the existing behavior, where buffered writes are processed by an io-worker process) and also reduces latency considerably. Detailed results are part of the changelog of the first commit.
This patch series is based on the earlier patch series that provided async buffered writes for XFS.
P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12
The patch series has been discussed on phoronix, zetabizu, cosfone.
btrfs: sysfs: set / query btrfs chunk size
The btrfs allocator is currently not ideal for all workloads. It tends to suffer from overallocating data block groups and underallocating metadata block groups. This results in filesystems becoming read-only even though there is plenty of “free” space.
Support async buffered writes on XFS for io-uring
This patch series adds support for async buffered writes when using both xfs and io-uring. Currently io-uring only supports buffered writes in the slow path, by processing them in the io workers. With this patch series it is now possible to handle buffered writes in the fast path. To use the fast path, the required pages must already be in the page cache, the required xfs locks must be granted immediately, and no additional blocks need to be read from disk.
Updating the inode timestamp can take time, so an optimization has been implemented: timestamp updates are processed in the slow path, and while a timestamp update is already in progress, other write requests for the same file can skip updating the modification time.
Performance results
For fio the following results have been obtained with a queue depth of 1 and 4k block size for sequential writes (runtime 600 secs):
Metric | without patch | with patch | libaio | psync |
---|---|---|---|---|
IOPS | 77k | 209k | 195k | 233k |
bandwidth | 314 MB/s | 854 MB/s | 790 MB/s | 953 MB/s |
completion latency | 9600 ns | 120 ns | 540 ns | 3000 ns |
For an io depth of 1, the new patch improves throughput by over three times (compared to the existing behavior, where buffered writes are processed by an io-worker process), and the latency of buffered write operations is reduced considerably. To achieve the same or better performance with the existing code, an io depth of 4 is required. Increasing the io depth further does not lead to improvements.
P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14
The patch series has been discussed on phoronix and twitter, and in a Kernel Recipes presentation.
Add large CQE support for io-uring (April 2022, io-uring)
This adds large CQE support to io-uring. Large CQEs are 16 bytes longer, which doubles the size of each CQE. To support the longer CQEs, both the allocation of the CQ ring and the accesses to CQEs need to change: the allocation is twice as big, and the ring size calculation needs to take this into account.
All accesses to large CQEs need to be shifted by one to account for the bigger size of each CQE; the existing index manipulation itself does not need to change.
The setup and the completion processing need to take the new fields into account and initialize them. For the completion processing these fields need to be passed through.
The flush completion processing needs to fill the additional CQE32 fields.
The code for overflows needs to be adapted accordingly: the order of the fields in the io overflow structure changes, and the allocation needs to be enlarged for big CQEs. In addition, the two new fields need to be copied for large CQEs.
The new fields are added to the tracing statements, so the extra1 and extra2 fields are exposed in tracing. The new fields are also exposed in the /proc filesystem entry.
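A minimal sketch of what this looks like from userspace via liburing. The IORING_SETUP_CQE32 flag and the big_cqe[] trailer are quoted from memory from the uapi header, and a NOP request is used only to drive a completion (a NOP does not fill the extra fields):

```c
#include <stdio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	/* IORING_SETUP_CQE32 doubles the size of each CQE from 16 to 32 bytes. */
	if (io_uring_queue_init(8, &ring, IORING_SETUP_CQE32) < 0) {
		perror("io_uring_queue_init");
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);
	io_uring_submit(&ring);

	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		/* With CQE32, two extra 64-bit fields trail each CQE; opcodes
		 * that use them fill them in, a NOP leaves them at zero. */
		printf("res=%d extra1=%llu extra2=%llu\n", cqe->res,
		       (unsigned long long)cqe->big_cqe[0],
		       (unsigned long long)cqe->big_cqe[1]);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```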
P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12
Add xattr support to io-uring
This adds xattr support to io_uring. The intent is to have more complete support for file operations in io_uring.
This change adds support for the following functions to io_uring (a rough usage sketch follows the list):
- fgetxattr
- fsetxattr
- getxattr
- setxattr
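The sketch below uses liburing's corresponding prep helpers; their exact signatures are quoted from memory and may vary slightly between liburing versions, and the file path and attribute name are made up:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	const char *name = "user.example";   /* made-up attribute name */
	const char *value = "42";
	char out[64] = "";
	int fd = open("/tmp/xattr_test", O_RDWR | O_CREAT, 0644);

	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0) {
		perror("setup");
		return 1;
	}

	/* Set the xattr through io_uring ... */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_fsetxattr(sqe, fd, name, value, 0, strlen(value));
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);

	/* ... and read it back the same way. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_fgetxattr(sqe, fd, name, out, sizeof(out) - 1);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->res > 0)
		printf("%s = %.*s\n", name, cqe->res, out);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}
```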
Make statx api for io-uring stable (Feb 2022, fs, io-uring)
One of the key architectural tenets of io-uring is to keep the parameters for io-uring calls stable: after the call has been submitted, the application is free to change or reuse the passed-in values. Unfortunately this is not the case for the current statx implementation.
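To illustrate the tenet, here is a sketch with liburing's io_uring_prep_statx helper; with a stable statx implementation the application may reuse the path buffer after io_uring_submit() returns, which is what this series is about (the path used is made up):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct statx stx;
	char path[64];

	strcpy(path, "/etc/hostname");          /* made-up example path */

	if (io_uring_queue_init(8, &ring, 0) < 0) {
		perror("io_uring_queue_init");
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_statx(sqe, AT_FDCWD, path, 0, STATX_SIZE, &stx);
	io_uring_submit(&ring);

	/* With a stable statx implementation the kernel has taken its own copy
	 * of the path during submission, so the buffer may be reused here
	 * without affecting the in-flight request. */
	memset(path, 0, sizeof(path));

	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		if (cqe->res == 0)
			printf("size: %llu bytes\n",
			       (unsigned long long)stx.stx_size);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```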
Make io-uring tracepoints consistent (Feb 2022, io-uring)
This makes the io-uring tracepoints consistent. Where it makes sense the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode