Projects

For the reader who wants to get an overview over the patches that I have submitted upstream (be patient, loading git.kernel.org takes a while).

List of patch series

The following is a list of the more important patch series that I developed and that have been upstreamed. The list is ordered in chronological order. After the title it contains the linux release when the most current version of the patch series was submitted and which subsystems were part of the patch series. The subsystem names are abbreviated with the commmon short names from the kernel directory names:

block = block layer
btrfs
fs = filesystem layer
iomap
io-uring
mm = memory management
nvme
xfs

Updating UKSM patches

I’m still updating the UKSM (Ulta Kernel samepage merging patches). While the patch never made it into the upstream kernel, I have used the patches at my previous work at Meta for a comparison to KSM. As the patch is still used by some people and depending on the workload might provide some advantages in comparison to KSM, I continued to update them to support newer kernel versions. The patches can be accessed here.

Add “smart scan” mode to KSM

mm

This adds the “smart scan” mode to KSM. Smart scanning skips pages on the next scan that have not been de-duplicated in the previous scans.

P1, P2

Add pages_skipped metric

mm

This adds the pages_skipped metric to KSM. This allows to measure how effective the “smart scan” mode is.

P1, P2

Add pages_scanned metric to KSM

mm

This adds the pages_scanned metric to /sys/kernel/mm/ksm. This metric is cumulative and allows to analyze how much work KSM is doing per scan.

Add ksm stats to /proc/<pid>/smaps

mm

So far KSM statistics have only been available at the global level. To analyze and investigate which VMA’s benefit from KSM was difficult. This adds KSM statistics at the VMA level. The new KSM statistics are exposed in /proc/<pid>/smaps and /proc/<pid>/smaps_rollups.

Adding process control API to enable KSM per process

mm

So far KSM could only be enabled per VMA. With this patch KSM can be enabled for all compatible VMA’s of a process. In addition the setting gets also inherited when the process gets forked.

Cover letter

P1, P2, P2

Adding tracepoints to KSM

mm

This patch adds tracepoints to the Kernel Samepage Merging (KSM) component. This allows to trace the start, stop of a scan and the merging of pages itself.

Backing device information changes

mm

This patch provides new knobs to influence the behavior of the size and management of the dirty pagecache. In addition it also changes the internal calculation to use part of 1000000 instead of percentage values.

Cover letter

P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14, P15, P16, P17, P18, P19, P20

Enable batched completions of passthrough I/O

mm block

The filesystem IO path can take advantage of allocating batches of requests, if the underlying submitter tells the block layer about it through the blk_plug. For passthrough IO, the exported API is the blk_mq_alloc_request() helper, and that one does not allow for request caching.

I developed and implemented the prototype of the feature and worked with Jens Axboe on upstreaming this patch series.

P1, P2, P3

Support async buffered writes for io-uring on btrfs

mm btrfs

This patch series adds support for async buffered writes when using both btrfs and io-uring. Currently io-uring only supports buffered writes (for btrfs) in the slow path, by processing them in the io workers. With this patch series it is now possible to support buffered writes in the fast path. To be able to use the fast path, the required pages must be in the page cache, the required locks in btrfs can be granted immediately and no additional blocks need to be read form disk.

This patch series makes use of the changes that have been introduced by a previous patch series: “io-uring/xfs: support async buffered writes”

Performance results

The new patch improves throughput by over two times (compared to the exiting behavior, where buffered writes are processed by an io-worker process) and also the latency is considerably reduced. Detailled results are part of the changelog of the first commit.

This patch series is based on the earlier patch series, that provided async buffered writes for XFS.

P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12

The patch series has been disccussed on phoronix, zetabizu, cosfone.

btrfs: sysfs: set / query btrfs chunk size

btrfs

The btrfs allocator is currently not ideal for all workloads. It tends to suffer from overallocating data block groups and underallocating metadata block groups. This results in filesystems becoming read-only even though there is plenty of “free” space.

P1, P2, P3

Support async buffered writes on XFS for io-uring

mm iomap fs xfs io-uring

This patch series adds support for async buffered writes when using both xfs and io-uring. Currently io-uring only supports buffered writes in the slow path, by processing them in the io workers. With this patch series it is now possible to support buffered writes in the fast path. To be able to use the fast path the required pages must be in the page cache, the required locks in xfs can be granted immediately and no additional blocks need to be read form disk.

Updating the inode can take time. An optimization has been implemented for the time update. Time updates will be processed in the slow path. While there is already a time update in process, other write requests for the same file, can skip the update of the modification time.

Performance results

For fio the following results have been obtained with a queue depth of 1 and 4k block size for sequential writes (runtime 600 secs):

Metric	without patch	with patch	libaio	psync
iops:	77k	209k	195K	233K
bw:	314MB/s	854MB/s	790MB/s	953MB/s
clat:	9600ns	120ns	540ns	3000ns

For an io depth of 1, the new patch improves throughput by over three times (compared to the exiting behavior, where buffered writes are processed by an io-worker process) and also the latency is considerably reduced. To achieve the same or better performance with the exisiting code an io depth of 4 is required. Increasing the iodepth further does not lead to improvements.

In addition the latency of buffered write operations is reduced considerably.

P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14

The patch series has been disccussed on phoronix, twitter, Kernel Recipes presentation.

Add large CQE support for io-uring (April 2022, io-uring)

io-uring

This adds the large CQE support for io-uring. Large CQE’s are 16 bytes longer. To support the longer CQE’s the allocation part is changed and when the CQE is accessed. The allocation of the large CQE’s is twice as big, so the allocation size is doubled. The ring size calculation needs to take this into account.

All accesses to the large CQE’s need to be shifted by 1 to take the bigger size of each CQE into account. The existing index manipulation does not need to be changed and can stay the same.

The setup and the completion processing needs to take the new fields into account and initialize them. For the completion processing these fields need to be passed through.

The flush completion processing needs to fill the additional CQE32 fields.

The code for overflows needs to be adapted accordingly: the allocation needs to take large CQE’s into account. This means that the order of the fields in the io overflow structure needs to be changed and the allocation needs to be enlarged for big CQE’s. In addition the two new fields need to be copied for large CQE’s.

The new fields are added to the tracing statements, so the extra1 and extra2 fields are exposed in tracing. The new fields are also exposed in the /proc filesystem entry.

P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12

Add xattr support to io-uring

fs io-uring

This adds the xattr support to io_uring. The intent is to have a more complete support for file operations in io_uring.

This change adds support for the following functions to io_uring:

fgetxattr
fsetxattr
getxattr
setxattr

P1, P2, P3, P4

Make statx api for io-uring stable (Feb 2022, fs, io-uring)

fs io-uring

One of the key architectual tenets of io-uring is to keep the parameters for io-uring stable. After the call has been submitted, its value can be changed. Unfortunaltely this is not the case for the current statx implementation.

Make io-uring tracepoints consistent (Feb 2022, io-uring)

io-uring

This makes the io-uring tracepoints consistent. Where it makes sense the tracepoints start with the following four fields:

context (ring)
request
user_data
opcode.