As Ben and I discussed during the review of the initial CRC series,
inode allocation needs to log the entire inode to ensure the
replayed create transaction results in an inode with the correct
CRC. This means that the logging overhead of inode create doubled
for 256 byte inodes, and is close to 5x higher for 512 byte inodes.
Ben suggested that having a transaction to initialise buffers to
zero without needing to log them physically might be a way to solve
the problem. It would solve the problem, but I already have a
patchset from a few years back that introduces a new inode create
transaction that doesn't require any physical logging on inodes at
all.

This patch series is a forward port of my original work from 2009
(hence the SOBs being from david@xxxxxxxxxxxxx) with a couple of
more recent patches that will also help reduce inode buffer lookups
and hence improve performance.

The first two patches are for reducing the number of inode buffer
lookups. When we are allocating a new inode, the only reason we look
up the inode buffer is to read the generation number so we can
increment it. The first patch replaces the inode buffer read with randomly
calculating a new generation number, resulting in an inode
allocation being a purely in-memory operation requiring no IO. There
is a caveat to that - for people using noikeep, we still need to
ensure the generation number increments monotonically so we only
take the new path if that mount option is not set. This reduces
buffer lookups under create heavy workloads by roughly 10%.
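
To make the two paths concrete, here is a rough userspace sketch of the
choice described above. It is not the kernel code: new_inode_generation(),
fake_read_gen() and the keep_monotonic flag are made-up names, and rand()
only stands in for whatever random source the real patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical hook standing in for the inode buffer read (the slow path). */
typedef uint32_t (*read_gen_fn)(uint64_t ino);

/* Counts how often the stub below is called, i.e. how many lookups we did. */
static unsigned int gen_lookups;

/* Stub reader for illustration: pretends every inode has generation 41. */
static uint32_t fake_read_gen(uint64_t ino)
{
	(void)ino;
	gen_lookups++;
	return 41;
}

/*
 * Choose the generation number for a newly allocated inode.
 *
 * keep_monotonic models the mount option case from the text: the old
 * generation must be read back and incremented. Otherwise a random
 * value is used and no lookup happens at all.
 */
static uint32_t new_inode_generation(uint64_t ino, bool keep_monotonic,
				     read_gen_fn read_gen)
{
	if (keep_monotonic) {
		assert(read_gen != NULL);
		return read_gen(ino) + 1;	/* requires the buffer lookup */
	}
	return (uint32_t)rand();		/* purely in-memory, no IO */
}
```

With keep_monotonic false the reader callback is never invoked, which is
the whole point: no buffer lookup and no IO on allocation.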

The second patch removes a buffer lookup and modification on unlink
that was added for coherency with bulkstat back when bulkstat did
non-coherent inode lookups. Bulkstat is using coherent lookups
again, so the code in unlink is not necessary any more.

The remaining 5 patches are the new icreate transaction series. The
first patch introduces ordered buffers. These are buffers that are
modified in transactions but are not logged by the transaction. They
have an identical lifecycle to a normal buffer, and so pin the tail
of the log until they are written back. This enables us to log a
logical change and have all the physical changes behave as though
physical logging had been performed. This is used for the inode
buffers by the new icreate transaction.
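
As a toy model of that lifecycle (a sketch only, with invented names such
as struct buf_item, buf_log_bytes() and log_tail(); the real buffer log
item code looks nothing like this), the difference between a logged and
an ordered buffer is just how many bytes get copied into the log. The
tail pinning is identical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* How a buffer modified in a transaction is treated by the log. */
enum buf_mode { BUF_LOGGED, BUF_ORDERED };

struct buf_item {
	enum buf_mode	mode;
	size_t		len;		/* buffer size in bytes */
	uint64_t	lsn;		/* log position of the modifying commit */
	bool		written_back;	/* has the buffer reached disk yet? */
};

/* Bytes copied into the log: ordered buffers contribute nothing. */
static size_t buf_log_bytes(const struct buf_item *b)
{
	return b->mode == BUF_LOGGED ? b->len : 0;
}

/*
 * The log tail cannot move past the oldest buffer that is still dirty.
 * Note that the mode is not consulted: ordered and logged buffers pin
 * the tail in exactly the same way.
 */
static uint64_t log_tail(const struct buf_item *items, size_t n,
			 uint64_t head_lsn)
{
	uint64_t tail = head_lsn;

	assert(items != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!items[i].written_back && items[i].lsn < tail)
			tail = items[i].lsn;
	}
	return tail;
}
```

An ordered buffer therefore costs no log space, yet the tail only moves
forward once it has been written back, as if it had been logged.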

The rest of the patches are simply mechanical - introducing the
inode create log item, the changes to transaction reservations (uses
less space in the log), converting the code to selectively use the
new logging method and adding recovery support to it.
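
For a feel of why the reservation shrinks, here is some back-of-envelope
arithmetic in code form. The struct icreate_record layout and the helper
names are hypothetical, log headers are ignored, and it assumes a full
chunk of 64 inodes would otherwise be logged physically:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical logical record: describes the chunk instead of copying it. */
struct icreate_record {
	uint32_t ag;		/* allocation group */
	uint32_t agbno;		/* start block of the chunk within the AG */
	uint32_t count;		/* number of inodes initialised */
	uint32_t isize;		/* size of each inode in bytes */
	uint32_t length;	/* length of the extent in blocks */
	uint32_t gen;		/* generation number to stamp into the inodes */
};

/* Six 32-bit fields, so no padding is expected. */
static_assert(sizeof(struct icreate_record) == 24, "unexpected padding");

/* Payload if every inode in the chunk is logged in full. */
static size_t physical_log_bytes(uint32_t count, uint32_t isize)
{
	return (size_t)count * isize;
}

/* Payload if only the logical description is logged. */
static size_t logical_log_bytes(void)
{
	return sizeof(struct icreate_record);
}
```

Under those assumptions, 64 inodes of 256 bytes come to 16384 bytes of
payload and 64 inodes of 512 bytes to 32768 bytes, against 24 bytes for
the logical record regardless of inode size.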

Right now the code will use this transaction if the filesystem is
CRC enabled. Given that CRC enabled filesystems are experimental at
this point, adding a new log item type should not be a major problem
for anyone using them - just make sure the log is clean before
downgrading to an older kernel...

The patchset passes xfstests on non-CRC filesystems without new
regressions, and the initial two patches result in a ~10%
improvement in 8-way create speed and a ~15% improvement in 8-way
unlink speed. I don't have any numbers on CRC enabled filesystems as
I've been working on the userspace CRC patchset and getting that
into shape rather than testing and benchmarking kernel CRC code...

Comments, thoughts, flames?

PS. I'm working on an equivalent patchset for unlink that logs the
unlinked list as part of the inode core for CRC enabled
filesystems. That's a little bit away from working yet, though...