
[RFC, PATCH 0/12] xfs: compound buffers for directory blocks

To: xfs@xxxxxxxxxxx
Subject: [RFC, PATCH 0/12] xfs: compound buffers for directory blocks
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 7 Dec 2011 17:18:11 +1100
This series is an infrastructure change needed to allow CRCs to be
easily implemented on directory blocks. Directory blocks can be
larger than filesystem blocks and are mapped like data in a file via
the inode block map btree. Hence a given directory block can be made
up of discontiguous filesystem blocks.

The current way of handling this is via the struct xfs_dabuf - a
separate structure that tracks individual struct xfs_bufs for each
discontiguous region of a directory block. This abstracts the
discontiguity away from all the directory code by hiding it behind a
linear memory buffer and memcpy()ing to and from the underlying
xfs_bufs as the dabuf is created and destroyed for each directory
operation that operates on a given directory block. The struct
xfs_bufs are cached, but the dabuf is not, leading to significant
overhead in constructing, destroying and modifying large directory
blocks.
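
As a rough illustration of the copy-in/copy-out pattern described
above, here is a minimal userspace sketch. It is not the kernel code:
struct piece, dabuf_gather() and dabuf_scatter() are hypothetical
names standing in for the cached xfs_bufs and the dabuf setup and
teardown paths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one cached xfs_buf backing part of a
 * directory block. */
struct piece {
	unsigned char	*data;
	size_t		len;
};

/* Gather: copy every discontiguous piece into one linear buffer.
 * Returns the number of bytes copied. */
static size_t dabuf_gather(unsigned char *linear, size_t cap,
			   const struct piece *p, size_t n)
{
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		assert(off + p[i].len <= cap);
		memcpy(linear + off, p[i].data, p[i].len);
		off += p[i].len;
	}
	return off;
}

/* Scatter: copy the linear buffer back out to the pieces.
 * Returns the number of bytes copied. */
static size_t dabuf_scatter(const unsigned char *linear, size_t len,
			    const struct piece *p, size_t n)
{
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		assert(off + p[i].len <= len);
		memcpy(p[i].data, linear + off, p[i].len);
		off += p[i].len;
	}
	return off;
}
```

Every modification of a multi-piece block pays for one gather and one
scatter in this model, which is the overhead in question.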

Further, because CRC support requires a single CRC for each directory
block, we need to keep the buffer in an aggregated state until we do
IO on it and can run a CRC calculation callback. With the xfs_dabuf
destroyed long before write IO occurs, there is no way to calculate
the CRC sanely.

To solve this problem we effectively need the functionality of a
xfs_dabuf in a struct xfs_buf. That is, an xfs_buf needs to be able
to map a discontiguous block range and aggregate all the IO needed
to read and write such a discontiguous buffer. Further, the buffer
logging needs to support discontiguous ranges as well, and translate
the new in-memory construct into the existing individual
discontiguous buffer log format.

To do this, the xfs_buf has a block vector array added to it,
similar in concept to the page array. When IO is issued, it issues
separate IO for each vector in the block array, building the IO
appropriately from the page array. In this way, we avoid the need
for a separate memory buffer for the directory code to work on - it
can work directly on the vmapped buffer address. Hence we remove two
memcpy()s from each large directory block modification. Adding an IO
count for each vector means that the current method of dispatching,
completing and waiting for IO is unchanged.
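
To make the per-vector dispatch concrete, a minimal userspace sketch
follows. All of the names here (struct blk_vec, struct cbuf,
cbuf_submit(), cbuf_io_done(), count_bytes()) are hypothetical and
not the interfaces in the patches, and the plain int counter stands
in for whatever atomic accounting the real buffer uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE	512

/* One contiguous disk range: start sector and length in sectors. */
struct blk_vec {
	uint64_t	daddr;
	uint32_t	nsect;
};

/* Hypothetical compound buffer: one linear mapping, many disk ranges. */
struct cbuf {
	unsigned char	*addr;		/* contiguous (vmapped) address */
	struct blk_vec	vec[4];		/* block vector array */
	int		nvec;
	int		io_remaining;	/* IOs still in flight */
};

typedef void (*submit_fn)(uint64_t daddr, unsigned char *mem,
			  size_t bytes, void *ctx);

/* Issue one IO per vector, each covering its slice of the mapping.
 * Returns the number of IOs issued. */
static int cbuf_submit(struct cbuf *bp, submit_fn submit, void *ctx)
{
	size_t off = 0;

	bp->io_remaining = bp->nvec;
	for (int i = 0; i < bp->nvec; i++) {
		size_t bytes = (size_t)bp->vec[i].nsect * SECTOR_SIZE;

		submit(bp->vec[i].daddr, bp->addr + off, bytes, ctx);
		off += bytes;
	}
	return bp->nvec;
}

/* Per-IO completion: returns 1 only when the last IO has finished. */
static int cbuf_io_done(struct cbuf *bp)
{
	assert(bp->io_remaining > 0);
	return --bp->io_remaining == 0;
}

/* Example submit callback: only totals the bytes it was handed. */
static void count_bytes(uint64_t daddr, unsigned char *mem,
			size_t bytes, void *ctx)
{
	(void)daddr;
	(void)mem;
	*(size_t *)ctx += bytes;
}
```

Because the waiter only sees "buffer done" once the count drops to
zero, a caller does not need to know how many vectors were involved.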

Further, by modifying the buffer item formatting to deal with
discontiguous buffers, we remove the need for the xfs_dabuf to
interpose to select the correct xfs_buf to record the changes to.
This means that compound buffers can be used completely
transparently throughout the existing XFS codebase (not just the
directory code) without any modification.

To build compound buffers, we need some method of specifying the
block map. We already have a structure for this - the struct
xfs_bmbt_irec, which is what xfs_bmapi_*() uses and is the native
format for maps in the directory code. Hence it makes sense to pass
these into the buffer cache as a method of specifying discontiguous
block ranges.

It makes further sense to use struct xfs_bmbt_irec as the internal
representation of block ranges for all the buffer interfaces, but
this requires one extension. That is, the bmbt format currently only
supports filesystem block sized units (FSB) and metadata requires
sector (disk) addressing (DADDR) units. This is easily handled by
adding a new state value that is held in the xfs_bmbt_irec.br_state
field to indicate what unit the xfs_bmbt_irec map is encoded in.
With this, the irec format can be used throughout the buffer
interfaces to support discontiguous buffers everywhere.
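
The idea of tagging a map with its unit can be sketched as below.
This is only an illustration under loud assumptions: struct ext_map,
enum map_unit and ext_to_daddr() are hypothetical names rather than
the xfs_bmbt_irec definitions, and the conversion assumes a flat
address space where a filesystem block number is a simple multiple of
the sector number. The real FSB to DADDR conversion in XFS also has
to account for allocation groups, which is ignored here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unit tag, playing the role of the new state value. */
enum map_unit {
	MAP_UNIT_FSB,		/* filesystem block units */
	MAP_UNIT_DADDR,		/* 512 byte sector units */
};

/* Hypothetical extent map, loosely modelled on an irec. */
struct ext_map {
	uint64_t	start;
	uint64_t	count;
	enum map_unit	unit;
};

/* Return the map expressed in sector units. sect_per_fsb_log2 is
 * log2(filesystem block size / 512). Maps that are already in
 * sector units are returned unchanged. Flat address space assumed. */
static struct ext_map ext_to_daddr(struct ext_map m,
				   unsigned int sect_per_fsb_log2)
{
	assert(sect_per_fsb_log2 < 32);
	if (m.unit == MAP_UNIT_FSB) {
		m.start <<= sect_per_fsb_log2;
		m.count <<= sect_per_fsb_log2;
		m.unit = MAP_UNIT_DADDR;
	}
	return m;
}
```

With the unit carried in the map itself, one interface can accept
either kind of map and normalise it at the point of use.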

Finally, with all these changes, the struct xfs_dabuf is not
necessary anymore, so can be removed.

The series passes xfstests on 4k/4k, 4k/512b, 64k/4k and 64k/512b
(dirblksz/fsblksz) configurations without any new regressions, and
survives 100 million inode fs_mark benchmarks on a 17TB filesystem
using 4k/4k, 64k/512b and 64k/512b configurations.

Some of the series is a bit verbose - code is rearranged a couple of
times to suit testing step by step (e.g. duplicate code in the
patch that introduces a new interface, factor the duplication back
out in a later patch), so it could probably be done more neatly. However,
I'd prefer not to have to redo the entire series to avoid this
if the end result is substantially identical code - it's time
consuming to make sure each patch doesn't break stuff and I'd like
to try to get this into 3.3 so I can focus on the real goal (CRC
support) ASAP.

Comments, flames and ridicule all welcome. :)


