[PATCH v2] xfs: always log the inode on unwritten extent conversion
Dave Chinner
david at fromorbit.com
Thu Apr 23 17:32:58 CDT 2015
On Thu, Apr 23, 2015 at 12:42:44PM -0400, Brian Foster wrote:
> The fsync() requirements for crash consistency on XFS are to flush file
> data and force any in-core inode updates to the log. We currently check
> whether the inode is pinned to identify whether the log needs to be
> forced, since a non-zero pin count generally represents an inode that
> has transactions awaiting a flush to the on-disk log.
>
> This is not sufficient in all cases, however. Reports of xfstests test
> generic/311 failures on ppc64/s390x hosts have identified failures to
> fsync outstanding inode modifications due to the inode not being pinned
> at the time of the fsync. This occurs because certain bmap updates can
> complete by logging bmapbt buffers but without ever dirtying (and thus
> pinning) the core inode. The following is a specific incarnation of this
> problem:
>
> $ mount $dev /mnt -o noatime,nobarrier
> $ for i in $(seq 0 2 31); do \
> xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
> done
> $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
> hexdump /mnt/file; \
> ./xfstests-dev/src/godown /mnt
> ...
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0014000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
> $ umount /mnt; mount ...
> $ hexdump /mnt/file
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
>
> In short, the unwritten extent conversion for the last write is lost
> despite the fact that an fsync executed before the filesystem was
> shutdown. Note that this is impossible to reproduce on v5 supers due to
> unconditional time callbacks for di_changecount and highly difficult to
> reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
> frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
> reduces timer granularity enough to increase the odds that time updates
> are skipped and allows this to reproduce within a handful of attempts.
>
> To deal with this problem, make sure that the inode is logged in the
> unwritten extent conversion path. Fix up the logflags, if necessary,
> after the extent conversion to keep the extent update code consistent
> with the other extent update helpers. This fixup is not necessary for
> the other (hole, delay) extent helpers because they execute in the block
> allocation codepath, which already logs the inode for other reasons
> (e.g., for di_nblocks).
>
> Signed-off-by: Brian Foster <bfoster at redhat.com>
> ---
>
> v2:
> - Log inode unconditionally on unwritten extent conversion and retain
> the fsync pincount check.
> v1: http://oss.sgi.com/pipermail/xfs/2015-April/041468.html
>
> fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index aeffeaa..e74e42bf 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4417,6 +4417,21 @@ xfs_bmapi_convert_unwritten(
> error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
> &bma->cur, mval, bma->firstblock, bma->flist,
> &tmp_logflags);
> + /*
> + * Unwritten extent conversion might not have dirtied the inode
> + * depending on the extent state. Unlike block allocation (e.g.,
> + * di_nblocks), there may be no other reason to log the inode in the
> + * unwritten extent conversion path.
> + *
> + * We need to make sure the inode is dirty in the transaction for the
> + * sake of fsync(), which will not force the log for this transaction
> + * unless it sees the inode pinned. This can only happen for btree
> + * format inodes so use XFS_ILOG_CORE.
> + */
> + if (!error && !tmp_logflags) {
> + ASSERT(bma->cur);
> + tmp_logflags |= XFS_ILOG_CORE;
> + }
> bma->logflags |= tmp_logflags;
> if (error)
> return error;
I'd just do:
bma->logflags |= tmp_logflags | XFS_ILOG_CORE;
Because it really doesn't matter if we log an unchanged inode core
or not - it's likely already in the CIL or AIL given we are doing
unwritten extent conversion, so it is unlikely to introduce
significant new overhead from doing this....
Cheers,
Dave.
--
Dave Chinner
david at fromorbit.com
More information about the xfs
mailing list