Re: [PATCH 5/5] xfs: log recovery lsn ordering needs uuid check

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 5/5] xfs: log recovery lsn ordering needs uuid check
From: Ben Myers <bpm@xxxxxxx>
Date: Tue, 24 Sep 2013 12:14:46 -0500
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <1380002476-18839-6-git-send-email-david@xxxxxxxxxxxxx>
References: <1380002476-18839-1-git-send-email-david@xxxxxxxxxxxxx> <1380002476-18839-6-git-send-email-david@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Tue, Sep 24, 2013 at 04:01:16PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> After a fair number of xfstests runs, xfs/182 started to fail
> regularly with a corrupted directory - a directory read verifier was
> failing after recovery because it found a block with an XARM magic
> number (remote attribute block) rather than a directory data block.
>
> The first time I saw this repeated failure I did /something/ and the
> problem went away, so I was never able to find the underlying
> problem. Test xfs/182 failed again today, and I found the root
> cause before I did /something else/ that made it go away.
>
> Tracing indicated that the block in question was being correctly
> logged, the log was being flushed by sync, but the buffer was not
> being written back before the shutdown occurred. Tracing also
> indicated that log recovery was reading the block, but then
> never writing it before log recovery invalidated the cache,
> indicating that it was not modified by log recovery.
>
> More detailed analysis of the corpse indicated that the filesystem
> had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM
> block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which
> indicated it was a block from an older filesystem. The reason that
> log recovery didn't replay it was that the LSN in the XARM block was
> larger than the LSN of the transaction being replayed, and so the
> block was not overwritten by log recovery.
>
> Hence, log recovery can't blindly trust the magic number and LSN in
> the block - it must verify that it belongs to the filesystem being
> recovered before using the LSN. i.e. if the UUIDs don't match, we
> need to unconditionally recover the change held in the log.
>
> This patch was first tested on a block device that was repeatedly
> causing xfs/182 to fail with the same failure on the same block with
> the same directory read corruption signature (i.e. XARM block). It
> did not fail, and hasn't failed since.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>

Looks good to me.
Reviewed-by: Ben Myers <bpm@xxxxxxx>
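
For anyone following along, the replay decision described in the commit message can be sketched roughly as below. This is only an illustration, not the code in the patch: the `sketch_blk_hdr` structure, the `sketch_should_replay()` helper and the `SKETCH_UUID_LEN` constant are hypothetical names made up here, the LSN is compared as a plain integer for simplicity, and treating an equal LSN as "already up to date" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
#define SKETCH_UUID_LEN 16

struct sketch_blk_hdr {
	uint8_t  uuid[SKETCH_UUID_LEN];	/* filesystem UUID stamped in the block */
	uint64_t lsn;			/* LSN of the last change written to it */
};

/*
 * Return true if the change held in the log should be replayed over the
 * block currently on disk.
 */
static bool
sketch_should_replay(const struct sketch_blk_hdr *disk,
		     const uint8_t *fs_uuid, uint64_t replay_lsn)
{
	assert(disk != NULL && fs_uuid != NULL);

	/* Block belongs to some other (older) filesystem: its LSN is meaningless. */
	if (memcmp(disk->uuid, fs_uuid, SKETCH_UUID_LEN) != 0)
		return true;

	/* Same filesystem: replay only if the block is older than the logged change. */
	return disk->lsn < replay_lsn;
}
```

With a stale block whose UUID differs from the filesystem's, the helper returns true even when the stale LSN is larger than the LSN being replayed, which is the xfs/182 case described above. With a matching UUID it falls back to the LSN ordering check.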

