On Tue, Apr 23, 2013 at 10:54:35AM -0500, Chandra Seetharaman wrote:
> On Tue, 2013-04-23 at 08:48 -0500, Mark Tinguely wrote:
> > On 04/22/13 18:30, Dave Chinner wrote:
> > > On Mon, Apr 22, 2013 at 10:11:39AM -0500, Mark Tinguely wrote:
> > >> #6 [ffff880135603980] _xfs_buf_find at ffffffffa01a7fef [xfs]
> > >> #7 [ffff8801356039f0] xfs_buf_get at ffffffffa01a824a [xfs]
> > >> #8 [ffff880135603a30] xfs_buf_read at ffffffffa01a83a4 [xfs]
> > >> #9 [ffff880135603a60] xlog_recover_inode_pass2 at ffffffffa0193629
> > >> [xfs]
> > >
> > > So it's the same problem as this bug fix addresses:
> > >
> > > commit 10616b806d1d7835b1d23b8d75ef638f92cb98b6
> > > Author: Dave Chinner<dchinner@xxxxxxxxxx>
> > > Date: Mon Jan 21 23:53:52 2013 +1100
> > >
> > > xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
> > >
> > > When _xfs_buf_find is passed an out of range address, it will fail
> > > to find a relevant struct xfs_perag and oops with a null
> > > dereference. This can happen when trying to walk a filesystem with a
> > > metadata inode that has a partially corrupted extent map (i.e. the
> > > block number returned is corrupt, but is otherwise intact) and we
> > > try to read from the corrupted block address.
> > >
> > > In this case, just fail the lookup. If it is readahead being issued,
> > > it will simply not be done, but if it is real read that fails we
> > > will get an error being reported. Ideally this case should result
> > > in an EFSCORRUPTED error being reported, but we cannot return an
> > > error through xfs_buf_read() or xfs_buf_get() so this lookup failure
> > > may result in ENOMEM or EIO errors being reported instead.
> > >
> > > Signed-off-by: Dave Chinner<dchinner@xxxxxxxxxx>
> > > Reviewed-by: Brian Foster<bfoster@xxxxxxxxxx>
> > > Reviewed-by: Ben Myers<bpm@xxxxxxx>
> > > Signed-off-by: Ben Myers<bpm@xxxxxxx>
> > >
> > >> The recovery value is bad and is a problem on its own, but XFS does
> > >> not verify the validity of ag number when doing a xfs_perag_get().
I'd be interested to know why the inode in recovery is bad - is this
on a kernel that CRCs the log records? Or a result of some other bug
or hardware corruption? i.e. xfs_perag_get is not the problem here,
it's a corruption of a trusted inode number and we failed to detect
> > > Right, that's what the above fix does, but it can't be done on older
> > > kernels because grwofs relies on being able to get buffers beyond
> > > the existing filesystem limits...
> > Thank-you, that make sense.
> > I still do not like assuming xfs_perag_get() will always return a valid
> > perag pointer.
> I second that.
> Is there any reason we should _not_ check the return value from
> xfs_perag_get() for NULL ?
Yes. The input AG should already be bounds checked before the perag
is looked up. If we are asking for an invalid AG, then the bug is not
in xfs_perag_get(), it is in the code that is calling it. i.e. error
checking the xfs_perag_get() function is a band-aid for improper
object validation, not a solution to the problem.
That is, this function was designed to be extremely low overhead and
only to be handed validated data. If it is only handed validated
data, then it is guaranteed to return a valid per-ag structure,
and therefore error checking the return value is not necessary.
Because xfs_perag_get is not designed to handle untrusted data it is
up to the calling code to first validate the AGNO that is passed to
xfs_perag_get(). If we aren't first validating the object that the
AGNO is derived from, then the calling code has failed to validate
it's inputs sufficiently, and lots of other things can go wrong (not
just the xfs_perag_get() call).
For example, the above commit is a catchall for bad block numbers
being looked up in extent records. It was a quick fix, not a
targeted fix for the reported problems. For bad block numbers in
extents, we should be doing is validating block numbers when they
are looked up are within the filesystem bounds (eg. inside
xfs_bmapi_read/xfs_bmapi_write) so that a bad block number is caught
at lookup time, not at IO time. We only do that for extents that
point to block 0.