Re: btree re-write and XFS_WANT_CORRUPTED_GOTO

To: Peter Watkins <treestem@xxxxxxxxx>
Subject: Re: btree re-write and XFS_WANT_CORRUPTED_GOTO
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Fri, 20 May 2011 10:25:28 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <BANLkTinb3aQc9GSy4dqFVeVpGk8W4M92fA@xxxxxxxxxxxxxx>
References: <BANLkTinb3aQc9GSy4dqFVeVpGk8W4M92fA@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Thu, May 19, 2011 at 11:59:18AM -0400, Peter Watkins wrote:
> Hello again,
> 
> I've occasionally seen the XFS_WANT_CORRUPTED_GOTO error from
> xfs_free_extent, usually when cleaning up an unlinked file or
> truncating a file.
> 
> Is the btree rewrite known to fix any WANT_CORRUPTED problems? More

No.

> generally, why was the btree code reworked?

Three copies of almost the same btree code, spanning 15,000 lines
of code, reduced to one btree core file plus roughly 1,000 lines of
type-specific functions for each of the three btree types. All btrees
get WANT_CORRUPTED coverage instead of just the freespace btree. Much
easier to implement new btrees (~1,000 lines of code instead of
~4,000-7,000 lines). Bug fixes to the btree core get fixed in all
types, instead of just the one tree it was discovered for.
Optimisations need to touch one piece of code, not three, etc.
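
To illustrate the structure described above, here is a minimal
userspace sketch, not the actual XFS code. It assumes a shared core
routine that takes a per-type operations table, and a simplified check
macro modelled loosely on the idea behind XFS_WANT_CORRUPTED_GOTO.
Every name ending in _sketch, the record layout, and the error value
are hypothetical and exist only for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Arbitrary error value for this sketch; not taken from the kernel. */
#define EFSCORRUPTED_SKETCH 117

/*
 * Simplified stand-in for a WANT_CORRUPTED-style check: if the
 * invariant does not hold, report it, set 'error' and jump to the
 * cleanup label. Expects a local 'int error' in the calling function.
 */
#define WANT_CORRUPTED_GOTO_SKETCH(cond, label)				\
	do {								\
		if (!(cond)) {						\
			fprintf(stderr, "corruption check failed: %s\n", \
				#cond);					\
			error = EFSCORRUPTED_SKETCH;			\
			goto label;					\
		}							\
	} while (0)

/* Hypothetical free-extent record shared by both example btree types. */
struct rec_sketch {
	unsigned long startblock;
	unsigned long blockcount;
};

/* Per-type operations: the only part each btree type has to supply. */
struct btree_ops_sketch {
	const char *name;
	int (*rec_cmp)(const struct rec_sketch *a,
		       const struct rec_sketch *b);
};

static int cmp_ul(unsigned long a, unsigned long b)
{
	return (a > b) - (a < b);
}

/* Type 1: records ordered by start block. */
static int bno_cmp(const struct rec_sketch *a, const struct rec_sketch *b)
{
	return cmp_ul(a->startblock, b->startblock);
}

/* Type 2: records ordered by extent length, then by start block. */
static int cnt_cmp(const struct rec_sketch *a, const struct rec_sketch *b)
{
	int c = cmp_ul(a->blockcount, b->blockcount);

	return c ? c : cmp_ul(a->startblock, b->startblock);
}

const struct btree_ops_sketch bno_ops_sketch = { "by-block", bno_cmp };
const struct btree_ops_sketch cnt_ops_sketch = { "by-count", cnt_cmp };

/*
 * Shared core: verify that the records in one block are in strictly
 * ascending order for the given btree type. Because the check lives
 * in the core, every type that plugs in an ops table gets it.
 */
int btree_core_check_order(const struct btree_ops_sketch *ops,
			   const struct rec_sketch *recs, size_t nrecs)
{
	int error = 0;
	size_t i;

	for (i = 1; i < nrecs; i++)
		WANT_CORRUPTED_GOTO_SKETCH(
			ops->rec_cmp(&recs[i - 1], &recs[i]) < 0, out);
out:
	return error;
}
```

Calling btree_core_check_order() with bno_ops_sketch or cnt_ops_sketch
runs the same core check against either ordering. In this sketch, a
new btree type would only need its own comparison function and ops
table.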

> going from 2.6.27 to 2.6.32 (still old I know), which seems to span
> the btree re-write.
> 
> Are any particular patches recommended for this problem? I came across
> 24446fc66fdebbdd8baca0f44fd2a47ad77ba580. It's discussed at
> http://oss.sgi.com/archives/xfs/2011-01/msg00266.html  Do any others
> come to mind?

That problem required CXFS to reproduce - mainline XFS never
executes the particular code path that triggered the bug.

Seriously, if you are having problems with btree corruption, the
first thing you need to do is run with the latest and greatest code.
We fix problems all the time, so asking "what commits from the past
12 releases will fix this problem" is kinda pointless - we can't
give you any sort of reasonable answer to that.

If you can reproduce the problem on a recent kernel (2.6.38 or .39)
then we know we've got a bug that has not been fixed yet, and that
means we need to spend the effort to find it. However, if you
can't reproduce it on a current kernel, then there's really nothing
much we can do to help you identify the cause - you can run a bisect
on your reproducing workload and find the exact patch that fixed the
problem much more easily than we can.

Of course, if you really want someone to do all this work and fix
these sorts of problems on older kernels for you, then that's the
value proposition that using RHEL or SLES brings to the table.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
