Hi guys,
On Fri, Mar 12, 2004 at 09:19:51AM -0600, Steve Lord wrote:
> Have you successfully used xfs on this box with older kernels, or is
> this a new filesystem? Was this the first mount under 2.6.4-mm1?
This would be useful to know.
On Fri, Mar 12, 2004 at 09:43:07PM -0600, Steve Lord wrote:
> > On Fri, Mar 12, 2004 at 09:19:51AM -0600, Steve Lord wrote:
> >
> >>I see this is a sparc, any chance you could provide disassembly of the
> >>xfs_next_bit function. I wonder if it is playing up on this processor,
> >>it makes use of ffs and we have had some architecture issues with it
> >>before.
>
> Well, it's a little weird; it looks like something is broken
> compilation-wise in the recovery code, or we trampled on ourselves.
>
> We passed a null pointer into xfs_next_bit, which is why it blew up;
> the caller does have a way of doing this:
>
>     unsigned int *data_map = NULL;
>     unsigned int map_size = 0;
>
>     switch (buf_f->blf_type) {
>     case XFS_LI_BUF:
>         data_map = buf_f->blf_data_map;
>         map_size = buf_f->blf_map_size;
>         break;
>     case XFS_LI_6_1_BUF:
>     case XFS_LI_5_3_BUF:
>         obuf_f = (xfs_buf_log_format_v1_t*)buf_f;
>         data_map = obuf_f->blf_data_map;
>         map_size = obuf_f->blf_map_size;
>         break;
>     }
>
>     ......
>
>     bit = xfs_next_bit(data_map, map_size, bit);
>
> We definitely did not hit any of the expected cases in the switch
> statement and we passed null into xfs_next_bit().
Hmmm. That should be impossible given the caller, which does
suggest a compiler issue, I agree.
> However, the only caller of this function does a check of the same
> data structure and will fail recovery if it does not get one of the
> recognized flags.
>
> So unless some of the code in between the two stamped on the memory
> in the mean time it looks like a compilation bug.
There's not much code in between for something to go wrong, and
nothing I can see that explicitly futzes with buf_f, where the
null pointer would be originating.
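To make the failure mode concrete, here is a minimal standalone sketch
of the kind of extra check that would tell us whether the switch really
fell through. It is an illustration only, not the recovery code: the
demo_* names, the type values and the cut-down structure are all made
up for the example, and the v1-format cast from the real caller is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the XFS log item types; values are illustrative. */
enum { DEMO_LI_BUF = 1, DEMO_LI_6_1_BUF = 2, DEMO_LI_5_3_BUF = 3 };

/* Hypothetical, cut-down version of the buffer log format header. */
struct demo_buf_log_format {
	unsigned short	blf_type;
	unsigned int	blf_map_size;
	unsigned int	blf_data_map[1];
};

/*
 * Pick the bitmap the same way the quoted caller does, but report an
 * unrecognised type instead of letting a NULL map reach the bit scanner.
 * Returns 0 on success, -1 if no case in the switch matched.
 */
static int demo_select_map(struct demo_buf_log_format *buf_f,
			   unsigned int **data_map, unsigned int *map_size)
{
	assert(buf_f != NULL);

	*data_map = NULL;
	*map_size = 0;

	switch (buf_f->blf_type) {
	case DEMO_LI_BUF:
	case DEMO_LI_6_1_BUF:
	case DEMO_LI_5_3_BUF:
		*data_map = buf_f->blf_data_map;
		*map_size = buf_f->blf_map_size;
		break;
	default:
		/* Log what we actually saw, so corruption and miscompilation can be told apart. */
		fprintf(stderr, "unexpected blf_type 0x%x at %p\n",
			(unsigned int)buf_f->blf_type, (void *)buf_f);
		return -1;
	}
	return 0;
}
```

If a check like this fires with a sane-looking blf_type, that points at
the compiler; if the type is garbage, something stamped on the buffer.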
> Is this repeatable? We could probably come up with some extra checks in
> here to narrow it down. It might also be worth doing an
>
> xfs_logprint -t -i -b /dev/xxx
>
> if this happens again and sending the output. This would at least
> determine if what was on the disk was correct.
>
> Nathan, any other ideas?
Yeah, the above dump would be useful; ideally we need to find a
reproducible test case too - this might help there...
On Fri, Mar 12, 2004 at 07:48:14PM -0800, William Lee Irwin III wrote:
> On Fri, Mar 12, 2004 at 09:43:07PM -0600, Steve Lord wrote:
> I have to brew up some way to corrupt the fs and get it to try to
> recover via mounting it to reproduce, which doesn't happen for ordinary
> pulling-the-plug types of affairs. It did recur 3 times once I
> got it corrupted properly, so it's mostly a question of doing that. I
> think it's related to heavy io going on when the plug is pulled.
Attached is a test program that's been used to exercise log
recovery in a more user-friendly fashion. It uses the xfs
forced-shutdown mode to get a dirty log without having to
pull the plug. So, generate traffic, shutdown, unmount, &
then the next mount will do log recovery. If you find the
right traffic to generate a reproducible failure, diagnosing
this becomes a whole lot easier. There are other tools for
generating all manner of different types of traffic in the
xfstests directory in the xfs userspace cvs too - fsstress
is a good one for generating lots of metadata operations.
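For readers without the attachment, the forced-shutdown step can be
sketched roughly as below. This is not the attached program, just a
minimal illustration of the idea. It assumes the XFS_IOC_GOINGDOWN
ioctl and its flag values as defined in the xfs headers (they are
redefined here only so the sketch is self-contained, so check them
against your own xfs_fs.h); demo_force_shutdown is a made-up name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Assumption: these normally come from the xfs headers (xfs_fs.h).
 * Verify the values against your installed headers before relying on them.
 */
#ifndef XFS_IOC_GOINGDOWN
#define XFS_IOC_GOINGDOWN		_IOR('X', 125, uint32_t)
#endif
#ifndef XFS_FSOP_GOING_FLAGS_LOGFLUSH
#define XFS_FSOP_GOING_FLAGS_LOGFLUSH	0x1	/* flush the log, not the data */
#endif
#ifndef XFS_FSOP_GOING_FLAGS_NOLOGFLUSH
#define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH	0x2	/* flush neither log nor data */
#endif

/*
 * Hypothetical helper: force a shutdown of the filesystem containing
 * `path` (normally the mount point). Returns 0 on success or a negative
 * errno value. Only ever point this at a scratch filesystem.
 */
static int demo_force_shutdown(const char *path, int flush_log)
{
	uint32_t flag = flush_log ? XFS_FSOP_GOING_FLAGS_LOGFLUSH
				  : XFS_FSOP_GOING_FLAGS_NOLOGFLUSH;
	int fd, err = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, XFS_IOC_GOINGDOWN, &flag) < 0)
		err = -errno;

	close(fd);
	return err;
}
```

The sequence is then as described above: run the traffic generator on
the scratch mount, call the helper on the mount point while it is still
busy, unmount, and mount again to trigger log recovery.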
cheers.
--
Nathan
-- Binary/unsupported file stripped by Ecartis --
-- Type: text/x-csrc