Re: xfs recovery oops in 2.6.4-mm1

To: William Lee Irwin III <wli@xxxxxxxxxxxxxx>
Subject: Re: xfs recovery oops in 2.6.4-mm1
From: Steve Lord <lord@xxxxxxx>
Date: Fri, 12 Mar 2004 21:43:07 -0600
Cc: linux-xfs@xxxxxxxxxxx, Nathan Scott <nathans@xxxxxxx>
In-reply-to: <20040312233153.GT655@holomorphy.com>
References: <20040312100025.GP655@holomorphy.com> <4051D517.4070005@xfs.org> <20040312233153.GT655@holomorphy.com>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mozilla Thunderbird 0.5 (X11/20040208)
William Lee Irwin III wrote:
>>William Lee Irwin III wrote:
>>
>>>Console log attached. Remote kernel hacking -level access to the system
>>>for debugging can be arranged.
> 
> 
> On Fri, Mar 12, 2004 at 09:19:51AM -0600, Steve Lord wrote:
> 
>>I see this is a sparc, any chance you could provide disassembly of the
>>xfs_next_bit function. I wonder if it is playing up on this processor,
>>it makes use of ffs and we have had some architecture issues with it
>>before.
>>Not that I can read sparc assembler, but I can take a crack at it ;-)
>>Have you successfully used xfs on this box with older kernels, or is
>>this a new filesystem? Was this the first mount under 2.6.4-mm1?
> 
> 
> It holds up under normal usage. It seems that recovery is the only time
> this happens.
> 
> 
> -- wli
> 

Well, it's a little weird. It looks like either something is broken
compilation-wise in the recovery code, or we trampled on our own memory.

We passed a null pointer into xfs_next_bit, which is why it blew up.
The caller does have a way of doing this:

         unsigned int            *data_map = NULL;
         unsigned int            map_size = 0;

         switch (buf_f->blf_type) {
         case XFS_LI_BUF:
                 data_map = buf_f->blf_data_map;
                 map_size = buf_f->blf_map_size;
                 break;
         case XFS_LI_6_1_BUF:
         case XFS_LI_5_3_BUF:
                 obuf_f = (xfs_buf_log_format_v1_t*)buf_f;
                 data_map = obuf_f->blf_data_map;
                 map_size = obuf_f->blf_map_size;
                 break;
         }

......

        bit = xfs_next_bit(data_map, map_size, bit);

We definitely did not hit any of the expected cases in the switch
statement and we passed null into xfs_next_bit().

However, the only caller of this function does a check of the same
data structure and will fail recovery if it does not get one of the
recognized flags.

So unless some of the code in between the two stamped on the memory
in the meantime, it looks like a compilation bug.

Is this repeatable? We could probably come up with some extra checks in
here to narrow it down. It might also be worth doing an

        xfs_logprint -t -i -b /dev/xxx

if this happens again and sending the output. This would at least
determine if what was on the disk was correct.
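
As a rough idea of the kind of extra check meant above, here is a
user-space sketch, not the actual recovery code. The demo_ names, the
struct layout and the type values are all made up for illustration. It
adds a default case so that an unrecognized type is reported instead of
silently leaving the map pointer NULL:

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in type tags; the values are arbitrary, NOT the kernel's. */
enum demo_li_type {
	DEMO_LI_BUF     = 1,
	DEMO_LI_6_1_BUF = 2,
	DEMO_LI_5_3_BUF = 3
};

/* Hypothetical, simplified stand-in for the buffer log format. */
struct demo_buf_log_format {
	unsigned int blf_type;
	unsigned int blf_map_size;
	unsigned int blf_data_map[4];
};

/*
 * Pick the bitmap for a logged buffer.
 * Returns 0 on success, or -1 if the type is not one we recognize or
 * the map is empty. On failure the outputs are left NULL/0 and the
 * offending values are printed, so the caller can bail out before
 * handing a bad map to a bit-scanning routine.
 */
static int demo_select_map(struct demo_buf_log_format *buf_f,
			   unsigned int **data_map,
			   unsigned int *map_size)
{
	*data_map = NULL;
	*map_size = 0;

	switch (buf_f->blf_type) {
	case DEMO_LI_BUF:
	case DEMO_LI_6_1_BUF:
	case DEMO_LI_5_3_BUF:
		break;
	default:
		/* The extra check: none of the expected cases matched. */
		fprintf(stderr, "unexpected blf_type 0x%x\n",
			buf_f->blf_type);
		return -1;
	}

	if (buf_f->blf_map_size == 0) {
		fprintf(stderr, "empty data map for blf_type 0x%x\n",
			buf_f->blf_type);
		return -1;
	}

	*data_map = buf_f->blf_data_map;
	*map_size = buf_f->blf_map_size;
	return 0;
}
```

If the type really is valid when the caller checks it and bogus by the
time we get here, a message like this would at least show what value
it had turned into.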

Nathan, any other ideas?

Steve

