
Re: Buffer Head Corruption Found

To: Steve Lord <lord@xxxxxxx>
Subject: Re: Buffer Head Corruption Found
From: Russell Cattelan <cattelan@xxxxxxxxxxx>
Date: Wed, 24 Jan 2001 16:38:08 -0600
Cc: dcox@xxxxxxxxxx, Linux-XFS <linux-xfs@xxxxxxxxxxx>
References: <200101242150.f0OLo0n22823@xxxxxxxxxxxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
Steve Lord wrote:

I should have responded to this earlier; I think I know what the problem
is.

XFS allocates private memory for the buffer_heads used in metadata
operations. If one of these buffer_heads is fed to one of the buffer_head
reclaim functions, it will end up on one of the lists that manage the
global buffer_head pool. This is of course an error, and it typically
shows up when somebody tries to reference b_next_free.
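
To make the failure mode concrete, here is a small userspace model, not kernel code. It assumes the global pool is a circular doubly linked list and that a privately owned head is later zeroed by its owner while still linked. The struct and the function names (bh_model, model_list_add, model_list_is_corrupt, model_reproduce) are made up for the illustration; only b_next_free is a name from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct buffer_head: only the free-list
 * links matter for this illustration. */
struct bh_model {
    struct bh_model *b_next_free;
    struct bh_model *b_prev_free;
};

/* Link bh at the tail of the circular list headed by *head. */
static void model_list_add(struct bh_model **head, struct bh_model *bh)
{
    if (*head == NULL) {
        bh->b_next_free = bh;
        bh->b_prev_free = bh;
        *head = bh;
        return;
    }
    bh->b_next_free = *head;
    bh->b_prev_free = (*head)->b_prev_free;
    (*head)->b_prev_free->b_next_free = bh;
    (*head)->b_prev_free = bh;
}

/* Walk at most `limit` nodes. Returns 1 if a NULL b_next_free is found
 * or the walk never gets back to head, 0 if the list is a clean ring. */
static int model_list_is_corrupt(const struct bh_model *head, size_t limit)
{
    const struct bh_model *p = head;
    size_t i;

    if (head == NULL)
        return 0;
    for (i = 0; i < limit; i++) {
        if (p->b_next_free == NULL)
            return 1;
        p = p->b_next_free;
        if (p == head)
            return 0;
    }
    return 1;
}

/* Reproduce the suspected sequence; returns 1 if corruption is detected. */
static int model_reproduce(void)
{
    struct bh_model pool[2];
    struct bh_model private_bh;
    struct bh_model *head = NULL;

    memset(pool, 0, sizeof(pool));
    memset(&private_bh, 0, sizeof(private_bh));

    model_list_add(&head, &pool[0]);
    model_list_add(&head, &pool[1]);
    assert(!model_list_is_corrupt(head, 16));

    /* The mistake: a generic reclaim path links a privately owned head. */
    model_list_add(&head, &private_bh);

    /* The owner later reuses and zeroes it while it is still linked. */
    memset(&private_bh, 0, sizeof(private_bh));

    return model_list_is_corrupt(head, 16);
}
```

In this model the walk reaches the zeroed node and finds a NULL b_next_free, which matches the symptom described below. The allocator never handed out a bad pointer; the node was simply linked onto a list it did not belong to.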

Basically, we need to find the spot where an XFS buffer_head is being
reclaimed in the normal fashion rather than by our end_io routines.

One place this could happen is in the default end_io function. If the RAID
code is resetting the end_io function at some point, that would be a good
place to start looking.
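
One way to hunt for such a spot can be sketched in userspace, purely as an illustration. The idea is to tag the private heads and count any that arrive in the generic completion handler. Everything here is hypothetical (bh_sketch, the is_private tag, stacked_submit, the misrouted counter); b_end_io is only named after the kernel callback field, and the "stacked driver" merely models the suspicion that some layer resets the handler.

```c
#include <assert.h>
#include <stddef.h>

struct bh_sketch;
typedef void (*end_io_fn)(struct bh_sketch *bh, int uptodate);

/* Hypothetical model of a buffer head with a completion callback. */
struct bh_sketch {
    end_io_fn b_end_io;   /* completion callback */
    int is_private;       /* hypothetical debug tag set by the owner */
    int on_global_list;   /* hypothetical: set when the generic path claims it */
};

static int misrouted;     /* private heads seen by the generic handler */

/* The owner's handler: never touches the global lists. */
static void private_end_io(struct bh_sketch *bh, int uptodate)
{
    (void)uptodate;
    bh->on_global_list = 0;
}

/* The generic handler: claims the head for the global lists. */
static void generic_end_io(struct bh_sketch *bh, int uptodate)
{
    (void)uptodate;
    if (bh->is_private)
        misrouted++;      /* in a kernel, this is where one would log */
    bh->on_global_list = 1;
}

/* Hypothetical stacked layer that may overwrite the callback
 * before completing the I/O. */
static void stacked_submit(struct bh_sketch *bh, int resets_end_io)
{
    if (resets_end_io)
        bh->b_end_io = generic_end_io;
    bh->b_end_io(bh, 1);
}

/* Returns how many private heads reached the generic handler. */
static int end_io_sketch_run(int resets_end_io)
{
    struct bh_sketch bh = { private_end_io, 1, 0 };

    misrouted = 0;
    stacked_submit(&bh, resets_end_io);
    return misrouted;
}
```

Calling end_io_sketch_run(1) counts one misrouted head and end_io_sketch_run(0) counts none. A similar tag-and-check in the real reclaim path would name the caller responsible, assuming a spare field or flag is available for the tag.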

Note:
Currently page_buf.c has this bit of code:
error:
        /* If we ever do get here then clean up what we already did */
        for (itr=0; itr < cnt; itr++) {
                buffer_IO_error(psync->bh[itr]);
        }
        return err;

This call to buffer_IO_error is an error: it would put the buffer_head
on the lists.
I don't think we ever hit this path, but you might want to take it out as
a first pass.

> 
> > All,
> >
> >       I just spent today performing a pseudo binary search for a buffer head
> > corruption I have been experiencing with XFS and RAID5.  I have no idea
> > why it only happens in this instance, as you'll see.
> >
> >       In page_buf.c, around line 1424, a call is made to
> > kmem_cache_alloc().  The short story is: at least one pointer is
> > returned that is already in use!
> >
> >       I wrote a function that steps through the buffer_head lists and
> > checks for b_next_free == NULL.  Since it's a circular list, that should
> > never be true.
> >
> >       However, after the call to kmem_cache_alloc, and the subsequent
> > 'memset(bh, 0,...)', I have my NULL.  This also is the source of most of
> > my Oopses from within buffer.c.  Those functions are not expecting a NULL
> > in b_next_free at all ;-).
> >
> >       So: I've found it, but I have no idea why kmem_cache_alloc would
> > return a previously used bh, nor what to do about it.
> 
> Hmm, I am not sure how kmem_cache_alloc can do that either, is it not more
> likely that a buffer is being freed, but not removed from the list - i.e. the
> needle is in that other haystack over there. Maybe turning on memory poisoning
> will make things fall over faster - in mm/slab.c there are three defines :
> 
> #define DEBUG           0
> #define STATS           0
> #define FORCED_DEBUG    0
> 
> I think you want to set the DEBUG flag to 1
> 
> Steve
> 
> >
> >       Ideas?
> >
> >       Thanks!
> >
> > --
> > "Men occasionally stumble over the truth, but most of them pick
> > themselves up and hurry off as if nothing had happened."
> >    -- Winston Churchill
> >
> > Danny

-- 
Russell Cattelan
--
Digital Elves inc. -- Currently on loan to SGI
Linux XFS core developer.
