Re: Questions about pagebuf code

To: Russell Cattelan <cattelan@xxxxxxx>
Subject: Re: Questions about pagebuf code
From: Craig Tierney <ctierney@xxxxxxxx>
Date: Mon, 10 May 2004 09:25:01 -0600
Cc: linux-xfs@xxxxxxxxxxx, dritch@xxxxxxxx
In-reply-to: <1083872935.28589.27.camel@naboo.americas.sgi.com>
References: <1083435856.2302.3.camel@localhost.localdomain> <20040501194709.A23768@infradead.org> <1083446482.2302.22.camel@localhost.localdomain> <1083614597.24397.16.camel@naboo.americas.sgi.com> <1083870335.2376.10.camel@hpti9.fsl.noaa.gov> <1083872935.28589.27.camel@naboo.americas.sgi.com>
Sender: linux-xfs-bounce@xxxxxxxxxxx
> > It appears my corruption pattern is different (but not unrelated).
> > I finally confirmed that when a file is corrupted, it is corrupted
> > with data coming from another write.  For example, temperature gets
> > overwritten with height, or velocity is overwritten with humidity.
> The problem is trying to figure out if the data is stale or really from
> a different process.
> If the data block was previously used and then freed either by removing
> the old file or truncating it and then the new file reallocates that
> free block into its file but never writes to that block, the data is
> stale.

I ran several tests over the weekend where I wrote a pattern
to the disk (val=pos%256).  My tests showed that the pages
are just not being written.  I see the original pattern, and
not values that look like data.  It seems that my problem
is the same as you have described before.  That helps.
The doublewrite code is easier to run and hopefully debug.
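
A minimal sketch of that kind of check, not the actual test code.
It assumes 4096-byte pages and that every file block starts at a
disk offset that is a multiple of 256, so an unwritten page shows
the fill pattern starting at 0.  The names disk_pattern and
classify_pages are made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size; adjust for the platform


def disk_pattern(start, length):
    """Fill pattern val = pos % 256 for disk offsets [start, start + length)."""
    return bytes((start + i) % 256 for i in range(length))


def classify_pages(actual, expected, page_size=PAGE_SIZE):
    """Label each page 'ok', 'unwritten' (fill pattern still there) or 'foreign'."""
    # Assumption: pages begin at disk offsets that are multiples of 256,
    # so the leftover fill pattern always starts at value 0.
    stale = disk_pattern(0, page_size)
    labels = []
    for off in range(0, len(expected), page_size):
        got = actual[off:off + page_size]
        want = expected[off:off + page_size]
        if got == want:
            labels.append("ok")
        elif got == stale[:len(got)]:
            labels.append("unwritten")
        else:
            labels.append("foreign")  # neither our data nor the fill pattern
    return labels
```

A page labelled 'unwritten' points at a write that never reached
the disk; 'foreign' would point at data from some other writer.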

Craig

> 
> Now if you pattern up a partition with zeros or some known pattern, make
> a file system, run your test with no removes or truncates, and you still
> see data from another client, that is not a stale data problem.
> That would indicate a page management problem (usually a locking
> problem); the size of the corruption is usually limited to a page size
> in those cases.
> If I remember right you were seeing problems of 1 or more pages in
> length?
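
One way to answer the length question is to group the damaged pages
into runs.  This is only a sketch: it assumes 4096-byte pages and
that the expected contents of the file are known, and the name
damaged_page_runs is hypothetical.

```python
def damaged_page_runs(actual, expected, page_size=4096):
    """Return (first_page, page_count) for each run of consecutive bad pages."""
    runs = []
    n_pages = (len(expected) + page_size - 1) // page_size
    run_start = None
    for page in range(n_pages):
        lo, hi = page * page_size, (page + 1) * page_size
        bad = actual[lo:hi] != expected[lo:hi]
        if bad and run_start is None:
            run_start = page  # a new damaged run begins here
        elif not bad and run_start is not None:
            runs.append((run_start, page - run_start))
            run_start = None
    if run_start is not None:
        # the file ends inside a damaged run
        runs.append((run_start, n_pages - run_start))
    return runs
```

Runs of exactly one page would fit the page-management case above;
longer runs suggest something else is going on.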

> 
> I wish I had something for you to try, my latest go around on this bug
> is not looking promising.
> I thought it might have something to do with i_sem not being held 
> for writes, but I'm still seeing the problem even with the i_sem.
> I need to put an assert in generic_file_write just to make sure
> i_sem is coming in locked.
> 
> 
> > 
> > Unless there is something really wrong in caching, the corruption
> > must be coming from another client process.  In one case, process
> > 3 showed corruption in file 8.  The corruption in file 8 matches
> > data from file 14 (files are written in order).  Processes
> > are all started at the same time, but they do get out of sync
> > because of IO and other load issues.  
> > 
> > It seems like buffers are overwritten, or if the IO is async
> > that a buffer is being reused while it is actually still in
> > use.
> > 
> > Craig
> > 
> > 