Re: Questions about pagebuf code

To: Craig Tierney <ctierney@xxxxxxxx>
Subject: Re: Questions about pagebuf code
From: Russell Cattelan <cattelan@xxxxxxx>
Date: Thu, 06 May 2004 14:48:55 -0500
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <1083870335.2376.10.camel@xxxxxxxxxxxxxxxxxx>
References: <1083435856.2302.3.camel@xxxxxxxxxxxxxxxxxxxxx> <20040501194709.A23768@xxxxxxxxxxxxx> <1083446482.2302.22.camel@xxxxxxxxxxxxxxxxxxxxx> <1083614597.24397.16.camel@xxxxxxxxxxxxxxxxxxxxxx> <1083870335.2376.10.camel@xxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Thu, 2004-05-06 at 13:05 -0600, Craig Tierney wrote:
> > I've mostly ruled out the problem being per-page since, as the test
> > in 198 shows, the corruption is the size of a given write. In the
> > case of doublewrite, 16k.
> > 
> > What appears to be happening is that, as the file is grown out of
> > order in relation to the writes, every once in a while a 16k
> > write is simply forgotten about.
> > 
> > In other words write 16k at 16k then write 16k at 0k.
> > So grow to 32k starting at 16k then go back and fill in
> > the 0-16k block and sometimes the 0-16k block is simply not written
> > to disk.
> > 
> > I'm going to keep looking at this i_sem not locked theory a bit longer 
> > before I give up on yet another theory.
> > 
> > If you have any observations or theories, let me know; it might be
> > helpful.
> 
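
For reference, the write sequence described above can be sketched from
user space roughly as follows. This is only a minimal illustration, not
the actual test 198 or doublewrite code: the helper names
(chunk_data, write_out_of_order, missing_chunks) and the per-chunk fill
byte are made up for the example, and the 16k chunk size is taken from
the description.

```python
import os
import tempfile

CHUNK = 16 * 1024  # write size taken from the description above


def chunk_data(index):
    """Return a CHUNK-sized payload whose fill byte identifies the chunk."""
    return bytes([index + 1]) * CHUNK


def write_out_of_order(path, order):
    """Write each chunk listed in 'order' at offset index * CHUNK, then fsync."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        for index in order:
            os.pwrite(fd, chunk_data(index), index * CHUNK)
        os.fsync(fd)
    finally:
        os.close(fd)


def missing_chunks(path, count):
    """Return indices of chunks whose contents differ from what was written."""
    bad = []
    with open(path, "rb") as f:
        for index in range(count):
            if f.read(CHUNK) != chunk_data(index):
                bad.append(index)
    return bad


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "doublewrite.dat")
    # Grow the file to 32k starting at 16k, then go back and fill 0-16k.
    write_out_of_order(path, [1, 0])
    print("chunks not matching:", missing_chunks(path, 2))
```

Note that reading the file back right away will most likely be served
from the page cache, so to see what actually reached disk the check
would have to run after an unmount and remount (an assumption about how
one would use this, not something the sketch does by itself).
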
> It appears my corruption pattern is different (but not unrelated).
> I finally confirmed that when a file is corrupted, it is corrupted
> with data from another write.  For example, temperature gets
> overwritten with height, or velocity is overwritten with humidity.
The problem is trying to figure out whether the data is stale or really
from a different process.
If the data block was previously used and then freed, either by removing
the old file or by truncating it, and the new file then allocates that
free block but never writes to it, the data is stale.

Now if you pattern up a partition with zeros or some known pattern, make
a file system, run your test with no removes or truncates, and you still
see data from another client, that is not a stale data problem.
That would indicate a page management problem (usually a locking
problem); the size of the corruption is usually limited to a page size
in those cases.
If I remember right, you were seeing problems of 1 or more pages in
length?
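
To make that distinction concrete, here is a rough sketch of how a
checker could sort out what it finds. It assumes things that are not
part of your test as far as I know: every file is filled with a single
tag byte equal to its file number, the partition was patterned with one
known background byte before mkfs, and the page size is 4096. The
function names are made up for the example.

```python
PAGE = 4096  # assumed page size


def find_bad_runs(data, expected):
    """Return (offset, length, value) for each run of one wrong byte value."""
    runs = []
    start = None
    for pos, value in enumerate(data):
        if value != expected:
            # Start a new run at the first bad byte or when the bad value changes.
            if start is None or value != data[start]:
                if start is not None:
                    runs.append((start, pos - start, data[start]))
                start = pos
        elif start is not None:
            runs.append((start, pos - start, data[start]))
            start = None
    if start is not None:
        runs.append((start, len(data) - start, data[start]))
    return runs


def classify(run, background, known_tags):
    """Label a bad run and report its length in pages.

    'background' means the pre-mkfs pattern showed through (block was
    allocated but never written), 'foreign' means the bytes carry the tag
    of another file in the same run, 'unknown' is anything else.
    """
    offset, length, value = run
    if value == background:
        kind = "background"
    elif value in known_tags:
        kind = "foreign"
    else:
        kind = "unknown"
    return kind, offset, length, length / PAGE


if __name__ == "__main__":
    # File 8 with four pages of file 14's data in the middle.
    data = bytes([8]) * PAGE + bytes([14]) * (4 * PAGE) + bytes([8]) * PAGE
    for run in find_bad_runs(data, expected=8):
        print(classify(run, background=0, known_tags={14}))
```

A "foreign" run on a patterned partition with no removes or truncates
would point away from stale data, and the page count of each run tells
you whether the damage is page sized or write sized.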

I wish I had something for you to try; my latest go-around on this bug
is not looking promising.
I thought it might have something to do with i_sem not being held
for writes, but I'm still seeing the problem even with i_sem held.
I need to put an assert in generic_file_write just to make sure
i_sem is coming in locked.


> 
> Unless there is something really wrong in caching, the corruption
> must be coming from another client process.  In one case, process
> 3 showed corruption in file 8.  The corruption in file 8 matches
> data from file 14 (files are written in order).  Processes
> are all started at the same time, but they do get out of sync
> because of IO and other load issues.  
> 
> It seems like buffers are being overwritten, or, if the IO is async,
> that a buffer is being reused while it is actually still in
> use.
> 
> Craig
> 
> 
-- 
Russell Cattelan <cattelan@xxxxxxx>
