
Re: Questions about pagebuf code

To: Russell Cattelan <cattelan@xxxxxxx>
Subject: Re: Questions about pagebuf code
From: Craig Tierney <ctierney@xxxxxxxx>
Date: Mon, 03 May 2004 16:01:25 -0600
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <1083614597.24397.16.camel@xxxxxxxxxxxxxxxxxxxxxx>
References: <1083435856.2302.3.camel@xxxxxxxxxxxxxxxxxxxxx> <20040501194709.A23768@xxxxxxxxxxxxx> <1083446482.2302.22.camel@xxxxxxxxxxxxxxxxxxxxx> <1083614597.24397.16.camel@xxxxxxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Mon, 2004-05-03 at 14:03, Russell Cattelan wrote:
> Look at this bug and see if it matches your situation.
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=198

I cannot say for certain whether it is.
My writes don't jump around in the file; it is just
a series of reads, then writes, and then another series
of reads and then writes again.  However, I downloaded both
test codes and I am trying them out.

> I've been taking a bunch of stabs at this bug trying to
> figure out the problem. 
> Everything so far has proved to be an incorrect theory of the problem.
> My latest thought is a race someplace in generic_file_write since xfs
> was calling it with the i_sem held.
> I'm testing some code now but either I'm missing something or adding
> the lock has not helped the problem.
> I've mostly ruled out the problem being per-page since, as the test in
> 198 shows, the corruption is the size of a given write. In the case of
> doublewrite, 16k.
> What appears to be happening is that, as the file is grown out of order in
> relation to the writes that are happening, every once in a while a 16k
> write is simply forgotten about.
> In other words write 16k at 16k then write 16k at 0k.
> So grow to 32k starting at 16k then go back and fill in
> the 0-16k block and sometimes the 0-16k block is simply not written
> to disk.
> I'm going to keep looking at this i_sem not locked theory a bit longer 
> before I give up on yet another theory.
> If you have any observations or theories let me know; it might be helpful.

I need to write some code to verify, but it might be that the corrupted
data is actual data from another place in the file.  Or, it is data
from a different file.  In one place the corruption looked like a
different set of values.  It was supposed to be velocities (+/- 5.0) but
then ended up looking like temperatures (between 250.0 and 300.0).
So the question is, if this is indeed the case, are those values from 
this file, or another one?
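
A minimal sketch of that check, assuming a known-good copy of the file is
available to compare against. The function name is hypothetical, and the
4096-byte page size is taken from the alignment seen so far. It reports each
page that differs and, where possible, the offset in the good copy that the
bad bytes appear to have come from:

```python
PAGE = 4096  # corruption observed so far is aligned to 4096-byte pages


def find_misplaced_pages(good: bytes, bad: bytes, page: int = PAGE):
    """Return (bad_offset, source_offset_or_None) for each differing page.

    source_offset is the page-aligned offset in `good` whose contents match
    the corrupted page, or None if the bytes match no page of `good`
    (for example, data from a different file).
    """
    # Index every page of the known-good data by its contents,
    # keeping the first offset at which each distinct page appears.
    index = {}
    for off in range(0, len(good), page):
        index.setdefault(good[off:off + page], off)

    results = []
    for off in range(0, min(len(good), len(bad)), page):
        chunk = bad[off:off + page]
        if chunk != good[off:off + page]:
            results.append((off, index.get(chunk)))
    return results
```

A source offset of None would point toward data from outside this file,
although it only rules out page-aligned copies from within it.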

I am not sure if this could be caused by 'forgetting' to write a
particular chunk as you described above.  I will test that as well.
On a new filesystem, I should not get so lucky as to have values that
look like temperatures (or whatever it was to be).
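
A rough sketch of the out-of-order pattern described above (write 16k at
16k, then go back and write 16k at 0k). This is not the doublewrite test
from bug 198; the function names and fill bytes are made up for
illustration, and it only checks what reads back on the same host:

```python
import os
import tempfile

CHUNK = 16 * 1024  # write size mentioned for the doublewrite case


def out_of_order_write(path, chunk=CHUNK):
    """Write the second chunk first, then the first.

    Returns the offsets of chunks that do not read back as written.
    """
    first = b"A" * chunk
    second = b"B" * chunk
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.pwrite(fd, second, chunk)  # grows the file to 32k, starting at 16k
        os.pwrite(fd, first, 0)       # go back and fill in the 0-16k block
        os.fsync(fd)
        data = os.pread(fd, 2 * chunk, 0)
    finally:
        os.close(fd)
    expected = first + second
    return [off for off in range(0, 2 * chunk, chunk)
            if data[off:off + chunk] != expected[off:off + chunk]]


def run_once(directory=None):
    """Run the pattern once in a scratch directory under `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        return out_of_order_write(os.path.join(tmp, "ooo.dat"))
```

Reading back right after the write will most likely be served from the page
cache, so to see what actually reached disk the file would need to be read
after a remount or from a different NFS client.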


> > Thanks for the details.  
> > 
> > I am trying to debug a problem with file corruption writing to my xfs
> > filesystem (over nfs) when the server is under heavy load (16+ clients
> > writing simultaneously to different directories).  It does happen with
> > different versions under the 2.4 kernel.  It doesn't happen when ext2 or
> > jfs is used as the underlying filesystem.  It does happen more often
> > when faster servers are used, and the corruption is always page aligned
> > (starts at ADDR%4096==0, ends at ADDR%4096==4095).
> > 
> > Since the corruption is always page aligned, and it happens more often
> > on faster servers, I suspect there is some race condition or missed lock
> > where pages are selected for use.  
> > 
> > I didn't completely understand the difference between vmalloc and kmalloc,
> > but if there is an issue with vmalloc timings, I figured I would remove all
> > vmallocs (because it is easy) to see if it changed anything.
> > 
> > Thanks,
> > Craig
> > 
> > 
> > 
