On Mon, 2004-05-03 at 14:03, Russell Cattelan wrote:
> Look at this bug and see if it matches your situation.
I cannot say for certain that it does; I am not sure.
My writes don't jump around in the file; it is just
a series of reads, then writes, and then another series
of reads, then writes again. However, I downloaded both
test codes and I am trying them out.
> I've been taking a bunch of stabs at this bug trying to
> figure out the problem.
> Every theory of the problem so far has proved to be incorrect.
> My latest thought is a race someplace in generic_file_write, since xfs
> was calling it with the i_sem held.
> I'm testing some code now but either I'm missing something or adding
> the lock has not helped the problem.
> I've mostly ruled out the problem being per-page since, as the test in
> 198 shows, the corruption is the size of a given write: in the case of
> doublewrite, 16k.
> What appears to be happening is that, as the file is grown out of order
> in relation to the writes that are happening, every once in a while a
> 16k write is simply forgotten about.
> In other words, write 16k at 16k, then write 16k at 0k.
> So grow to 32k starting at 16k, then go back and fill in
> the 0-16k block, and sometimes the 0-16k block is simply not written
> to disk.
> I'm going to keep looking at this i_sem not locked theory a bit longer
> before I give up on yet another theory.
> If you have any observations or theories, let me know; they might be helpful.
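For reference, the out-of-order pattern described above (write 16k at 16k, then 16k at 0k) can be sketched roughly as follows. This is not the doublewrite test from the bug report, only a minimal illustration; the helper names `pattern`, `write_out_of_order`, and `find_bad_chunks` are made up for this sketch, and the one-byte marker per chunk is an assumption chosen so that a forgotten write (zeros or stale data) is easy to spot.

```python
import os

CHUNK = 16 * 1024  # 16k, the write size mentioned for doublewrite


def pattern(index, size=CHUNK):
    """Return `size` bytes identifying chunk `index` (hypothetical marker).

    The marker byte is never zero, so a hole left by a lost write stands out.
    """
    return bytes([index % 251 + 1]) * size


def write_out_of_order(path, pairs=4, chunk=CHUNK):
    """For each pair of chunks, write the upper chunk first, then the lower."""
    with open(path, "wb", buffering=0) as f:
        for p in range(pairs):
            lo, hi = 2 * p, 2 * p + 1
            f.seek(hi * chunk)
            f.write(pattern(hi, chunk))  # grows the file past the gap
            f.seek(lo * chunk)
            f.write(pattern(lo, chunk))  # goes back and fills the gap
        os.fsync(f.fileno())


def find_bad_chunks(path, pairs=4, chunk=CHUNK):
    """Return the indices of chunks that do not hold what was written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(2 * pairs):
            if f.read(chunk) != pattern(i, chunk):
                bad.append(i)
    return bad
```

Reading the file back on the same machine may only show the page cache, so a check like `find_bad_chunks` is probably more telling after a remount or from another client.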
I need to write some code to verify, but it might be that the corrupted
data is actual data from another place in the file. Or, it is data
from a different file. In one place the corruption looked like a
different set of values. It was supposed to be velocities (+/- 5.0) but
then ended up looking like temperatures (between 250.0 and 300.0).
So the question is, if this is indeed the case, are those values from
this file, or another one?
I am not sure if this could be caused by 'forgetting' to write a
particular chunk as you described above. I will test that as well.
On a new filesystem, I should not get so lucky as to have values that
look like temperatures (or whatever they were supposed to be).
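A minimal sketch of the kind of check meant here, assuming the corrupted region's offset and length are already known and that misplaced data would land on a 4096-byte boundary (both are assumptions, as are the function names `find_block` and `locate_corruption_source`). It takes the corrupted bytes and looks for an identical block elsewhere in the same file or in other candidate files.

```python
def find_block(block, path, align=4096):
    """Return the aligned offsets in `path` whose contents equal `block`."""
    if not block:
        return []
    hits = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            data = f.read(len(block))
            if len(data) < len(block):
                break  # ran off the end of the file
            if data == block:
                hits.append(offset)
            offset += align
    return hits


def locate_corruption_source(bad_path, bad_offset, length, candidates, align=4096):
    """Map each candidate file to the offsets where the corrupted bytes also appear."""
    with open(bad_path, "rb") as f:
        f.seek(bad_offset)
        block = f.read(length)
    result = {}
    for path in candidates:
        hits = find_block(block, path, align)
        if path == bad_path:
            # The corrupted region trivially matches itself; ignore that hit.
            hits = [h for h in hits if h != bad_offset]
        if hits:
            result[path] = hits
    return result
```

An empty result would not prove much (the source data may have been overwritten since), but a hit in the same file or in a neighbouring file would answer the question above directly.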
> > Thanks for the details.
> > I am trying to debug a problem with file corruption writing to my xfs
> > filesystem (over nfs) when the server is under heavy load (16+ clients
> > writing simultaneously to different directories). It does happen with
> > different versions under the 2.4 kernel. It doesn't happen when ext2 or
> > jfs is used as the underlying filesystem. It does happen more often
> > when faster servers are used, and the corruption is always page aligned
> > (starts at ADDR%4096==0, ends at ADDR%4096==4095).
> > Since the corruption is always page aligned, and it happens more often
> > on faster servers, I suspect there is some race condition or missed lock
> > where pages are selected for use.
> > I didn't completely understand the difference between v/kmalloc, but
> > if there is an issue with vmalloc timings, I figured I would remove all
> > vmallocs (because it is easy) to see if it changed anything.
> > Thanks,
> > Craig
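
The page-alignment test quoted above (starts at ADDR%4096==0, ends at ADDR%4096==4095) can be sketched as below. This is only an illustration that works on two byte strings held in memory, an expected copy and the copy read back; the names `mismatch_ranges` and `is_page_aligned` are made up for the sketch.

```python
PAGE = 4096  # page size assumed in the alignment test


def mismatch_ranges(expected, actual):
    """Return inclusive (start, end) byte ranges where the two buffers differ."""
    ranges = []
    start = None
    n = min(len(expected), len(actual))
    for i in range(n):
        if expected[i] != actual[i]:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i - 1))
            start = None
    if start is not None:
        ranges.append((start, n - 1))
    return ranges


def is_page_aligned(start, end, page=PAGE):
    """True if the range starts on a page boundary and ends on a page's last byte."""
    return start % page == 0 and end % page == page - 1
```

One caveat: bytes inside a corrupted page can match the expected data by chance, which splits one page into several shorter runs, so neighbouring ranges may need to be merged before the alignment check says anything useful.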