
Re: Data corruption with xfs+nfs+lvm

To: Nathan Scott <nathans@xxxxxxx>
Subject: Re: Data corruption with xfs+nfs+lvm
From: Craig Tierney <ctierney@xxxxxxxx>
Date: Mon, 09 Feb 2004 13:53:49 -0700
Cc: cattelan@xxxxxxx, linux-xfs@xxxxxxxxxxx
In-reply-to: <1075482631.3866.9.camel@xxxxxxxxxxxxxxxxxx>
References: <1075423747.3859.280.camel@xxxxxxxxxxxxxxxxxx> <20040130024343.GC1062@frodo> <1075482631.3866.9.camel@xxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Fri, 2004-01-30 at 10:10, Craig Tierney wrote:
> On Thu, 2004-01-29 at 19:43, Nathan Scott wrote:
> > On Thu, Jan 29, 2004 at 05:49:07PM -0700, Craig Tierney wrote:
> > > I have just discovered that I am having problems with data corruption
> > > on my NFS servers and XFS.  It happens in several different cases, but
> > > all under load.  Here are the cases that I have gotten data corruption
> > > for reads and writes.  Corruption happens on different servers and
> > > on different filesystems (some configured with LVM striping, some not).
> > 
> > Can you describe your test case in more detail?  In particular,
> > do you have a program/programs that demonstrates the problem?
> > That is always a huge help.  Or a list of things to run - what
> > sort of IO is being done, and what does "under load" mean in
> > your context.

I wanted to update the situation.  I patched the kernel to remove
the call to xfs_refcache_purge_some.  This improved the situation,
but I am still getting data corruption from the model.

I tried to reproduce the problem on the filesystem directly with
a serial code, but could not, even when running 24 instances of
the code.  Running the same set of cases over NFS did not
reproduce the problem either.

However, when running the MPI model (several cases simultaneously)
I can get differences in the output files.  They should be the same.
Of course it is the 64 processor case that causes the problem!
It could be something else besides the filesystem, but I doubt it
at this point.  Data are only written from the rank zero process
(no parallel IO).

I still had some other problems.  I had a not-insignificant number
of models crash for no apparent reason.  I cannot blame that on xfs
though.  I did have 2 cases where the model got stuck writing data
to the filesystem.

Things got better without the xfs_refcache_purge_some call, but I
still had differing output and some wedged disk cases.

Load would be about 20 concurrent processes doing large reads
or writes to the filesystem.
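In case it helps anyone trying to reproduce this, the load pattern above can be sketched as a small checksum-verifying stress script.  This is only an illustration, not the actual test I ran: the worker count, file sizes, and paths are placeholders, and the MPI model does far more than this.  The idea is simply several concurrent processes each writing a large file, re-reading it, and comparing checksums.

```python
# Sketch of the load pattern: N concurrent processes each write a
# large file, read it back, and verify the data via checksum.
# Sizes and worker count here are illustrative placeholders.
import hashlib
import multiprocessing
import os
import tempfile

NUM_WORKERS = 4        # the real load was ~20 concurrent processes
CHUNK = 1 << 20        # 1 MiB per write
CHUNKS_PER_FILE = 8    # kept small for the sketch; scale up on a real run

def worker(dirname: str, idx: int) -> bool:
    """Write pseudo-random data, fsync, re-read, and compare checksums."""
    path = os.path.join(dirname, "stress.%d" % idx)
    h_write = hashlib.md5()
    with open(path, "wb") as f:
        for _ in range(CHUNKS_PER_FILE):
            block = os.urandom(CHUNK)
            h_write.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    h_read = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            h_read.update(block)
    os.unlink(path)
    return h_write.hexdigest() == h_read.hexdigest()

def run_stress(dirname: str) -> bool:
    """Run all workers concurrently; True if every checksum matched."""
    with multiprocessing.Pool(NUM_WORKERS) as pool:
        results = pool.starmap(worker,
                               [(dirname, i) for i in range(NUM_WORKERS)])
    return all(results)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        if run_stress(d):
            print("all checksums match")
        else:
            print("CORRUPTION DETECTED")
```

Pointing the directory at an NFS mount of the affected filesystem (instead of a local tempdir) would be the interesting case, since serial access did not trigger the problem for me.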

Are there any significant differences in the XFS code going into
2.4.25 that I should try (easy to do)?

Should I try playing with network settings?  Change from gigE
to FastEthernet or reduce the number of NFS threads (currently
set at 256)?

Thanks,
Craig



