
Re: Data corruption with xfs+nfs+lvm

To: Nathan Scott <nathans@xxxxxxx>
Subject: Re: Data corruption with xfs+nfs+lvm
From: Craig Tierney <ctierney@xxxxxxxx>
Date: 30 Jan 2004 10:10:31 -0700
Cc: cattelan@xxxxxxx, linux-xfs@xxxxxxxxxxx
In-reply-to: <20040130024343.GC1062@frodo>
Organization:
References: <1075423747.3859.280.camel@hpti7.fsl.noaa.gov> <20040130024343.GC1062@frodo>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Thu, 2004-01-29 at 19:43, Nathan Scott wrote:
> On Thu, Jan 29, 2004 at 05:49:07PM -0700, Craig Tierney wrote:
> > I have just discovered that I am having problems with data corruption
> > on my NFS servers and XFS.  It happens in several different cases, but
> > all under load.  Here are the cases that I have gotten data corruption
> > for reads and writes.  Corruption happens on different servers and
> > on different filesystems (some configured with LVM striping, some not).
> 
> Can you describe your test case in more detail?  In particular,
> do you have a program/programs that demonstrates the problem?
> That is always a huge help.  Or a list of things to run - what
> sort of IO is being done, and what does "under load" mean in
> your context.

I will gather more information and get back.  The job is several
different codes that massage the data.  One piece regrids the data
for the model, then the model reads some data and writes more out,
and the last post-processes the data into new grids.  We have seen
failures at every step.  Not that every step fails, but we can
trigger issues at every step.  We have some cases where we think
we have read corruption because re-running the same case works
the 2nd (or 3rd) time.

> 
> > We tested the new linux-2.4.21 kernel on the dual P3.  
> 
> "new" and "2.4.21" don't really go together. :)

True.  The last patch I found for xfs was 1.3.1 and applied
to 2.4.21, so that is why I used it.  If 2.4.24 (or .25) is released
with full xfs support I will try that.  If there is a pre-release
I should try then I will do that as well.

> 
> > The file writes are from single processes.  Some codes are MPI, but
> > all the IO, reads and writes, go through the rank 0 node.  We can
> > reproduce the corruption relatively easily when 16 processes are active.
> 
> Can you give me a recipe so that I can reproduce it locally?
> Does NFS have to be in the picture for this to fail?  And is
> it reproducible without LVM too?

I hadn't thought of testing without NFS due to the setup.  However,
I do know of a way to test it.  I will get one portion going directly
on the filesystem.
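
A direct-on-filesystem check along these lines could be sketched roughly
as below.  This is only an illustration, not the actual job: the names
make_block, write_and_verify and run_test are hypothetical, and the
worker count of 16 is just taken from the load level mentioned above.
Each worker writes a file of deterministic data, syncs it, reads it
back, and compares checksums.

```python
import hashlib
import os
import tempfile
from multiprocessing import Pool


def make_block(seed, size):
    """Build a deterministic pseudo-random block so it can be regenerated later."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def write_and_verify(args):
    """Write one file, fsync it, read it back, and compare SHA-256 digests."""
    directory, worker, blocks, block_size = args
    path = os.path.join(directory, f"verify_{worker}.dat")

    expected = hashlib.sha256()
    with open(path, "wb") as f:
        for i in range(blocks):
            block = make_block(f"{worker}-{i}", block_size)
            expected.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data to the server/disk before reading

    actual = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            actual.update(chunk)

    os.remove(path)
    return worker, expected.hexdigest() == actual.hexdigest()


def run_test(directory, workers=16, blocks=64, block_size=65536):
    """Run all workers concurrently; return the ids of workers that saw a mismatch."""
    jobs = [(directory, w, blocks, block_size) for w in range(workers)]
    with Pool(workers) as pool:
        results = pool.map(write_and_verify, jobs)
    return [w for w, ok in results if not ok]
```

Calling run_test() on a directory on the XFS filesystem itself, and then
on the same filesystem mounted over NFS, would give a rough comparison.
Note that an immediate read-back on the same machine may be served from
the page cache, so reading the files from a second client (or after a
remount) would be a stronger check.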


> 
> Russell, does this sound like that NFS corruption that you
> were looking into awhile back?
> 
> cheers.

Thanks,
Craig

