I have just discovered data corruption on my NFS servers running XFS.
It happens in several different cases, but always under load. The
corruption affects both reads and writes, and it occurs on different
servers and on different filesystems (some configured with LVM
striping, some not). I have seen it with the following two setups:
  linux-2.4.20, xfs 1.2, lvm 1.0.5, qlogic 6.01.0
    - Trond's NFS patches available at the time applied to the kernel
  linux-2.4.21, xfs 1.3.1, lvm 1.0.8, qlogic 6.06.10
    - stock NFS (no patches applied)
NFS is configured with vers=3,udp,async,rsize=8192,wsize=8192,nolock,hard.
We tested a dual P3 and a dual Xeon with the linux-2.4.20 kernel setup.
We were able to generate corruption on both servers.
We tested the new linux-2.4.21 kernel on the dual P3.
All servers are using the tg3 driver (broadcom gigE). All clients
are using the e100 driver (Intel's eepro driver).
The file writes are from single processes. Some codes are MPI, but
all the IO, reads and writes, go through the rank 0 node. We can
reproduce the corruption relatively easily when 16 processes are active.
We do not see the corruption on nfs+xfs when running just 1 case at a
time.
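
For anyone who wants to try to reproduce this, here is a minimal sketch
of the kind of load involved. It is not one of our actual codes: the
names write_and_verify and run_stress, the file naming, and the
checksum approach are all made up for illustration, and it uses threads
as a simple stand-in for the 16 separate processes. Each writer streams
a known pattern to its own file, syncs it, reads it back, and compares
checksums.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8192  # same as the rsize/wsize above; an arbitrary choice here


def write_and_verify(path, size, seed):
    """Write `size` bytes of a known pattern to `path`, then read it back.

    Returns True if the data read back hashes the same as the data written.
    """
    written = hashlib.sha256()
    remaining = size
    counter = 0
    with open(path, "wb") as f:
        while remaining > 0:
            # Deterministic pattern derived from the seed and block counter.
            pattern = hashlib.sha256(f"{seed}:{counter}".encode()).digest()
            block = (pattern * (CHUNK // len(pattern)))[:min(CHUNK, remaining)]
            f.write(block)
            written.update(block)
            remaining -= len(block)
            counter += 1
        f.flush()
        os.fsync(f.fileno())

    readback = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            readback.update(block)
    return written.digest() == readback.digest()


def run_stress(directory, writers=16, size=1 << 20):
    """Run `writers` concurrent write/verify jobs; return the paths that failed."""
    paths = [os.path.join(directory, f"stress_{i}.bin") for i in range(writers)]
    with ThreadPoolExecutor(max_workers=writers) as pool:
        results = list(pool.map(write_and_verify, paths,
                                [size] * writers, range(writers)))
    return [p for p, ok in zip(paths, results) if not ok]
```

Calling run_stress("/mnt/nfs/scratch") (a hypothetical mount point) on
an NFS client returns the list of files whose read-back did not match.
A read-back on the same client may be served from that client's cache,
so checking the files from a second client or directly on the server is
probably the more telling test.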
Running linux-2.4.20 and lvm 1.0.5 with ext2/3 does not show the
problem, although it is a good deal slower. We ran the suite of test
cases several times on that setup and could not trigger any corruption.
The structure of the corruption varies. In one case the binary files
differed over a span of 49k out of 102MB. In another case the
differences were spread over a much larger region. The corrupted data
varies from case to case, but the values are non-zero.
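
To make the shape of the corruption easier to report, something like
the following sketch can list the byte ranges where a good copy and a
corrupted copy differ. The function name diff_spans and the block size
are illustrative choices, not part of any tool we use.

```python
def diff_spans(path_a, path_b, block=65536):
    """Return a list of (start, end) byte ranges where two files differ.

    `end` is exclusive. If one file is shorter, the missing tail counts
    as a difference.
    """
    spans = []
    start = None  # start offset of the span currently open, if any
    offset = 0    # file offset of the first byte of the current block
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block)
            b = fb.read(block)
            if not a and not b:
                break
            if a == b:
                # Fast path: an identical block closes any open span.
                if start is not None:
                    spans.append((start, offset))
                    start = None
                offset += len(a)
                continue
            n = max(len(a), len(b))
            for i in range(n):
                same = i < len(a) and i < len(b) and a[i] == b[i]
                if not same and start is None:
                    start = offset + i
                elif same and start is not None:
                    spans.append((start, offset + i))
                    start = None
            offset += n
    if start is not None:
        spans.append((start, offset))
    return spans
```

Each (start, end) pair gives one contiguous damaged region, so a single
49k span and many scattered spans are easy to tell apart, and the
offsets can be checked against block or stripe boundaries.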
If anyone has suggestions on what to try or test to determine what
is wrong, I would love to hear them.
Thanks,
Craig