We were not able to complete all the tests we wanted today, but some
critical information on how to reproduce the problem was uncovered, so
here are some further clarifications.
1. The bad blocks were never written, as opposed to being zeroed; we have
been using new data patterns on each pass and can see that the bad blocks
are from the previous pass when this corruption happens.
2. The test files are created by copying a reference file with a "cp"
command from a shell script. This probably eliminates direct I/O
(O_DIRECT) issues from consideration.
3. A key requirement we learned today is that multiple writers must be
used to reproduce the problem.
If only a single write process is used, no corruption is found when we
check the files.
When two or more (we typically use 5) are creating files in the "md0"
partition at the same time, the problem appears.
In practice, we have a script that creates directories and then
copies/creates a fixed number of files into each. It is invoked with a
directory name and the name of a reference file; 5 copies of the script
are executed simultaneously with different target directory names.
If only a single copy of this writer script is executed, we do not see
corruption.
4. We have confirmed that while the first chunk is never corrupted, all
other chunks may show corruption.
For a 516K file and a 128K chunk size, the first 128K is never affected
and corruption may be seen anywhere else in the file (chunk 2, 3, or 4).
(We are using 512K + 4K to make sure that different file alignments
relative to file system chunk boundaries are tested.)
5. We have switched to running all tests on a fairly stock 2.6.11 kernel.
We do add some of the Fedora Core patches that affect the RPM build
environment and device driver updates. Any FC patch that affected core
systems (execshield, 4K stacks, net/disk dump, etc.) is not applied; we
have learned to be very conservative about this, having had things like
Java broken by their changes.
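For anyone who wants to repeat the check described in points 1 and 4, here
is a minimal Python sketch of one way to do it. It is not the script we
actually use. The function name classify_chunks and the "stale"/"other"
labels are hypothetical, the 128K chunk size is taken from point 4, and
chunks are counted from the start of the file, which is an assumption
(it says nothing about where the file sits on the md device). Note that a
516K file splits into four full 128K chunks plus a 4K tail, which this
sketch reports as chunk 5.

```python
CHUNK = 128 * 1024  # chunk size from point 4; adjust for other setups


def classify_chunks(data, reference, previous=None, chunk=CHUNK):
    """Return (chunk_number, status) for each chunk that differs from reference.

    chunk_number is 1-based and counted from the start of the file.
    status is "stale" when the bad chunk equals the same byte range of the
    previous pass's reference data (i.e. it was never rewritten), and
    "other" for any other mismatch (zeroed, garbage, truncated, ...).
    """
    bad = []
    length = max(len(data), len(reference))
    for number, offset in enumerate(range(0, length, chunk), start=1):
        got = data[offset:offset + chunk]
        want = reference[offset:offset + chunk]
        if got == want:
            continue
        stale = previous is not None and got == previous[offset:offset + chunk]
        bad.append((number, "stale" if stale else "other"))
    return bad


def check_file(path, reference_path, previous_path=None):
    """Read the files whole (they are only ~516K) and classify bad chunks."""
    with open(path, "rb") as f:
        data = f.read()
    with open(reference_path, "rb") as f:
        reference = f.read()
    previous = None
    if previous_path is not None:
        with open(previous_path, "rb") as f:
            previous = f.read()
    return classify_chunks(data, reference, previous)
```

Calling check_file() on each test file, with the current and previous
pass's reference files, gives an empty list for a clean file; a "stale"
entry corresponds to the never-written blocks described in point 1.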
Tomorrow (I hope) we will address more chunk sizes and minimum
partition sizes.
Also, we will check behavior of the JFS/RAID0 file system combination.
As always, any ideas/questions are welcome.
Jim Foris