I am in the process of building a couple of file servers for various purposes,
and over the last week I have been running quite a few tests to determine
whether I trust the hardware/software combination enough to put it into
production.
One of the tests was an attempt to simulate many users copying large
directories on an XFS volume. To do this I launched many background jobs, each
copying a 4G directory to another directory on the same XFS volume, e.g.:
#>cp -r 001 002&
#>cp -r 001 003&
#>cp -r 001 004&
.....
.....
#>cp -r 001 019&
#>cp -r 001 020&
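For reference, the jobs can be generated with a shell loop roughly equivalent
to the commands above (a sketch only; the directory names and job count are
whatever the test calls for, and it assumes GNU seq for the zero-padded
numbering):

    # launch the background copies of the 4G source directory 001
    for i in $(seq -w 2 20); do
        cp -r 001 0$i &          # targets 002, 003, ... 020
    done
    wait                         # block until every background copy has finished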
Everything starts fine, but less than a minute into the test several hundred
errors like the following are displayed:
cp: cannot stat file `/mnt/raid5/filename`: Input/output error
Once this has happened the XFS volume effectively disappears. By this I mean
that it is still mounted, but no files or directories are visible with ls, and
any other file activity results in an Input/output error. Once I unmount and
mount the volume again, the data is visible again up to the point where the
copy failed.
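(The recovery step is just a plain unmount/remount of the volume; a minimal
example, assuming the volume has an /etc/fstab entry for /mnt/raid5:)

    umount /mnt/raid5
    mount /mnt/raid5       # assumes an /etc/fstab entry for the mount point
    ls /mnt/raid5          # the data is visible again up to the failure point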
In /var/log/messages, around the same time as the copy test, I get entries
like:
Sep 9 05:13:46 ATLAS kernel: 02:86: rw=0, want=156092516, limit=360
Sep 9 05:13:46 ATLAS kernel: attempt to access beyond end of device
The problem is reproducible on XFS volumes on a 2-disk (IDE) raid0 (SW raid)
partition and on a 5-disk (IDE) raid5 (SW raid) partition. However, there is
no problem with the copy test using ext2 volumes on the same partitions.
The copy test also passes when run on a non-raid drive.
I am using kernel 2.4.9 and the latest relevant XFS patch from the patches
directory on the XFS FTP site:
patch-2.4.9-xfs-2001-08-19
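(For completeness, the patch goes on in the standard way against a clean 2.4.9
tree; the paths below are only examples:)

    cd /usr/src/linux                           # clean 2.4.9 source tree (path is an example)
    patch -p1 < patch-2.4.9-xfs-2001-08-19      # assumes the patch file is in the current directory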
The thing that really puzzles me is that the above directory copy test runs
fine when I have only 10 background copy jobs running at a time; as soon as I
have 20 background copy jobs running, the problem occurs. The system passes
both the bonnie++ and mongo.pl tests/benchmarks.
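(For reference, a bonnie++ run against the XFS volume looks something like the
following; the invocation is only an example, not necessarily the exact
options used, and the mongo.pl arguments depend on the version of the XFS test
tools:)

    bonnie++ -d /mnt/raid5 -s 2048 -u root      # example only: test dir, file size in MB, run-as user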
So from the results I have at the moment, it would seem that either XFS is
stomping on the raid code or the raid code is stomping on XFS. Should I
cross-post this to the raid list as well?
P.S. I have just noticed in the mailing list archive a note about fixing a
problem that caused mongo.pl to hang. Although my systems don't hang in
mongo, do people think I'm seeing the same problem with a different symptom?
Another issue, which I think is unrelated: when the 2.4.9-xfs kernel
identifies the drives during bootup, I get IRQ probe failed errors.
hda: IC35L040AVER07-0, ATA DISK drive
hda: IRQ probe failed (0xfffffff8)
hdb: IC35L040AVER07-0, ATA DISK drive
hdb: IRQ probe failed (0xfffffff8)
........the rest as normal
The errors occur when the kernel is run on an ASUS A7V133 motherboard but not
on an ASUS A7V133C, and they don't happen with a stock 2.4.9 kernel either.
Since the errors occur for the two drives on the first channel of the first
IDE controller (which is not related to the raid arrays mentioned above) and
the system still boots, I have not been worried about it. Should I be
worried?
At this stage I'm unsure what other info people would like. If anyone wants
logs, config files, more information, or more testing, please tell me.
Thanks for your time.
Adrian Head
Bytecomm P/L