Hi Adrian
Adrian Head wrote:
>
> I am in the process of building a couple of file servers for various purposes
> and over the last week have been running quite a few tests in an attempt to
> determine if I trust the hardware/software combination enough for it to be
> put into production.
>
> One of the tests I was doing was trying to simulate many users copying large
> directories across an XFS volume. To do this I was generating many
> background jobs copying a 4G directory to another directory on the XFS volume.
> eg.
> #>cp -r 001 002&
> #>cp -r 001 003&
> #>cp -r 001 004&
> .....
> .....
> #>cp -r 001 019&
> #>cp -r 001 020&
I did similar tests two months ago. I ran into problems as well, but
unfortunately I don't remember exactly what they were.
First question: you created a software RAID5; was the array already
synced when you started the tests?
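For reference, when I try to reproduce this here I'll kick off the
background copies with something like the loop below (only a sketch; it
assumes the source directory is called 001 as in your example and that
GNU seq is available):

#> for n in `seq -w 2 20`; do cp -r 001 0$n & done

which should be equivalent to the individual cp ... & commands above.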
> Everything would start fine but less than a minute into the test various
> hundreds of errors are displayed like:
> cp: cannot stat file `/mnt/raid5/filename`: Input/output error
>
> Once this has happened the XFS volume disappears. By this I mean that it is
> still mounted but all files and directories are no longer visible using ls.
> Any other file activity results in an Input/Output error. Once I unmount &
> mount the volume again the data is again visible up to the point where the
> copy failed.
>
> In the /var/log/messages log around the same time as the copy test I get
> entries like:
> Sep 9 05:13:46 ATLAS kernel: 02:86: rw=0, want=156092516, limit=360
> Sep 9 05:13:46 ATLAS kernel: attempt to access beyond end of device
This looks interesting. I don't know exactly what it means, but to me it
looks as if you managed to create a filesystem bigger than the RAID
volume itself. I got the very same error when I tried to restore data
with xfsrestore from DAT (xfsrestore from DLT was fine); that issue is
still open.
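If you want to rule that out, a quick check would be to compare the size
the kernel reports for the md device with the size the filesystem thinks
it has (just a sketch; the mount point and devices are the ones from your
mail):

#> cat /proc/partitions      # size of the md devices, in 1K blocks
#> xfs_info /mnt/raid5       # data blocks * bsize should not exceed the device size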
>
> The problem is reproducible on XFS volumes on a 2 disk (IDE) raid0 (SW raid)
> partition and on a 5 disk (IDE) raid5 (SW raid) partition. However, there is
> no problem with the copy test using ext2 volumes on the above partitions.
> The copy test also passes when run on a non-raid drive.
>
> I am using Kernel 2.4.9 and the relevant latest XFS patch from the patches
> directory on the XFS ftp site.
> patch-2.4.9-xfs-2001-08-19
>
> The thing that really puzzles me is that the above directory copy test runs
> fine when I only have 10 background copy jobs running at a time. As soon as
> I have 20 background copy jobs running the problem occurs. The system passes
> both bonnie++ and mongo.pl tests/benchmarks.
>
> So from the results I have at the moment it would seem that XFS is stomping
> over the raid code or the raid code is stomping over XFS. Should I cross
> post this to the raid list as well?
>
> P.S. I have just noticed on the mailing list archive a note about fixing a
> problem that caused mongo.pl to hang. Although my systems don't hang in
> mongo, do people think I'm seeing the same problem, just with a different
> symptom?
>
> Another issue, which I think is unrelated: when using the 2.4.9-xfs
> kernel, I get IRQ probe failed errors while the kernel identifies the
> drives during bootup.
> hda: IC35L040AVER07-0, ATA DISK drive
> hda: IRQ probe failed (0xfffffff8)
> hdb: IC35L040AVER07-0, ATA DISK drive
> hdb: IRQ probe failed (0xfffffff8)
> ........the rest as normal
>
> The errors occur when the kernel is run on an ASUS A7V133 motherboard but not
> on an ASUS A7V133C.  The errors don't happen with a native 2.4.9 kernel
> either. Since the errors occur for the 2 drives on the 1st channel of the
> 1st IDE controller (which is not related to the raid arrays mentioned above)
> and the system still boots - I have not been worried about it. Should I be
> worried?
>
> At this stage I'm unsure what other info people would like. If anyone wants
> logs, config files, more information or more testing please tell me.
>
> Thanks for your time.
>
> Adrian Head
> Bytecomm P/L
I have a test system here with SoftRAID5 on 4 U160 SCSI disks. I'll try
to kill it today with cp jobs.
-Simon