
Re: XFS dying when many processes copy many files/directories

To: jtrostel@xxxxxxxxxxxxxx, Eric Sandeen <sandeen@xxxxxxx>
Subject: Re: XFS dying when many processes copy many files/directories
From: Adrian Head <ahead@xxxxxxxxxxxxxx>
Date: Thu, 27 Dec 2001 13:56:01 +1000
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <XFMail.20011221094934.jtrostel@xxxxxxxxxxxxxx>
References: <XFMail.20011221094934.jtrostel@xxxxxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'll give all the details I think are required - if I miss any, please contact 
me.

The problem can be simulated very easily with a shell script that creates 
many background cp processes copying directories around an XFS volume.  I 
first came across this problem back in September and there was a thread about 
it on the mailing list, but I was never able to resolve it and had to drop it 
until I had some spare time.
http://marc.theaimsgroup.com/?l=linux-xfs&m=100008724000734&w=2
(It was one of two threads at the time.)

John - I have explained a little bit about the simulation test here:
http://marc.theaimsgroup.com/?l=linux-xfs&m=100854867117179&w=2

My standard process before putting servers into production is to run a few 
tests to make sure that I can trust the hardware/software combination.  One 
of my standard tests is to simulate many users simultaneously copying many 
files across the filesystem.  The volume is a software raid5 over 4 IDE 
drives.

It is during this test that the machine hangs, after getting about 95% of the 
way through.  I have tried running this test with XFS, ext2, ext3 and 
reiserfs, and only XFS fails to complete.  This has been completely 
reproducible every time I have run the test to date.

#==============

I do this test with the following simple script, which just starts many cp 
processes in the background.  (The directories in this test are ~266M each, 
consisting of the standard files found on office fileservers - Word, Excel, 
PowerPoint, PDFs, ZIPs, etc.  The smallest files are under 1k, the largest 
around 10M, and the tree is probably nested 10 deep at most.)

#=========
#!/bin/sh
# Simulate many people copying files concurrently.

# One foreground copy first, to get the source directory cached.
cp -fr 01 2

# Then spawn the background copies.
for (( i=80; i!=2; i-- )) ; do
  cp -fr 01 $i &
done
#=========
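(The loop counts down from 80 and stops when i reaches 2, so it spawns 78 
background copies on top of the one foreground copy.  Note also that the 
(( )) arithmetic loop is a bashism - fine here since /bin/sh is bash on Red 
Hat, but worth knowing if you try it elsewhere.)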

All I do is copy the directory from another system drive to the s/w raid5 
volume in question and then run the script.  The script copies the directory 
once in the foreground to get it cached and then begins starting background 
cp processes copying the directory across the volume.  Before the CVS update 
(18 Dec) the system would die running fewer than 80 background processes; 
now it survives that but dies running fewer than 160 processes.  ext3 and 
reiserfs just keep going until finished.
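For anyone wanting to reproduce this, the whole procedure is roughly the 
following (the paths and script name here are just illustrative):

#=========
# Seed the test directory from another drive, then kick off the copies.
mkdir -p /mnt/raid5/test
cp -r /data/01 /mnt/raid5/test/   # the ~266M directory tree described above
cd /mnt/raid5/test
sh ./copytest.sh                  # the script above
#=========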

What happens is that the test starts and takes a few hours.  I usually also 
have top running in another console.  When the machine hangs there are no 
messages in the logs or on screen - the only indications are that there is no 
disk activity and that top shows all the cp processes in the "D" state.  
Nine times out of ten kupdated is still shown as running in top.  Usually I 
can change consoles using the keyboard and can get into kdb.  I have some 
shorthand traces of kupdated and some random cp processes in previous emails; 
if needed I can do more detailed ones.
http://marc.theaimsgroup.com/?l=linux-xfs&m=100889210619543&w=2
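If more detail would help, the D-state processes can also be captured from a 
responsive console without kdb (assuming a shell still answers at all):

#=========
# List uninterruptible (D-state) processes and the kernel symbol
# they are sleeping in; standard procps, nothing XFS-specific.
ps -eo pid,stat,wchan,comm | awk '$2 ~ /^D/'
#=========

Alt+SysRq+T will also dump kernel stack traces for every task to the 
console/log, if the magic SysRq key is compiled in.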

John - not sure if this answered your question - if not just give me a hint 
of what info you need.

Eric - the system details are:

Athlon 1.2G on an ASUS A7V133 with 384M PC133 SDRAM.
3x Intel PRO/100+ NICs, 2 of them bonded using the kernel bonding module.
2x Promise Ultra100TX2 & 1x Promise Ultra100 onboard.
hda (40G) - various + LVM
hdb (40G) - various (currently my backup drive - not used)
hdc - cdrom
hde (40G) - hde1 (RAID5/DISK0/ACTIVE)
hdg (40G) - hdg1 (RAID5/DISK1/ACTIVE)
hdi (40G) - hdi1 (RAID5/DISK2/ACTIVE)
hdk (40G) - hdk1 (RAID5/DISK3/ACTIVE)
hdm (40G) - hdm1 (RAID0/DISK0/ACTIVE) (currently not used)
hdo (40G) - hdo1 (RAID0/DISK1/ACTIVE)  (currently not used)
hdp (40G) - hdp1 (RAID5/DISK4/HOT-SPARE)
All drives are IBM 40G 60GXP's

The system disk is partitioned up like:
=======================================================================
- hda1 -                 /boot                       20M  ext3
- hda2 -                 /                          512M  ext3
- hda3 -                 /usr                       512M  ext3
- hda4 -                 {extended}
- hda5 -                 /var                       512M  ext3
- hda6 -                 swap                       768M
- hda7 -                 {LVM volume group HDA}   36941M - remaining space (36G)
  - /dev/HDA/TMP         /tmp                      128M  reiserfs
  - /dev/HDA/CACHE       /cache                    512M  reiserfs
  - /dev/HDA/CHROOT      /chroot                   512M  ext3
  - /dev/HDA/SRC         /usr/src                 4096M  reiserfs
  - /dev/HDA/SRC_SNAP    /usr/src/snapshot/admin  1024M
  - /dev/HDA/SRC_SNAP_DY /usr/src/snapshot/daily  1024M
                                                  (38G unallocated)
=======================================================================
and the RAID5 is just a standard straight XFS filesystem on top of md0.
/dev/md0 -      /mnt/raid5              117801M
=======================================================================
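For completeness, the array membership and state above can be confirmed 
straight from the standard md interface (this is not XFS-specific):

#=========
# Shows the raid5 personality, the member disks and the hot spare.
cat /proc/mdstat
#=========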

The recent kernels I have tried from CVS are:
2.4.16-xfs      20011218
2.4.16-xfs      20011223
2.4.17-xfs      20011226

The kernel was compiled with the -O0, -O2 & -O3 flags using the standard gcc 
for both RH7.1 & RH7.2.  Currently I'm using (RH7.2):
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-89)

I have not tried the old kgcc yet.  I just haven't been able to get gcc & 
kgcc coexisting happily yet - this is on the todo list though.

I have used both base (minimum install) RH7.1 & RH7.2 distros with the 
current cmd_rpms from the SGI ftp server.

The mkfs command I use is:
mkfs -t xfs /dev/md0

The mount details in fstab are:
/dev/md0        /mnt/raid5      xfs     defaults        1 2

The mount command I use on the command line is:
mount -t xfs /dev/md0 /mnt/raid5

Whether the volume is mounted at boot or later doesn't matter - the problem 
still exists.
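To rule out any difference in mount options between the boot-time and manual 
mounts, the options the kernel actually used can be checked directly:

#=========
# Shows the live mount options for the raid5 volume.
grep md0 /proc/mounts
#=========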

Not sure what other info you guys need - what have I missed?  The box I'm 
currently using is my personal home fileserver, so I'm happy to run any 
tests as required.

Thanks for your help.  I hope that both of you have had a Merry Christmas.

On Sat, 22 Dec 2001 00:49, you wrote:
> Adrian-
>
> Can you tell me how to reproduce this problem so I can try so investigating
> here too?
>

- -- 
Adrian Head

(Public Key available on request.)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE8KpvV8ZJI8OvSkAcRAhDcAJ9DfaqHJ6ScR3/rovieCuOoCa0AcACfTskO
FyToBLg0/B7Xu8uRuiXpy+Y=
=Sb8F
-----END PGP SIGNATURE-----

