-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I'll give all the details I think are required - if I miss any please contact
me.
The problem can be simulated very easily with a shell script that creates
many background cp processes copying directories around an XFS volume. I
first came across this problem back in September and there was a thread on
the mailing list. I was never able to resolve the problem, so I had to shelve
it until I had some spare time.
http://marc.theaimsgroup.com/?l=linux-xfs&m=100008724000734&w=2
(Was one thread of 2 at the time.)
John - I have explained a little bit about the simulation test here:
http://marc.theaimsgroup.com/?l=linux-xfs&m=100854867117179&w=2
My standard process before putting servers into production is to run a few
tests to make sure that I can trust the hardware/software combination. One
of my standard tests is to simulate many users simultaneously copying many
files across the filesystem. The volume is a software RAID5 over 4 IDE
drives.
It is during this test that the machine hangs after getting about 95%
complete. I have tried running this test using XFS, ext2, ext3 and reiserfs,
and only XFS fails to complete. This has been completely reproducible every
time I have run the test to date.
#==============
I do this test with the following simple script, which just starts many cp
processes in the background. (The directories in this test are ~266M each,
consisting of the standard files found on office fileservers: Word, Excel,
PowerPoint, PDFs, ZIPs etc. The smallest would be less than 1k, the largest
in the 10M range, and the tree is probably nested 10 deep at most.)
#=========
#!/bin/bash
# Simulate many people copying files.
# (Bash, not plain sh - the arithmetic for-loop below is a bashism.)
# Copy once in the foreground first so the source directory is cached,
cp -fr 01 2
# then start background copies into directories 80 down to 3.
for (( i=80; i!=2; i-- )) ; do
    cp -fr 01 $i &
    # echo $i
done
#=========
All I do is copy the directory from another system drive to the s/w RAID5
volume in question and then run the script. The script copies the directory
once in the foreground to get it cached, then starts background cp processes
copying the directory across the volume. Before the CVS update (18 Dec) the
system would die running fewer than 80 background processes; now it survives
that but dies running fewer than 160. ext3 and reiserfs just keep going until
finished.
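For anyone reproducing this, here is a sketch of how the test's progress
could be watched from another console (assumes pgrep from procps is
available; not part of the test itself):

```shell
#!/bin/sh
# Watch the number of live background cp processes once a second;
# exits once they have all finished (or if none were started).
while sleep 1; do
    n=$(pgrep -c -x cp)
    echo "cp processes still running: ${n:-0}"
    [ "${n:-0}" -eq 0 ] && break
done
```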
What happens is that the test starts and takes a few hours. I usually also
have top running in another console. When the machine hangs there are no
messages in the logs or on screen. The only indications are that there is no
disk activity and that top shows all the cp processes in "D" state.
9 times out of 10 kupdated is still shown as running in top. Usually I can
still change consoles using the keyboard and can get into kdb. I have some
shorthand traces of kupdated and some random cp processes in previous emails.
If needed I can do more detailed ones.
http://marc.theaimsgroup.com/?l=linux-xfs&m=100889210619543&w=2
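For reference, a minimal way to pull out the stuck processes from a working
console (assumes procps; the SysRq dump additionally needs a kernel built
with CONFIG_MAGIC_SYSRQ and root access):

```shell
#!/bin/sh
# List every process currently in uninterruptible ("D") sleep.
ps axo pid,stat,comm | awk '$2 ~ /^D/'

# With magic SysRq enabled, this dumps a backtrace of every task
# into the kernel log for a much more detailed picture (as root):
#   echo t > /proc/sysrq-trigger
#   dmesg | tail -50
```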
John - not sure if this answered your question - if not just give me a hint
of what info you need.
Eric - the system details are:
Athlon 1.2G running on an ASUS A7V133 with 384M 133-SDRAM.
3x Intel 100+Pro NICs, 2 bonded using the kernel module.
2x Promise Ultra100TX2 & 1x Promise Ultra100 onboard.
hda (40G) - various + LVM
hdb (40G) - various (currently my backup drive - not used)
hdc - cdrom
hde (40G) - hde1 (RAID5/DISK0/ACTIVE)
hdg (40G) - hdg1 (RAID5/DISK1/ACTIVE)
hdi (40G) - hdi1 (RAID5/DISK2/ACTIVE)
hdk (40G) - hdk1 (RAID5/DISK3/ACTIVE)
hdm (40G) - hdm1 (RAID0/DISK0/ACTIVE) (currently not used)
hdo (40G) - hdo1 (RAID0/DISK1/ACTIVE) (currently not used)
hdp (40G) - hdp1 (RAID5/DISK4/HOT-SPARE)
All drives are IBM 40G 60GXP's
The system disk is partitioned up like:
=======================================================================
- hda1 - /boot 20M ext3
- hda2 - / 512M ext3
- hda3 - /usr 512M ext3
- hda4 - {extended}
- hda5 - /var 512M ext3
- hda6 - swap 768M
- hda7 - {LVM volume group HDA} 36941M - remaining space (36G)
  - /dev/HDA/TMP /tmp 128M reiserfs
  - /dev/HDA/CACHE /cache 512M reiserfs
  - /dev/HDA/CHROOT /chroot 512M ext3
  - /dev/HDA/SRC /usr/src 4096M reiserfs
  - /dev/HDA/SRC_SNAP /usr/src/snapshot/admin 1024M
  - /dev/HDA/SRC_SNAP_DY /usr/src/snapshot/daily 1024M
(38G Unallocated)
=======================================================================
and the RAID5 is just a standard straight XFS filesystem on top of md0.
/dev/md0 - /mnt/raid5 117801M
=======================================================================
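(For completeness: the array's health before each run can be confirmed via
/proc/mdstat; a minimal check, guarded in case the md driver isn't loaded on
the machine it runs on:)

```shell
#!/bin/sh
# Show software-RAID array status; prints a note instead of
# failing if the md driver is not loaded on this machine.
cat /proc/mdstat 2>/dev/null || echo "md driver not loaded"
```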
The recent kernels I have tried from CVS are:
2.4.16-xfs 20011218
2.4.16-xfs 20011223
2.4.17-xfs 20011226
The kernel was compiled with the -O0, -O2 and -O3 flags using the standard
gcc for both RH7.1 and RH7.2. Currently I'm using (RH7.2):
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-89)
I have not tried the old kgcc yet. I just haven't been able to get gcc and
kgcc coexisting happily together yet - this is on the todo list though.
I have used both base (minimum install) RH7.1 & RH7.2 distros with the
current cmd_rpms from the SGI ftp server.
The mkfs command I use is:
mkfs -t xfs /dev/md0
The mount details in fstab are:
/dev/md0 /mnt/raid5 xfs defaults 1 2
The mount command I use on the command line is:
mount -t xfs /dev/md0 /mnt/raid5
Whether the volume is mounted at boot or later doesn't matter - the problem
still exists.
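A quick sanity check that the kernel really did mount it as xfs (nothing
XFS-specific here, just reading /proc/mounts):

```shell
#!/bin/sh
# Confirm what is actually mounted at /mnt/raid5, if anything.
grep ' /mnt/raid5 ' /proc/mounts || echo "/mnt/raid5 is not mounted"
```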
I'm not sure what other info you need - what have I missed out? The box
that I'm currently using is my personal home fileserver, so I'm happy to run
any tests as required.
Thanks for your help. I hope that both of you have had a Merry Christmas.
On Sat, 22 Dec 2001 00:49, you wrote:
> Adrian-
>
> Can you tell me how to reproduce this problem so I can try investigating
> here too?
>
- --
Adrian Head
(Public Key available on request.)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org
iD8DBQE8KpvV8ZJI8OvSkAcRAhDcAJ9DfaqHJ6ScR3/rovieCuOoCa0AcACfTskO
FyToBLg0/B7Xu8uRuiXpy+Y=
=Sb8F
-----END PGP SIGNATURE-----