To: linux-xfs@xxxxxxxxxxx
Subject: Re: Problems with many processes copying large directories across an XFS volume.
From: Adrian Head <ahead@xxxxxxxxxxxxxx>
Date: Mon, 10 Sep 2001 12:00:30 +1000
Reply-to: adrian.head@xxxxxxxxxxxxxxx
Sender: owner-linux-xfs@xxxxxxxxxxx

Over the last few days I have had the opportunity to run a few more tests to 
gather more information about the problem I see when I run multiple copy jobs 
in the background across an XFS volume.
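
For reference, the multiple cp test is basically a number of recursive copies 
started in the background against the same XFS volume (the exact commands are 
in my previous post).  A rough sketch of the idea, with placeholder paths and 
the 20-job count from the earlier tests:

    #!/bin/sh
    # Sketch of the multiple cp test: start N recursive copies in the
    # background on the same XFS (software RAID5) volume, then wait.
    # SRC/DEST are placeholders; see my previous post for the real setup.
    SRC=/raid5/source-tree
    DEST=/raid5/copies
    JOBS=20                      # 40 in the later 2.4.10-pre2 run

    i=1
    while [ $i -le $JOBS ]; do
        cp -a "$SRC" "$DEST/copy-$i" &
        i=`expr $i + 1`
    done
    wait                         # wait for all the background cp jobs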

I downloaded the 2.4.9-xfs-2001-08-26 kernel patch from the XFS ftp server 
and gave it a run with the multiple cp test.  Running through the same 
procedure as in my previous post, I still got Input/Output error messages 
printed on the console a few minutes into the test, but this time the 
following messages were also written to /var/log/messages:

Sep 10 10:14:57 ATLAS kernel: I/O error in filesystem ("md(9,0)") meta-data dev 0x900 block 0x9802bdc
Sep 10 10:14:57 ATLAS kernel: ("xlog_iodone") error 5 buf count 32768
Sep 10 10:14:57 ATLAS kernel: xfs_force_shutdown(md(9,0),0x2) called from line 940 of file xfs_log.c.  Return address = 0xd8cb66f8
Sep 10 10:14:57 ATLAS kernel: Log I/O Error Detected. Shutting down filesystem: md(9,0)
Sep 10 10:14:57 ATLAS kernel: Please umount the filesystem, and rectify the problem(s)
Sep 10 10:14:57 ATLAS kernel: xfs_force_shutdown(md(9,0),0x2) called from line 714 of file xfs_log.c.  Return address = 0xd8cb65d3
Sep 10 10:14:57 ATLAS kernel: attempt to access beyond end of device
Sep 10 10:14:57 ATLAS kernel: 02:82: rw=0, want=1602235696, limit=4

I'm not sure what these really mean or what caused the I/O error, but I hope 
this sheds some more light on the problem.  (I had to transcribe this by hand, 
so if anything looks doubtful I'll go to the effort of getting the actual log 
and posting it.)

For the other attempt I downloaded the 2.4.10-pre2-xfs-2001-09-02 kernel 
patch and ran the same multiple cp test.

This time things were different.  The machine did not die a few minutes into 
the test like the previous attempts, so I added more background processes: 40 
in total instead of the 20 used in the earlier tests.  About halfway through 
the test, for some unknown reason, one cp process segfaulted on a null kernel 
pointer.  Why this happened to only one cp process I'm not sure; it certainly 
did not affect the other processes, and the test continued.

The test ran fine for almost 3 hours until I noticed that all HDD activity 
had stopped.  Thinking that the test had completed, I checked the jobs in the 
shell, which reported that all jobs except the one that had segfaulted were 
still running.  I also checked `top`, which showed all the cp processes as 
running.  Thinking it might be a hardware issue, I ran `ls *` on a couple of 
drives, including the software RAID5 volume the test was running on.  Although 
it was slow, everything worked as normal until I did a `du -sh` on the volume 
I had run the test on; at that point the console froze.  I switched to another 
console and did a `df -h`, which did return a result.  The multiple cp test 
had stopped with only 20G left to fill on the 154G RAID5 volume (about 83% 
complete).  I was able to ping the machine from another box on the network, 
but I was unable to log in remotely through telnet.  I left the machine for 
another few hours, but the status did not change.  From the console that was 
still responding I shut the machine down; it started to go down but hung not 
long after.  Nothing was written to the logs.

Note: the consoles I am talking about are the virtual kind (Alt+F1, Alt+F2, 
etc.).  I'm not running anything graphical, just plain text consoles.
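
For what it's worth, one way to tell whether those "running" cp jobs were 
actually making progress or just blocked waiting on I/O would be to look at 
their process state; a process stuck in uninterruptible (D) sleep still shows 
up under `jobs` and `top` even though it never gets any CPU.  A rough sketch 
(the /raid5 mount point and the `cp -a` command are placeholders):

    # List the cp processes with their state and kernel wait channel;
    # 'D' in the STAT column means uninterruptible sleep (stuck on I/O).
    # The [c] stops grep from matching its own command line.
    ps axo pid,stat,wchan,args | grep '[c]p -a'

    # See whether used space on the test volume is still changing.
    df -h /raid5; sleep 60; df -h /raid5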

Some details from `top`, just in case it helps:

  up 2:45, 4 users, load average: 40.15, 40.41, 40.70
  66 processes: 27 sleeping, 39 running
  CPU states:  0.0% user, 36.6% system, 0.0% nice, 63.3% idle
  Mem:  384520k av, 381956k used, 2564k free, 0k shrd, 216k buff, 353916k cached
  Swap: 524624k av, 14464k used, 510160k free

The second test has thrown me for a complete spin.  Am I seeing the same 
problem as in the original test, or is this something else?  Considering that 
I was pushing the machine way past what I would expect to see in the 
production environment, do I need to worry?

The IRQ probe failure I talked about in the original post still exists on 
both of the kernels built for these test attempts.  Do I need to worry about 
these IRQ probe failures?

At this stage I'm unsure what other info people would like.  If anyone wants 
logs, config files, more information or more testing please tell me.

Thanks for your time.

Adrian Head
Bytecomm P/L

