Over the last few days I have had the opportunity to run a few more tests to
gather more information about the problem I am seeing when I run
multiple copy jobs in the background across an XFS volume.
I downloaded the 2.4.9-xfs-2001-08-26 kernel patch from the XFS ftp server
and gave it a run with the multiple cp test. Following the same
procedure as in my previous post, a few minutes into the test I still got
the Input/Output error messages printed on the console, but this time I also
had the following messages printed in /var/log/messages:
Sep 10 10:14:57 ATLAS kernel: I/O error in filesystem ("md(9,0)") meta-data
dev 0x900 block 0x9802bdc
Sep 10 10:14:57 ATLAS kernel: ("xlog_iodone") error 5 buf count 32768
Sep 10 10:14:57 ATLAS kernel: xfs_force_shutdown(md(9,0),0x2) called from
line 940 of file xfs_log.c. Return address = 0xd8cb66f8
Sep 10 10:14:57 ATLAS kernel: Log I/O Error Detected. Shutting down
filesystem: md(9,0)
Sep 10 10:14:57 ATLAS kernel: Please umount the filesystem, and rectify the
problem(s)
Sep 10 10:14:57 ATLAS kernel: xfs_force_shutdown(md(9,0),0x2) called from
line 714 of file xfs_log.c. Return address = 0xd8cb65d3
Sep 10 10:14:57 ATLAS kernel: attempt to access beyond end of device
Sep 10 10:14:57 ATLAS kernel: 02:82: rw=0, want=1602235696, limit=4
I'm not sure what these really mean or what caused the I/O error but I hope
this sheds some more light on the problem. (I had to transfer this by hand so
if there are any uncertainties I'll go to the effort of getting the log and
posting it.)
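As a reading aid for the numbers in those messages, here is a minimal Python sketch. It assumes the 2.4-era convention that a device number such as 0x900 packs the major number in the high byte and the minor number in the low byte, and it assumes that the `02:82` field is a hexadecimal major:minor pair. The helper names `split_kdev` and `parse_kdevname` are made up for illustration.

```python
import errno

def split_kdev(dev):
    """Split a 16-bit device number into (major, minor).

    Assumption: high byte is the major number, low byte is the minor.
    """
    return (dev >> 8) & 0xFF, dev & 0xFF

def parse_kdevname(name):
    """Parse a hex 'MM:mm' device name such as '02:82' into (major, minor)."""
    major, minor = name.split(":")
    return int(major, 16), int(minor, 16)

print(split_kdev(0x900))        # device from the XFS I/O error line -> (9, 0)
print(parse_kdevname("02:82"))  # device from the "beyond end of device" line -> (2, 130)
print(errno.errorcode[5])       # symbolic name for "error 5" -> EIO
```

Under those assumptions, dev 0x900 is the same device as md(9,0), and "error 5" is EIO, which matches the Input/Output errors on the console. The `02:82` pair does not correspond to the md device, so the last two log lines may refer to a different or bogus device.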
My other attempt was to download the 2.4.10-pre2-xfs-2001-09-02
kernel patch and run the same multiple cp test.
This time things were different:
The test did not die a few minutes in like the previous
attempts. Because it had not died I added more background processes - 40
total instead of the 20 in the previous tests. About halfway through the
test, for some unknown reason, one cp process segfaulted because of a kernel
NULL pointer dereference. Why this happened to only one cp process I'm not
sure - it certainly did not affect the other processes and the test continued. The
test ran fine for almost 3 hours until I noticed that all HDD activity had
stopped. Thinking that the test had completed, I checked the jobs in the
shell, which reported that all jobs except the one that had segfaulted were
still running. I checked `top`, which also showed that all cp processes were
running. Thinking it might have been a hardware issue, I ran `ls *` on a
couple of drives including the raid5 (SW) where I was running the test.
Although it was slow, everything worked as normal until I did a `du -sh` on
the volume that I had run the test on - with this the console froze. I
switched to another console and did a `df -h` which gave me a result. The multiple cp
test had stopped with only 20G left to fill on the 154G raid5 volume (about
83% complete). I was able to ping the machine from another box on the
network but I was unable to log in remotely through telnet. I left the
machine for another few hours but the status did not change. From the
console that was still responding I shut down the machine. The machine
started to go down but hung not long after. Nothing was written to the logs.
Note: The consoles I am talking about are the virtual kind (Alt+F1, Alt+F2,
etc.). I'm not running anything graphical, just plain text consoles.
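For anyone who wants to try something similar, the following Python sketch shows one way a multiple cp test like this could be driven. It is not the exact procedure from the earlier post: the function names, the `copyNN` directory naming, and the use of `cp -a` are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path

def start_copy_jobs(source, dest_root, jobs):
    """Start `jobs` concurrent `cp -a` processes, each into its own directory.

    `dest_root` must already exist; each job copies `source` to
    dest_root/copy00, dest_root/copy01, and so on (hypothetical naming).
    """
    procs = []
    for i in range(jobs):
        target = Path(dest_root) / f"copy{i:02d}"
        procs.append(subprocess.Popen(["cp", "-a", str(source), str(target)]))
    return procs

def wait_for_jobs(procs):
    """Wait for every job and return the exit codes in start order.

    On POSIX a negative code means the process was killed by a signal,
    so a cp that segfaulted would show up as -11 (SIGSEGV).
    """
    return [p.wait() for p in procs]
```

A run with 40 jobs would then look like `wait_for_jobs(start_copy_jobs("/data/src", "/mnt/raid5", 40))`, where both paths are placeholders. Collecting the exit codes makes it easy to see which job died and which ones never returned.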
Some details from `top` (just in case it helps):
up 2:45 4 users load average 40.15 40.41 40.70
66 processes 27 sleeping 39 running
CPU states 0% user 36.6% sys 0% nice 63.3% idle.
Mem 384520k av 381956k used 2564k free
0k shrd 216k buff 353916k cached
Swap 524624k av 14464k used 510160k free
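A quick sanity check of those hand-copied figures, as a small Python sketch. The variable names are only for illustration, and the reading of "cached" as page cache is the usual interpretation of that column rather than something verified on this machine.

```python
# Figures copied from the top output above (all in kilobytes).
mem_total_k, mem_used_k, mem_free_k = 384520, 381956, 2564
cached_k = 353916
swap_used_k, swap_free_k = 14464, 510160

# Used plus free should add up to the available total.
assert mem_used_k + mem_free_k == mem_total_k

swap_total_k = swap_used_k + swap_free_k
cached_share = cached_k / mem_total_k

print(f"swap total: {swap_total_k}k")  # 524624k, matching the Swap line
print(f"cached share of RAM: {cached_share:.1%}")
```

The numbers are internally consistent, and roughly 92% of RAM is being used as cache, which is what one would expect during heavy copying and does not by itself point to a memory shortage.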
The second test has thrown me for a complete spin. Am I seeing the same
problem as in the original test, or is this something else? Considering that
I was pushing the machine way past what I would expect to see in the
production environment, do I need to worry?
The IRQ probe failure I talked about in the original post still exists on
both of the kernels built for these test attempts. Do I need to worry about
these IRQ probe failures?
At this stage I'm unsure what other info people would like. If anyone wants
logs, config files, more information or more testing please tell me.
Thanks for your time.
Adrian Head
Bytecomm P/L