On Thu, 2003-09-25 at 16:14, Bernd Strieder wrote:
> Hello
>
> The filesystem in question is on a 36GB SCSI drive, the only
> partition on this drive. The fs contains 22 GB of data; quotas are
> used with few exceptions. The filesystem is exported to about 15
> Linux clients, 1 Debian, the others SuSE 8.2 and 8.1, and 1
> OpenBSD client. The server is P3 2-way SMP with 4GB of RAM.
>
> Occasionally xfsdump hangs in uninterruptible sleep. That is,
> it happens now and then, but I have not been able to trigger
> the problem deliberately.
>
> Usually xfsdump is run at night writing about 20 GB via rmt to a
> Sun box with a DLT streamer attached to it. If the problem
> happens, it must be at the beginning of the dump, judging from
> the backup logs I have.
>
> I have not found a way to kill xfsdump in this state; the machine
> has to be rebooted, or the xfsdump started the following night
> will get into the same state. There are no diagnostics anywhere
> in the system: syslog, console, dmesg. ps says xfsdump is in
> lock_p.
>
> The problem happens with the SuSE kernels delivered with SuSE 8.1
> and 8.2, with Linus kernels with XFS patched in, and with
> -ac kernels. All kernels I have tried show the problem. Before
> the update to SuSE 8.2 it once took 3 months between two cases
> of hanging xfsdump using a vanilla 2.4.21 kernel with xfs 1.2
> patched in.
>
> I have tried xfsdump to /dev/null and putting the system under
> load (more disk load, more network load), but I could not
> trigger the problem. The tape drive swallows about 5MB/sec; when
> dumping to /dev/zero the rate is about 15MB/sec, which should put
> more stress on the system.
>
> Twice, the hanging xfsdump was not noticed for some days and the
> system became unstable, with the kernel NFSd hanging, which made
> a reboot mandatory.
>
> Any ideas?
>
> Bernd Strieder

Using sysrq to get a stack trace of the xfsdump thread in the kernel
will give some pointers to where it is hanging. Since it happens
when you use the real tape rather than the dummy one, there is a fair
chance that the tape end of the dump is where the problem lies.

You need a kernel built with sysrq support, and you need to turn it
on at runtime; there is then an option to dump the stacks of all
kernel threads. All of this is documented in the kernel source
directory under Documentation/sysrq.txt.
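
As a rough sketch of the steps above (not something from the original thread), the following Python helpers enable sysrq through /proc/sys/kernel/sysrq and request the task dump by writing 't' to /proc/sysrq-trigger. It assumes the running kernel provides /proc/sysrq-trigger; on kernels without it, Alt-SysRq-T on the console does the same job. The function names and the proc_root parameter are made up for illustration. The second helper lists processes in state D, which is one way to confirm that xfsdump is stuck before asking for the dump.

```python
import os


def write_proc(path, value):
    """Write a short string to a /proc control file."""
    with open(path, "w") as f:
        f.write(value)


def dump_all_task_stacks(proc_root="/proc"):
    """Enable sysrq, then ask the kernel to dump every task's stack.

    The traces go to the kernel log (console, dmesg, syslog),
    not to this process. Needs root on a real system.
    """
    write_proc(os.path.join(proc_root, "sys", "kernel", "sysrq"), "1\n")
    write_proc(os.path.join(proc_root, "sysrq-trigger"), "t\n")


def uninterruptible_tasks(proc_root="/proc"):
    """Return sorted (pid, name) pairs for processes in state D."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            # The process exited while we were scanning.
            continue
        # The name sits in parentheses and may itself contain spaces
        # or ')', so split on the last closing parenthesis.
        close = stat.rindex(")")
        name = stat[stat.index("(") + 1:close]
        state = stat[close + 1:].split()[0]
        if state == "D":
            hung.append((int(entry), name))
    return sorted(hung)
```

Run as root, calling uninterruptible_tasks() and then dump_all_task_stacks() while xfsdump is hung should leave the stack traces in the dmesg output, where the xfsdump entry can be looked up by PID.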

Running strace on the binary is also an option, but that requires
being able to reproduce the problem.

Steve
--
Steve Lord <lord@xxxxxxx>