It finally did it again. It took longer than I expected, and it also
locked itself up so badly that I couldn't get into it to hit
I had turned on the debugging feature that automatically logs hung
tasks, and I've attached the log below. I hope it's helpful.
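For anyone skimming that log, here is a minimal sketch of pulling the hung-task reports out of saved dmesg output. It assumes the usual message format printed by the kernel's hung task detector ("INFO: task NAME:PID blocked for more than N seconds."). The function name and the sample lines are made up for illustration and are not taken from the attached log.

```python
import re

# Assumed format of the hung task detector's message. A leading
# timestamp such as "[12345.678901]" is optional and ignored.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return a list of (command, pid, seconds) for each hung-task report."""
    hits = []
    for line in dmesg_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            hits.append(
                (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
            )
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, not real output from this machine.
    sample = (
        "[100.000000] INFO: task rsync:1234 blocked for more than 120 seconds.\n"
        "[100.000100] some unrelated line\n"
        "[220.000000] INFO: task flush-253:0:987 blocked for more than 120 seconds.\n"
    )
    for comm, pid, secs in find_hung_tasks(sample):
        print(comm, pid, secs)
```

The command name is matched greedily so that names containing a colon (the flusher threads, for example) still split at the last colon before the PID.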
I was running 3.2.4 from kernel.org on a 4 core Xeon machine:
model name : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
2x Intel 80003ES2LAN Gigabit Ethernet Controllers bonded together
2 LSI SAS controllers:
08:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
0a:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
2008 [Falcon] (rev 03)
16 drives, a mix of 2TB and 3TB, in 3 RAID5 arrays combined with LVM
into a 23TB volume formatted with XFS:
/dev/mapper/pool-main 23T 12T 11T 52%
The root partition is ext4 on an older SATA drive. The reason I bring
this up is that when I hit (on a whim) ctrl+sysrq+J, which is supposed
to unfreeze frozen filesystems, the console started dumping lots of
messages about attempting to unfreeze /dev/sda3 [my root partition], so
maybe there's a problem with my sda drive.
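The sysrq commands can also be sent without the keyboard by writing the command letter to /proc/sysrq-trigger, which is what the 'echo w' request quoted below does. A rough sketch of that, assuming root access and a kernel with sysrq enabled; the function names are hypothetical and the path parameter is only there so the write can be tried against an ordinary file first:

```python
import subprocess

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(key, path=SYSRQ_TRIGGER):
    """Write one sysrq command character to the trigger file (needs root)."""
    if len(key) != 1:
        raise ValueError("sysrq command must be a single character")
    with open(path, "w") as f:
        f.write(key)


def dump_blocked_tasks():
    """Ask the kernel to log blocked tasks ('w'), then return the kernel log."""
    trigger_sysrq("w")
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling trigger_sysrq("j") would be the equivalent of the thaw key combination mentioned above, and dump_blocked_tasks() the equivalent of 'echo w >/proc/sysrq-trigger' followed by dmesg.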
But I get no I/O or other errors in my logs at all. I monitor all
drives with smartd to head off any drive failures before they happen,
and it seems to think sda is fine.
Hopefully my attached log helps.
I appreciate any input, also please call me an idiot if I'm missing
On Tue, Feb 7, 2012 at 10:54 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Tue 07-02-12 10:35:37, Gerard Saraber wrote:
>> On Mon, Feb 6, 2012 at 4:51 PM, Jan Kara <jack@xxxxxxx> wrote:
>> > On Mon 06-02-12 09:40:45, Gerard Saraber wrote:
>> >> Greetings everyone,
>> >> I've been having a bit of a problem since upgrading to the linux 3.x
>> >> series, I have a machine that we're using as a NAS that runs various
>> >> rsync processes (mostly at night), lately after a day or two, I will
>> >> come in in the morning to a load average of 49, but the machine not
>> >> really doing anything, when trying to run 'dstat' the command just
>> >> hung with no output at all. there were no errors in the logs, or even
>> >> anything that would vaguely point at anything I could work with.
>> >> So needing to get the machine back to work I attempted to reboot it
>> >> "shutdown -r now" on console... it gives a nice message saying it's
>> >> going to reboot, but nothing ever happens.. the only way to reboot it
>> >> is by using ctrl + alt + sysrq + b. after which the machine reboots
>> >> and the raid array comes back clean.
>> >> I'm not sure how to troubleshoot this, any pointers would be appreciated.
>> >> I'm compiling 3.2.4 at the moment and found a bunch of possibly useful
>> >> options in the kernel debugging section:
>> >> detect hard/soft lockups and detect hung tasks, maybe it'll give me
>> >> something more to go on.
>> >> Some details about the machine:
>> >> Linux xenbox 3.2.2 #1 SMP Sun Jan 29 10:28:22 CST 2012 x86_64 Intel(R)
>> >> Xeon(R) CPU 5140 @ 2.33GHz GenuineIntel GNU/Linux
>> >> It has 3 software raid arrays (2 x 5 drives and 1 x 4 drives) LVM'ed
>> >> together into a 23TB XFS filesystem.
>> >> 6GB memory and a pair of Intel Gigabit ethernet controllers bonded
>> >> together.
>> > Hmm, might be some deadlock in the filesystem. Adding XFS guys to CC.
>> > Can you run 'echo w >/proc/sysrq-trigger' and post output of dmesg here?
>> > Honza
>> > --
>> > Jan Kara <jack@xxxxxxx>
>> > SUSE Labs, CR
>> Thanks for the quick reply,
>> the machine is running good at the moment so I'm not sure if the
>> output helps, but here it is:
>> [I'll also be sure to grab this log the next time it locks]
> Yeah. Sorry, I was not clear but I meant you should grab the traces when
> the machine locks up again...
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
Description: Text document