Re: Soft lockup problem

To: Jan Kara <jack@xxxxxxx>
Subject: Re: Soft lockup problem
From: Gerard Saraber <gsaraber@xxxxxxxxx>
Date: Mon, 27 Feb 2012 08:38:05 -0600
Cc: linux-kernel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
In-reply-to: <20120207165452.GA1043@xxxxxxxxxxxxx>
References: <CAOgv-ziomS-+ML0Xb4oSoWTjzkG8BBBQcMjXwn7rjQV2pFtQ6Q@xxxxxxxxxxxxxx> <20120206225122.GF24840@xxxxxxxxxxxxx> <CAOgv-zjjwoTnN7=a1kRHL_yVRzc5w3BfUfsY0_X61-rjzJJ2Hg@xxxxxxxxxxxxxx> <20120207165452.GA1043@xxxxxxxxxxxxx>
Hi everyone,
It finally did it again. It took longer than I expected, and it
locked itself up so badly that I couldn't get in to hit
ctrl+alt+sysrq+w.
I had turned on the debugging feature that automatically logs hung
tasks, and I've attached the log below. I hope it's helpful.

I was running 3.2.4 from kernel.org on a 4 core Xeon machine:
model name      : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz

6GB Ram

2x Intel 80003ES2LAN Gigabit Ethernet Controllers bonded together
2 LSI SAS controllers:
08:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
0a:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
2008 [Falcon] (rev 03)

16 drives, a mix of 2TB and 3TB, in 3 raid5 arrays combined
with LVM into a 23TB volume formatted with XFS:
/dev/mapper/pool-main   23T   12T   11T  52%

The root partition is ext4 on an older SATA drive. I bring this up
because when I hit (on a whim) ctrl+sysrq+J, which is supposed
to unfreeze frozen filesystems, the console started dumping lots of
messages about attempting to unfreeze /dev/sda3 [my root partition], so
maybe there's a problem with my sda drive.
However, I get no I/O or other errors in my logs at all. I monitor all
drives with smartd to head off drive failures before they happen,
and it seems to think sda is fine.

Hopefully my attached log helps.
I appreciate any input. Also, please call me an idiot if I'm missing
something obvious.

-Gerard Saraber

On Tue, Feb 7, 2012 at 10:54 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Tue 07-02-12 10:35:37, Gerard Saraber wrote:
>> On Mon, Feb 6, 2012 at 4:51 PM, Jan Kara <jack@xxxxxxx> wrote:
>> > On Mon 06-02-12 09:40:45, Gerard Saraber wrote:
>> >> Greetings everyone,
>> >> I've been having a bit of a problem since upgrading to the linux 3.x
>> >> series, I have a machine that we're using as a NAS that runs various
>> >> rsync processes (mostly at night), lately after a day or two, I will
>> >> come in in the morning to a load average of 49, but the machine not
>> >> really doing anything, when trying to run 'dstat' the command just
>> >> hung with no output at all. there were no errors in the logs, or even
>> >> anything that would vaguely point at anything I could work with.
>> >> So needing to get the machine back to work I attempted to reboot it
>> >> "shutdown -r now" on console... it gives a nice message saying it's
>> >> going to reboot, but nothing ever happens.. the only way to reboot it
>> >> is by using ctrl + alt + sysrq + b. after which the machine reboots
>> >> and the raid array comes back clean.
>> >>
>> >> I'm not sure how to troubleshoot this, any pointers would be appreciated.
>> >>
>> >> I'm compiling 3.2.4 at the moment and found a bunch of possibly useful
>> >> options in the kernel debugging section:
>> >> detect hard/soft lockups and detect hung tasks, maybe it'll give me
>> >> something more to go on.
>> >>
>> >> Some details about the machine:
>> >> Linux xenbox 3.2.2 #1 SMP Sun Jan 29 10:28:22 CST 2012 x86_64 Intel(R)
>> >> Xeon(R) CPU 5140 @ 2.33GHz GenuineIntel GNU/Linux
>> >> It has 3 software raid arrays (2 x 5 drives and 1 x 4 drives) LVM'ed
>> >> together into a 23TB XFS filesystem.
>> >> 6GB memory and a pair of Intel Gigabit ethernet controllers bonded 
>> >> together.
>> >  Hmm, might be some deadlock in the filesystem. Adding XFS guys to CC.
>> > Can you run 'echo w >/proc/sysrq-trigger' and post output of dmesg here?
>> >
>> >                                                                Honza
>> > --
>> > Jan Kara <jack@xxxxxxx>
>> > SUSE Labs, CR
>> Thanks for the quick reply,
>> the machine is running good at the moment so I'm not sure if the
>> output helps, but here it is:
>> [I'll also be sure to grab this log the next time it locks]
>  Yeah. Sorry, I was not clear but I meant you should grab the traces when
> the machine locks up again...
>                                                                Honza
> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
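
On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a load of 49 on an otherwise idle machine suggests many blocked tasks. As a rough complement to the sysrq-w traces requested above, the sketch below lists D-state processes from userspace by reading /proc/<pid>/stat. It is only an illustration under stated assumptions: the function names parse_stat_line and blocked_tasks are made up here, it looks only at each process's main thread, and it cannot help once the machine is too wedged to run anything.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling blocked_tasks() periodically (for example from cron) and appending the result to a file on a disk other than the suspect volume would leave a record of which processes got stuck first. The kernel stack traces from 'echo w >/proc/sysrq-trigger' are still what is needed to see where they are stuck.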

Attachment: hlog.txt
Description: Text document
