Still seeing hangs in xlog_grant_log_space

Mark Tinguely tinguely at sgi.com
Wed May 16 16:29:01 CDT 2012


On 05/16/12 14:03, Chris J Arges wrote:
> On 05/16/2012 01:42 PM, Ben Myers wrote:
>> Hey Chris,
>>
>> On Thu, May 10, 2012 at 04:11:27PM +0000, Chris J Arges wrote:
>>> <snip>
>>>> Canonical attached them to the bug report that they filed yesterday:
>>>> http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
>>>>
>>>> ...Juerg
>>>>
>>>
>>> Hello,
>>> I am able to reproduce this bug with the instructions posted in this bug. Let me
>>> know what I can do to help.
>>
>> The bug shows:
>>
>> |This has been tested on the following kernels which all exhibit the same
>> |failures:
>> |- 3.2.0-24 (Ubuntu Precise)
>> |- 3.4.0-rc4
>> |- 3.0.29
>> |- 3.1.10
>> |- 3.2.15
>> |- 3.3.2
>>
>> Can you find an older kernel that isn't broken?
>>
>
> Sure, I can start digging further back.
> Also 2.6.38-8-server was the original version that this bug was reported
> on. So I can try testing circa 2.6.32 to see if that also fails.
> --chris
>
>> -Ben
>>
>

What I know so far:
I have added a log cleaner kicker to xlog_grant_head_wake(). At best, 
this kicker avoids waiting for the next sync before the log cleaner 
starts.

I have one machine that has been running for 2 days without hanging. 
Actually, now I would prefer it to hurry up and hang.

Here is what I see on the machine that is hung:

A few processes (4-5) are hung waiting to get space on the log. There 
isn't enough free space on the log for the first transaction and it 
waits. All other processes will have to wait behind the first process. 
251,811 bytes of the original 589,842 bytes should still be free (if my 
hand calculation of the free space is correct).
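
To make the "everyone waits behind the first process" point concrete, here 
is a toy model of a strictly first-in, first-out grant queue. It is only a 
sketch and not the kernel code: the function name wake_waiters() and the 
three reservation sizes are made up for illustration. Only the two byte 
counts come from the numbers above.

```python
from collections import deque

LOG_SIZE = 589_842   # bytes, from the message
FREE = 251_811       # bytes, the hand-calculated free space from the message


def wake_waiters(free_bytes, waiters):
    """Wake waiters strictly in FIFO order.

    Stops at the first waiter whose reservation does not fit, so a large
    request at the head of the queue blocks smaller ones behind it.
    Returns (woken_names, still_waiting_names, free_bytes_left).
    """
    queue = deque(waiters)
    woken = []
    while queue:
        name, need = queue[0]
        if need > free_bytes:
            break  # head cannot be satisfied; nobody behind it is tried
        free_bytes -= need
        woken.append(name)
        queue.popleft()
    return woken, [name for name, _ in queue], free_bytes


# Hypothetical reservations: the sizes are invented, only the order matters.
waiters = [("first", 300_000), ("second", 50_000), ("third", 10_000)]

woken, waiting, left = wake_waiters(FREE, waiters)
print(f"free: {FREE} of {LOG_SIZE} bytes ({FREE / LOG_SIZE:.1%})")
print("woken:", woken)
print("still waiting:", waiting)
```

With these invented sizes nobody is woken, even though the second and 
third requests would fit in the free space on their own.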

The AIL is empty. There is nothing to clean. Any new transaction at this 
point will kick the cleaner, and it still can't start the first waiter, 
so it joins the wait list.

The only XFS traffic at this point is the inode reclaim worker. This is 
to be expected.

The CIL has entries, nothing is waiting on the CIL. xc_current_sequence 
= 117511 xc_push_seq = 117510. So there is nothing for the CIL worker to do.
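
The comparison behind that conclusion can be written down as a one-line 
check. This is a simplified model of the reasoning, not the kernel code, 
and the function name cil_push_pending() is hypothetical. It assumes a 
push request only has work to do when it targets the currently open 
checkpoint sequence or a later one.

```python
def cil_push_pending(push_seq, current_seq):
    """Return True if the requested push sequence has not been pushed yet.

    Assumption: sequences below current_seq have already been pushed, so
    a request for one of them leaves the CIL worker with nothing to do.
    """
    return push_seq >= current_seq


# Values from the hung machine.
xc_current_sequence = 117511
xc_push_seq = 117510

print("push pending:", cil_push_pending(xc_push_seq, xc_current_sequence))
```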

117511 is the largest sequence number that I have found so far in the 
xfs_log_item list. There are a few entries with smaller sequence numbers 
and the following strange entry:

77th entry in the xfs_log_item list:

crash> struct xfs_log_item ffff88083222b5b8
struct xfs_log_item {
   li_ail = {
     next = 0xffff88083222b5b0,
     prev = 0x0
   },
   li_lsn = 0,
   li_desc = 0x9f5d9f5d,
   li_mountp = 0xffff88083116e300,
   li_ailp = 0x0,
   li_type = 0,
   li_flags = 0,
   li_bio_list = 0x0,
   li_cb = 0,
   li_ops = 0xffff88083105de00,
   li_cil = {
     next = 0xffff880832ad9f08,
     prev = 0xffff880831751448
   },
   li_lv = 0xc788c788,
   li_seq = -131906182637504
}

Everything in this entry is bad except the li_cil.next and li_cil.prev. 
It looks like li_ail.next is really part of a list that starts at 
0xffff88083222b5b0. The best explanation is that a junk address was 
inserted into the li_cil chain.
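
For anyone checking entries by hand, the same sanity checks can be 
scripted. The sketch below only re-examines the values printed by crash 
above. The test "top 16 bits are 0xffff" is an assumption based on how the 
plausible pointers in this dump look, not a general rule, and the helper 
name looks_like_kernel_pointer() is made up.

```python
MASK64 = (1 << 64) - 1
ITEM_ADDR = 0xffff88083222b5b8  # address passed to crash above

# Pointer-typed fields exactly as printed by crash.
ptr_fields = {
    "li_ail.next": 0xffff88083222b5b0,
    "li_ail.prev": 0x0,
    "li_desc": 0x9f5d9f5d,
    "li_mountp": 0xffff88083116e300,
    "li_ailp": 0x0,
    "li_bio_list": 0x0,
    "li_ops": 0xffff88083105de00,
    "li_cil.next": 0xffff880832ad9f08,
    "li_cil.prev": 0xffff880831751448,
    "li_lv": 0xc788c788,
}
li_seq = -131906182637504  # printed as a signed 64-bit integer


def looks_like_kernel_pointer(value):
    """Heuristic (assumption): kernel addresses here start with 0xffff."""
    return (value & MASK64) >> 48 == 0xffff


# Non-NULL values that cannot be kernel addresses under that heuristic.
suspicious = [name for name, value in ptr_fields.items()
              if value != 0 and not looks_like_kernel_pointer(value)]
print("not valid pointers:", suspicious)

# li_ail.next points just in front of the item itself.
print("item address - li_ail.next =", ITEM_ADDR - ptr_fields["li_ail.next"])

# Reinterpret li_seq as an unsigned 64-bit value.
li_seq_raw = li_seq & MASK64
print("li_seq as hex:", hex(li_seq_raw))
print("li_seq looks like a pointer:", looks_like_kernel_pointer(li_seq_raw))
```

Read as an unsigned 64-bit value, li_seq is 0xffff880832d71840, which looks 
like a kernel address rather than a sequence number, and li_ail.next sits 
8 bytes below the address of the entry itself.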

This is a single data point which could be anything including bad 
hardware. I will continue to traverse this list until I can get the 
other box to hang. If someone wants to traverse their xfs_log_item list ...

--Mark Tinguely.


