Re: Still seeing hangs in xlog_grant_log_space

To: Chris J Arges <chris.j.arges@xxxxxxxxxxxxx>
Subject: Re: Still seeing hangs in xlog_grant_log_space
From: Mark Tinguely <tinguely@xxxxxxx>
Date: Wed, 16 May 2012 16:29:01 -0500
Cc: Ben Myers <bpm@xxxxxxx>, linux-xfs@xxxxxxxxxxx
In-reply-to: <4FB3FA1D.6050102@xxxxxxxxxxxxx>
References: <20120426224412.GA9541@dastard> <CADLDEKs6oMDA-6OhmcFxyRoBVpduKtSput=53TQGn9NCAOXC1Q@xxxxxxxxxxxxxx> <20120426230738.GB9541@dastard> <CADLDEKuKLeYiqhQW0E9g_bS0VXoxPGPOck3N004Pxg4_Opbzow@xxxxxxxxxxxxxx> <20120427110922.GF9541@dastard> <CADLDEKtUHAGcOPT1jtcvyJVk+zsoL5_thYFtHJYs+w=6EGuVSA@xxxxxxxxxxxxxx> <CADLDEKs4YbNzj2c0HKHwSdUfKy0efdQRe1rOsWDkWUgd+BOGHw@xxxxxxxxxxxxxx> <20120507171908.GA16881@xxxxxxx> <CADLDEKvgT_FcGhJKoPaQv0mh_Jqdaqu8SYatc9xxU7vOY217YQ@xxxxxxxxxxxxxx> <loom.20120510T180646-433@xxxxxxxxxxxxxx> <20120516184231.GK16099@xxxxxxx> <4FB3FA1D.6050102@xxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:9.0) Gecko/20120122 Thunderbird/9.0

On 05/16/12 14:03, Chris J Arges wrote:
> On 05/16/2012 01:42 PM, Ben Myers wrote:
>> Hey Chris,
>>
>> On Thu, May 10, 2012 at 04:11:27PM +0000, Chris J Arges wrote:
>>>> Canonical attached them to the bug report that they filed yesterday:
>>>
>>> I am able to reproduce this bug with the instructions posted in this bug. Let me
>>> know what I can do to help.
>>
>> The bug shows:
>>
>> |This has been tested on the following kernels which all exhibit the same
>> |- 3.2.0-24 (Ubuntu Precise)
>> |- 3.4.0-rc4
>> |- 3.0.29
>> |- 3.1.10
>> |- 3.2.15
>> |- 3.3.2
>>
>> Can you find an older kernel that isn't broken?
>
> Sure, I can start digging further back.
> Also 2.6.38-8-server was the original version that this bug was reported
> on. So I can try testing circa 2.6.32 to see if that also fails.


What I know so far:
I have added a log cleaner kicker to xlog_grant_head_wake(). At best, this kicker avoids waiting for the next sync before the log cleaner starts.

I have one machine that has been running for 2 days without hanging. Actually, now I would prefer it to hurry up and hang.

Here is what I see on the machine that is hung:

A few processes (4-5) are hung waiting to get space on the log. There isn't enough free space on the log for the first transaction, so it waits. All other processes have to wait behind the first process. 251,811 bytes of the original 589,842 bytes should still be free (if my hand calculation of the free space is correct).

The AIL is empty. There is nothing to clean. Any new transaction at this point will kick the cleaner, and it still can't start the first waiter, so it joins the wait list.
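
To illustrate why everything queues behind the first waiter, here is a minimal userspace sketch of a FIFO wake loop. It is a simplified model, not the kernel's xlog_grant_head_wake(); struct waiter, wake_waiters() and the byte counts passed to it are hypothetical names and values chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified waiter: only the log space it is asking for. */
struct waiter {
    int need_bytes;
    int woken;
};

/*
 * Walk the queue in FIFO order and wake waiters while space lasts.
 * The walk stops at the first waiter that does not fit, so a large
 * request at the head blocks every smaller request queued behind it.
 * Returns the number of waiters woken.
 */
static size_t wake_waiters(struct waiter *q, size_t n, int *free_bytes)
{
    size_t woken = 0;

    for (size_t i = 0; i < n; i++) {
        if (*free_bytes < q[i].need_bytes)
            break;                      /* head of line does not fit */
        *free_bytes -= q[i].need_bytes;
        q[i].woken = 1;
        woken++;
    }
    return woken;
}
```

In this model, if the head waiter asks for more than the free space, nobody is woken, even when the requests behind it would fit.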

The only XFS traffic at this point is the inode reclaim worker. This is to be expected.

The CIL has entries, but nothing is waiting on the CIL: xc_current_sequence = 117511 and xc_push_seq = 117510. So there is nothing for the CIL worker to do.

117511 is the largest sequence number that I have found so far in the xfs_log_item list. There are a few entries with smaller sequence numbers and the following strange entry:

77th entry in the xfs_log_item list:

crash> struct xfs_log_item ffff88083222b5b8
struct xfs_log_item {
  li_ail = {
    next = 0xffff88083222b5b0,
    prev = 0x0
  },
  li_lsn = 0,
  li_desc = 0x9f5d9f5d,
  li_mountp = 0xffff88083116e300,
  li_ailp = 0x0,
  li_type = 0,
  li_flags = 0,
  li_bio_list = 0x0,
  li_cb = 0,
  li_ops = 0xffff88083105de00,
  li_cil = {
    next = 0xffff880832ad9f08,
    prev = 0xffff880831751448
  },
  li_lv = 0xc788c788,
  li_seq = -131906182637504
}

Everything in this entry is bad except li_cil.next and li_cil.prev. It looks like li_ail.next is really part of a list that starts at 0xffff88083222b5b0. The best explanation is that a junk address was inserted into the li_cil chain.
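
One way to look at the li_seq value in the dump: reinterpreted as an unsigned 64-bit number it is 0xffff880832d71840, which sits in the same range as the pointers elsewhere in the entry. That may suggest the slot holds an address rather than a sequence number, though this is only an inference. A minimal sketch of the conversion follows; as_unsigned64() and has_ffff_prefix() are hypothetical helper names, and the 0xffff prefix test is just a rough heuristic matching the pointers shown above.

```c
#include <stdint.h>

/* Reinterpret a signed 64-bit value as unsigned (two's complement). */
static uint64_t as_unsigned64(int64_t v)
{
    return (uint64_t)v;
}

/*
 * Rough heuristic (an assumption, not a kernel rule): report whether
 * the top 16 bits are all ones, like the pointers in the dump above.
 */
static int has_ffff_prefix(uint64_t v)
{
    return (v >> 48) == 0xffffULL;
}
```

By the same test, li_desc (0x9f5d9f5d) and li_lv (0xc788c788) do not look like addresses from that range.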

This is a single data point which could be anything, including bad hardware. I will continue to traverse this list until I can get the other box to hang. If someone wants to traverse their xfs_log_item list ...
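
For anyone walking such a list by hand, the basic sanity check is that each node's next pointer leads to a node whose prev pointer leads back. The sketch below shows that check on a generic circular doubly linked list in userspace; struct list_node and first_bad_link() are hypothetical names, not the kernel's list_head API or a crash command.

```c
#include <stddef.h>

/* Hypothetical stand-in for a doubly linked list node. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/*
 * Walk a circular list from head, checking that n->next->prev == n.
 * Returns the index of the first node with a broken forward link,
 * -1 if the walk returns to head with every link consistent, or
 * -2 if max_nodes were visited without getting back to head.
 */
static long first_bad_link(const struct list_node *head, long max_nodes)
{
    const struct list_node *n = head;

    for (long i = 0; i < max_nodes; i++) {
        if (n->next == NULL || n->next->prev != n)
            return i;
        n = n->next;
        if (n == head)
            return -1;
    }
    return -2;
}
```

The max_nodes bound is there so that a corrupted chain that never returns to its head cannot make the walk loop forever.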

--Mark Tinguely.