Re: Still seeing hangs in xlog_grant_log_space

To: Peter Watkins <treestem@xxxxxxxxx>
Subject: Re: Still seeing hangs in xlog_grant_log_space
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 6 Jun 2012 09:54:47 +1000
Cc: Juerg Haefliger <juergh@xxxxxxxxx>, bpm@xxxxxxx, xfs@xxxxxxxxxxx
In-reply-to: <CAH4wwdFu7DEkHFZ5Bf7_PtLPsG0hUyUDoov03q=82R6t+QkERg@xxxxxxxxxxxxxx>
References: <CAH4wwdGWHSZoveLJMxu5pjr22NEEeW7oG8TS+snoM8RY=ZeRmg@xxxxxxxxxxxxxx> <CADLDEKsGtsw-rrSOE7gY4T81u+p41b34ixv0B7Dh07afJ73n2w@xxxxxxxxxxxxxx> <CAH4wwdFu7DEkHFZ5Bf7_PtLPsG0hUyUDoov03q=82R6t+QkERg@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Fri, May 25, 2012 at 01:03:04PM -0400, Peter Watkins wrote:
> On Fri, May 25, 2012 at 2:28 AM, Juerg Haefliger <juergh@xxxxxxxxx> wrote:
> >> Does your kernel have the effect of
> >>
> >> 0bf6a5bd4b55b466964ead6fa566d8f346a828ee xfs: convert the xfsaild
> >> thread to a workqueue
> >
> > No.
> >
> >
> >> c7eead1e118fb7e34ee8f5063c3c090c054c3820 xfs: revert to using a
> >> kthread for AIL pushing
> >
> > No.
> >
> >
> >> In particular, is this code in xfs_trans_ail_push:
> >>
> >>       smp_wmb();
> >>       xfs_trans_ail_copy_lsn(ailp, &ailp->xa_target, &threshold_lsn);
> >>       smp_wmb();
> >
> > No. xfs_trans_ail_push looks like this:
> >
> > void
> > xfs_trans_ail_push(
> >        struct xfs_ail  *ailp,
> >        xfs_lsn_t       threshold_lsn)
> > {
> >        xfs_log_item_t  *lip;
> >
> >        lip = xfs_ail_min(ailp);
> >        if (lip && !XFS_FORCED_SHUTDOWN(ailp->xa_mount)) {
> >                if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0)
> >                        xfsaild_wakeup(ailp, threshold_lsn);
> >        }
> > }
> >
> >
> > FWIW, the XFS driver in my kernel is identical to the vanilla 2.6.38
> > driver. I'm still trying to get an XFS trace from a production hang. I
> > do have a crash dump from a production machine with /tmp hanging.
> > Would it be helpful to share that dump?
> >
> > ...Juerg
> 
> It looks like the combined effect of those patches, perhaps the write
> barriers, fixes one log space hang. That problem exists in 2.6.38.

There have been a huge number of fixes for these problems since
2.6.38. It doesn't help us at all to do any more testing on 2.6.38,
especially as that kernel is not supported, and I'd suggest that you
migrate production off it sooner rather than later.

> Reading bug #922 I see your test case reproduces in recent kernels, so
> there must be a newer problem also.

Right, that's what we need to find - it appears to be a CIL
stall/accounting leak, completely unrelated to all the other AIL/log
space stalls that have been occurring. The last I knew, I was
waiting for more information on the stall that Mark T @ sgi was able
to reproduce. I haven't heard anything from him since I asked for
more information on May 23....

> I find the reproducer the most useful, so no need to upload the dump.

At this point, running on a 3.5-rc1 kernel is what we need to get
working reliably. Once we have the problems solved there, we can
work out which set of patches needs to be backported to 3.0-stable and
other kernels to fix the problems in those supported kernels...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
