[Top] [All Lists]

Re: xfs sb_internal#2 lockdep splat

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: xfs sb_internal#2 lockdep splat
From: Jan Kara <jack@xxxxxxx>
Date: Mon, 3 Sep 2012 12:18:54 +0200
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, Sage Weil <sage@xxxxxxxxxxx>, Jan Kara <jack@xxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20120902003731.GK15292@dastard>
References: <alpine.DEB.2.00.1208311318140.19947@xxxxxxxxxxxxxxxxxx> <20120901230425.GA6896@xxxxxxxxxxxxx> <20120902003731.GK15292@dastard>
User-agent: Mutt/1.5.20 (2009-06-14)
On Sun 02-09-12 10:37:31, Dave Chinner wrote:
> On Sat, Sep 01, 2012 at 07:04:26PM -0400, Christoph Hellwig wrote:
> > I've had some time to look at this issue and it seems to be due to the
> > brand new filesystem freezing code in the VFS which (ab-)uses lockdep
> > in a creative way.
> Yes. It's interacting strangely with other lockdep abuses like the
> work queue annotations, which is where this one is coming from.
> Basically, freeze counters and work queue flushes are *not locks*,
> but lockdep is being told they are locks and it's not smart enough
> to know the difference. It just sees different orders in different
> contexts and complains.
  Well, freeze counters are locks (if we didn't care for performance they
would be rw semaphores) but their usage is rather specific so not everything
lockdep thinks is possible is really possible. But that has always been the
problem with lockdep and more unusual lock usage.
> > In short the XFS code to flush pending delalloc data when running into
> > ENOSPC conditions.  I don't understand the fsfreeze code and its usage
> > of lockdep enough to confirm if the warning is correct, but Dave has
> > patches to rip this code path out in the current from and replace it
> > with the VM layer code used by ext4 and btrfs.  I suspect that should
> > sort out this issue.
> Right, after a couple of days of thinking, I think the root of the
> issue is that lockdep sees that the data writeback done by the
> xfs_flush_worker can run transactions, but does so while holding an
> open transaction. So we have a wait synchronous with an open
> transaction that is dependent on other transactions being started.
> However, this is a false positive warning because of freeze
> behaviour (context) that lockdep is not aware of and cannot be told
> about. That is, the inode flush will always be completed before the
> freeze can progress past the FREEZE_WRITE stage, and the
> transactions are in the FREEZE_FS context. IOWs:
> > > [23405.638393]  Possible unsafe locking scenario:
> > > [23405.638393] 
> > > [23405.638394]        CPU0                    CPU1
> > > [23405.638394]        ----                    ----
> > > [23405.638396]   lock(sb_internal#2);
> > > [23405.638398]                                lock((&mp->m_flush_work));
> > > [23405.638400]                                lock(sb_internal#2);
> > > [23405.638402]   lock((&mp->m_flush_work));
> ignores the fact the CPU0 lock order is actually:
>       lock(sb_internal#0);            << FREEZE_WRITE
>       lock(sb_internal#2);            << FREEZE_FS
>       lock((&mp->m_flush_work));
> And that it is safe to do any sort of dependent lock of
> sb_internal#2 while a sb_internal#0 lock is held and that nested
> dependent sb_internal#2 lock levels is also safe for the same
> reason. Inversions/nesting of FREEZE_FS only matter outside
> FREEZE_WRITE/FREEZE_PAGECACHE context, not when they are inside that
> context.
> Indeed, changing the code to use the VFS flush functions doesn't
> change this at all - we still block waiting for IO submission and
> hence allocation transactions whilst holding the same "locks". We'll
> just avoid lockdep false positives because the VFS writeback code
> doesn't use the workqueue infrastructure and hence has no lockdep
> annotations to make waiting for the IO submission look like a
> lock...
  Yeah. If locking annotations for freeze locks show up doing more bad than
good we can remove them (at least at innermost level which seems to be
the most problematic for XFS) but they caught quite some problems when I was
developing the code so I believe there is a value in them.

Jan Kara <jack@xxxxxxx>

<Prev in Thread] Current Thread [Next in Thread>