xfs
[Top] [All Lists]

Re: XFS: Assertion failed: bp->b_bn != XFS_BUF_DADDR_NULL, file: fs/xfs/

To: Rafał Kupka <kupson@xxxxxxxxxx>
Subject: Re: XFS: Assertion failed: bp->b_bn != XFS_BUF_DADDR_NULL, file: fs/xfs/xfs_buf.c, line: 598
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 27 Mar 2012 21:17:50 +1100
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20120327032843.GV5091@dastard>
References: <1330539626.7615.73.camel@xxxxxxxxxxxxxxxx> <20120302104348.GA22230@xxxxxxxxxxxxx> <1330774627.32457.24.camel@xxxxxxxxxxxxxxxx> <20120303114413.GA18748@xxxxxxxxxxxxx> <1330983612.28645.18.camel@xxxxxxxxxxxxxxxx> <20120306203020.GG3592@dastard> <1331233817.20679.18.camel@xxxxxxxxxxxxxxxx> <1332777925.14696.194.camel@xxxxxxxxxxxxxxxx> <20120327032843.GV5091@dastard>
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, Mar 27, 2012 at 02:28:43PM +1100, Dave Chinner wrote:
> On Mon, Mar 26, 2012 at 06:05:25PM +0200, Rafał Kupka wrote:
> > On Thu, 2012-03-08 at 20:10 +0100, Rafał Kupka wrote: 
> > > On Wed, 2012-03-07 at 07:30 +1100, Dave Chinner wrote:
> > 
> > Hi,
> > 
> > Finally I've captured this event trace. Hope it will help.
> > 
> > > > Also, it might be worthwhile to capture an event trace (e.g. all the
> > > > xfs_buf events) to see what occurs just before the error is
> > > > triggered. That might tell us exactl what sequence has occurred
> > > > leading up to the error.
> > > 
> > > I will try to provide this information in a few days.
> > 
> > Kernel version - Linus tree up to:
> > 0e6e847f "xfs: stop using the page cache to back the buffer cache"
> > 
> > It's uniprocessor virtual machine (KVM) with 384 MB RAM.
> 
> Ok, so limited memory and no parallelism. Can you send me you
> .config so I can have a look at other various config options?
> 
> > Full event trace is available at https://gist.github.com/2206109
> > (sorry for it's *.perl extension, my mistake).
> > 
> > Kernel log entry:
> > [  495.197571] XFS (vda2): I/O error occurred: meta-data dev vda2 block 
> > 0x13d80       ("xfs_trans_read_buf") error 5 buf count 8192
> > [  495.198283] XFS (vda2): xfs_imap_to_bp: xfs_trans_read_buf() returned 
> > error 5.
> > [  501.540477] XFS (vda2): I/O error occurred: meta-data dev vda2 block 
> > 0x13d80       ("xfs_trans_read_buf") error 5 buf count 8192
> > [  501.543047] XFS (vda2): xfs_imap_to_bp: xfs_trans_read_buf() returned 
> > error 5.
> > 
> > It's possible that this "I/O error" is caused by memory pressure?
> > 
> > Related trace line:
> > xfsdump-2498  [000] 24195.782269: xfs_buf_ioerror: dev 254:2 bno 0x13d80 
> > len 0x2000 hold 3 pincount 0 lock 0 error 5 flags READ|TRYLOCK|PAGES caller 
> > _xfs_buf_ioapply
> > 
> > Events related to "bno 0x13d80":
> > xfsdump-2498  [000] 24195.782266: xfs_buf_get: dev 254:2 bno 0x13d80 len 
> > 0x2000 hold 2 pincount 0 lock 0 flags READ|LOCK|DONT_BLOCK caller 
> > xfs_buf_read
> > xfsdump-2498  [000] 24195.782266: xfs_buf_read: dev 254:2 bno 0x13d80 len 
> > 0x2000 hold 2 pincount 0 lock 0 flags READ|LOCK|DONT_BLOCK caller 
> > xfs_trans_read_buf
> > xfsdump-2498  [000] 24195.782267: xfs_buf_iorequest: dev 254:2 bno 0x13d80 
> > len 0x2000 hold 2 pincount 0 lock 0 flags READ|TRYLOCK|PAGES caller 
> > _xfs_buf_read
> > xfsdump-2498  [000] 24195.782267: xfs_buf_hold: dev 254:2 bno 0x13d80 len 
> > 0x2000 hold 2 pincount 0 lock 0 flags READ|TRYLOCK|PAGES caller 
> > xfs_buf_iorequest
> > xfsdump-2498  [000] 24195.782269: xfs_buf_ioerror: dev 254:2 bno 0x13d80 
> > len 0x2000 hold 3 pincount 0 lock 0 error 5 flags READ|TRYLOCK|PAGES caller 
> > _xfs_buf_ioapply
> > xfsdump-2498  [000] 24195.782270: xfs_buf_rele: dev 254:2 bno 0x13d80 len 
> > 0x2000 hold 3 pincount 0 lock 0 flags READ|TRYLOCK|PAGES caller 
> > xfs_buf_iorequest
> > xfsdump-2498  [000] 24195.782985: xfs_buf_unlock: dev 254:2 bno 0x13d80 len 
> > 0x2000 hold 2 pincount 0 lock 1 flags READ|TRYLOCK|PAGES caller 
> > xfs_buf_relse
> > xfsdump-2498  [000] 24195.782986: xfs_buf_rele: dev 254:2 bno 0x13d80 len 
> > 0x2000 hold 2 pincount 0 lock 1 flags READ|TRYLOCK|PAGES caller 
> > xfs_trans_read_buf
> > 
> > Do this data put some light on this issue?
> 
> Yes, that it was errored out because the bio that was built from the
> buffer was zero length. That means the buffer was not set up
> properly in the first place. Hmmm, this is unusual - the two lines
> above your traces:
> 
> xfs_buf_cond_lock: dev 254:2 bno 0xffffffffffffffff len 0x2000 hold 2 
> pincount 0 lock 0 flags READ|ASYNC|TRYLOCK|PAGES caller _xfs_buf_find
> xfs_buf_find: dev 254:2 bno 0xffffffffffffffff len 0x2000 hold 2 pincount 0 
> lock 0 flags READ|LOCK|DONT_BLOCK caller xfs_buf_get
> xfs_buf_get: dev 254:2 bno 0x13d80 len 0x2000 hold 2 pincount 0 lock 0 flags 
> READ|LOCK|DONT_BLOCK caller xfs_buf_read
> xfs_buf_read: dev 254:2 bno 0x13d80 len 0x2000 hold 2 pincount 0 lock 0 flags 
> READ|LOCK|DONT_BLOCK caller xfs_trans_read_buf
> 
> The lookup found an uninitialised buffer in the cache. That's
> exceedingly curious - the match is done on bp->b_file_offset, and
> after it has been inserted it the bno is then filled out.
> 
> Oh, wow, I think I see the bug. The key is the initial state of that
> uninitialised buffer: READ|ASYNC|TRYLOCK|PAGES, and the state of
> the it from the xfs_buf_get() trace READ|LOCK|DONT_BLOCK. Those two
> sets of flags do not match up - the initial set indicate the buffer
> has been inserted into the cache for readahead, the second set
> indicate a blocking read. 
> 
> You've only got a small amount of memory, so I suspect that a memory
> allocation has failed in a specific place to cause this. I'm not
> exactly sure how the second thread trips over it and initialises it
> somewhat (there's 3 initialisation steps to a buffer) but the
> initial failure is probably this:
> 
>       async readhead started
>       _xfs_buf_find fails
>       allocate new buffer A
>       partially initialise A
>       _xfs_buf_find
>          fail to find match
>          insert buffer A into rbtree
>          returns buffer A
>       A == A
>          fails allocate pages to buffer
>       unlock buffer
>       release buffer
> 
>       <uninitialised buffer now sitting in cache. THIS IS WRONG!>
> 
> 
> That would lead to a second lookup finding it and using it, but I'm
> still not quite sure how the buffer bno gets initialised. I can
> construct valid scenarios that will cause it to happen, but none
> match the event trace you've posted.
> 
> And just looking at the way we clean up this error, this race
> condition could lead to a use after free if we immediately free
> if we fail to allocate the pages to the buffer. Messy. I'll have to
> think on this a bit - maybe a "broken" state is necessary.

So here's a patch that implements a "broken" state. It's not pretty,
but I think it will fix the race you look to be hitting.

Can you run your test with this patch while running a trace for the
new xfs_buf_*broken events to verify you are hitting that code and
that it doesn't trigger EIO problems anymore?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: fix buffer lookup race on allocation failure

From: Dave Chinner <dchinner@xxxxxxxxxx>

When memory allocation fails to add the page array or tht epages to
a buffer during xfs_buf_get(), the buffer is left in the cache in a
partially initialised state. There is enough state left for the next
lookup on that buffer to find the buffer, and for the buffer to then
be used without finishing the initialisation.  As a result, when an
attempt to do IO on the buffer occurs, it fails with EIO because
there are no pages attached to the buffer.

We cannot remove the buffer from the cache immediately and free it,
because there may already be a racing lookup that is blocked on the
buffer lock. hence the moment we unlock the buffer to then free it,
the other user is woken and we have a use-after-free situation.

Hence we have to mark the buffer as "broken" and check that after we
have gained the buffer lock on a cache hit lookup. This enables
racing lookups to avoid the broken buffer and drop their references,
allowing the buffer to be freed.

This however, doesn't solve the problem completely - there may be a
delay in the buffer getting freed (e.g. pre-emption), so when we try
the lookup a second time with a new buffer to insert into the tree,
if we find the broken buffer again, drop the buffer lock, sleep for
a short while, and try the lookup again. When the broken bufer is
finally removed from the cache we will make forwards progress.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_buf.c   |   33 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_buf.h   |    2 ++
 fs/xfs/xfs_trace.h |    2 ++
 3 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6819b51..3588460 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -441,6 +441,7 @@ _xfs_buf_find(
        ASSERT(!(range_base & (xfs_off_t)btp->bt_smask));
 
        /* get tree root */
+lookup_again:
        pag = xfs_perag_get(btp->bt_mount,
                                xfs_daddr_to_agno(btp->bt_mount, ioff));
 
@@ -505,6 +506,24 @@ found:
        }
 
        /*
+        * If the buffer initialisation was not completed correctly after
+        * insertion into the cache we can't use this buffer and we are pinning
+        * it preventing it from being removed from the cache. Hence just avoid
+        * the buffer if it is broken, allowing it to be reclaimed and removed
+        * from the cache. If we are here with a new buffer to insert into the
+        * cache, we want to try the lookup again so that the expected "insert
+        * on lookup failure" semantics are preserved for this case.
+        */
+       if (bp->b_flags & XBF_BROKEN) {
+               trace_xfs_buf_find_broken(bp, flags, _RET_IP_);
+               xfs_buf_relse(bp);
+               if (!new_bp)
+                       return NULL;
+               delay(1);
+               goto lookup_again;
+       }
+
+       /*
         * if the buffer is stale, clear all the external state associated with
         * it. We need to keep flags such as how we allocated the buffer memory
         * intact here.
@@ -552,8 +571,20 @@ xfs_buf_get(
 
        if (bp == new_bp) {
                error = xfs_buf_allocate_memory(bp, flags);
-               if (error)
+               if (error) {
+                       /*
+                        * The buffer has been inserted into the cache now, so
+                        * will be visible to other callers. Once we unlock the
+                        * buffer, someone else can grab it out of the tree.
+                        * It's a broken buffer, so we have to ensure that it is
+                        * noticed - just freeing the buffer here can result in
+                        * use-after-free if someone is already waiting on the
+                        * buffer lock.
+                        */
+                       bp->b_flags |= XBF_BROKEN;
+                       trace_xfs_buf_get_broken(bp, flags, _RET_IP_);
                        goto no_buffer;
+               }
        } else
                kmem_zone_free(xfs_buf_zone, new_bp);
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index df7ffb0..b4545b9 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -52,6 +52,7 @@ typedef enum {
 #define XBF_DONE       (1 << 5) /* all pages in the buffer uptodate */
 #define XBF_DELWRI     (1 << 6) /* buffer has dirty pages */
 #define XBF_STALE      (1 << 7) /* buffer has been staled, do not find it */
+#define XBF_BROKEN     (1 << 8) /* buffer is broken, do not find it */
 
 /* I/O hints for the BIO layer */
 #define XBF_SYNCIO     (1 << 10)/* treat this buffer as synchronous I/O */
@@ -79,6 +80,7 @@ typedef unsigned int xfs_buf_flags_t;
        { XBF_DONE,             "DONE" }, \
        { XBF_DELWRI,           "DELWRI" }, \
        { XBF_STALE,            "STALE" }, \
+       { XBF_BROKEN,           "BROKEN" }, \
        { XBF_SYNCIO,           "SYNCIO" }, \
        { XBF_FUA,              "FUA" }, \
        { XBF_FLUSH,            "FLUSH" }, \
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 270775f..e6f7d75 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -386,7 +386,9 @@ DEFINE_EVENT(xfs_buf_flags_class, name, \
        TP_PROTO(struct xfs_buf *bp, unsigned flags, unsigned long caller_ip), \
        TP_ARGS(bp, flags, caller_ip))
 DEFINE_BUF_FLAGS_EVENT(xfs_buf_find);
+DEFINE_BUF_FLAGS_EVENT(xfs_buf_find_broken);
 DEFINE_BUF_FLAGS_EVENT(xfs_buf_get);
+DEFINE_BUF_FLAGS_EVENT(xfs_buf_get_broken);
 DEFINE_BUF_FLAGS_EVENT(xfs_buf_read);
 
 TRACE_EVENT(xfs_buf_ioerror,

<Prev in Thread] Current Thread [Next in Thread>