<div dir="ltr">Thanks Dave.<div>This fixed the umount problem I was seeing.</div><div>Tried it a few thousand times. Works great.</div><div>Bob</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Apr 9, 2014 at 10:42 PM, Dave Chinner <span dir="ltr"><<a href="mailto:david@fromorbit.com" target="_blank">david@fromorbit.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">From: Dave Chinner <<a href="mailto:dchinner@redhat.com">dchinner@redhat.com</a>><br>
<br>
And interesting situation can occur if a log IO error occurs during<br>
the unmount of a filesystem. The cases reported have the same<br>
signature - the update of the superblock counters fails due to a log<br>
write IO error:<br>
<br>
XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1<br>
XFS (dm-16): Log I/O Error Detected. Shutting down filesystem<br>
XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.<br>
XFS (dm-16): xfs_log_force: error 5 returned.<br>
XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)<br>
<br>
It can be seen that the last line of output contains a corrupt<br>
device name - this is because the log and xfs_mount structures have<br>
already been freed by the time this message is printed. A kernel<br>
oops closely follows.<br>
<br>
The issue is that the shutdown is occurring in a separate IO<br>
completion thread to the unmount. Once the shutdown processing has<br>
started and all the iclogs are marked with XLOG_STATE_IOERROR, the<br>
log shutdown code wakes anyone waiting on a log force so they can<br>
process the shutdown error. This wakes up the unmount code that<br>
is doing a synchronous transaction to update the superblock<br>
counters.<br>
<br>
The unmount path now sees all the iclogs are marked with<br>
XLOG_STATE_IOERROR and so never waits on them again, knowing that if<br>
it does, there will not be a wakeup trigger for it and we will hang<br>
the unmount if we do. Hence the unmount runs through all the<br>
remaining code and frees all the filesystem structures while the<br>
xlog_iodone() is still processing the shutdown. When the log<br>
shutdown processing completes, xfs_do_force_shutdown() emits the<br>
"Please umount the filesystem and rectify the problem(s)" message,<br>
and xlog_iodone() then aborts all the objects attached to the iclog.<br>
An iclog that has already been freed....<br>
<br>
The real issue here is that there is no serialisation point between<br>
the log IO and the unmount. We have serialisations points for log<br>
writes, log forces, reservations, etc, but we don't actually have<br>
any code that wakes for log IO to fully complete. We do that for all<br>
other types of object, so why not iclogbufs?<br>
<br>
Well, it turns out that we can easily do this. We've got xfs_buf<br>
handles, and that's what everyone else uses for IO serialisation.<br>
i.e. bp->b_sema. So, lets hold iclogbufs locked over IO, and only<br>
release the lock in xlog_iodone() when we are finished with the<br>
buffer. That way before we tear down the iclog, we can lock and<br>
unlock the buffer to ensure IO completion has finished completely<br>
before we tear it down.<br>
<br>
Signed-off-by: Dave Chinner <<a href="mailto:dchinner@redhat.com">dchinner@redhat.com</a>><br>
---<br>
fs/xfs/xfs_log.c | 53 ++++++++++++++++++++++++++++++++++++++++++++---------<br>
1 file changed, 44 insertions(+), 9 deletions(-)<br>
<br>
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c<br>
index 8497a00..08624dc 100644<br>
--- a/fs/xfs/xfs_log.c<br>
+++ b/fs/xfs/xfs_log.c<br>
@@ -1181,11 +1181,14 @@ xlog_iodone(xfs_buf_t *bp)<br>
/* log I/O is always issued ASYNC */<br>
ASSERT(XFS_BUF_ISASYNC(bp));<br>
xlog_state_done_syncing(iclog, aborted);<br>
+<br>
/*<br>
- * do not reference the buffer (bp) here as we could race<br>
- * with it being freed after writing the unmount record to the<br>
- * log.<br>
+ * drop the buffer lock now that we are done. Nothing references<br>
+ * the buffer after this, so an unmount waiting on this lock can now<br>
+ * tear it down safely. As such, it is unsafe to reference the buffer<br>
+ * (bp) after the unlock as we could race with it being freed.<br>
*/<br>
+ xfs_buf_unlock(bp);<br>
}<br>
<br>
/*<br>
@@ -1368,8 +1371,16 @@ xlog_alloc_log(<br>
bp = xfs_buf_alloc(mp->m_logdev_targp, 0, BTOBB(log->l_iclog_size), 0);<br>
if (!bp)<br>
goto out_free_log;<br>
- bp->b_iodone = xlog_iodone;<br>
+<br>
+ /*<br>
+ * The iclogbuf buffer locks are held over IO but we are not going to do<br>
+ * IO yet. Hence unlock the buffer so that the log IO path can grab it<br>
+ * when appropriately.<br>
+ */<br>
ASSERT(xfs_buf_islocked(bp));<br>
+ xfs_buf_unlock(bp);<br>
+<br>
+ bp->b_iodone = xlog_iodone;<br>
log->l_xbuf = bp;<br>
<br>
spin_lock_init(&log->l_icloglock);<br>
@@ -1398,6 +1409,9 @@ xlog_alloc_log(<br>
if (!bp)<br>
goto out_free_iclog;<br>
<br>
+ ASSERT(xfs_buf_islocked(bp));<br>
+ xfs_buf_unlock(bp);<br>
+<br>
bp->b_iodone = xlog_iodone;<br>
iclog->ic_bp = bp;<br>
iclog->ic_data = bp->b_addr;<br>
@@ -1422,7 +1436,6 @@ xlog_alloc_log(<br>
iclog->ic_callback_tail = &(iclog->ic_callback);<br>
iclog->ic_datap = (char *)iclog->ic_data + log->l_iclog_hsize;<br>
<br>
- ASSERT(xfs_buf_islocked(iclog->ic_bp));<br>
init_waitqueue_head(&iclog->ic_force_wait);<br>
init_waitqueue_head(&iclog->ic_write_wait);<br>
<br>
@@ -1631,6 +1644,12 @@ xlog_cksum(<br>
* we transition the iclogs to IOERROR state *after* flushing all existing<br>
* iclogs to disk. This is because we don't want anymore new transactions to be<br>
* started or completed afterwards.<br>
+ *<br>
+ * We lock the iclogbufs here so that we can serialise against IO completion<br>
+ * during unmount. We might be processing a shutdown triggered during unmount,<br>
+ * and that can occur asynchronously to the unmount thread, and hence we need to<br>
+ * ensure that completes before tearing down the iclogbufs. Hence we need to<br>
+ * hold the buffer lock across the log IO to acheive that.<br>
*/<br>
STATIC int<br>
xlog_bdstrat(<br>
@@ -1638,6 +1657,7 @@ xlog_bdstrat(<br>
{<br>
struct xlog_in_core *iclog = bp->b_fspriv;<br>
<br>
+ xfs_buf_lock(bp);<br>
if (iclog->ic_state & XLOG_STATE_IOERROR) {<br>
xfs_buf_ioerror(bp, EIO);<br>
xfs_buf_stale(bp);<br>
@@ -1645,7 +1665,8 @@ xlog_bdstrat(<br>
/*<br>
* It would seem logical to return EIO here, but we rely on<br>
* the log state machine to propagate I/O errors instead of<br>
- * doing it here.<br>
+ * doing it here. Similarly, IO completion will unlock the<br>
+ * buffer, so we don't do it here.<br>
*/<br>
return 0;<br>
}<br>
@@ -1847,14 +1868,28 @@ xlog_dealloc_log(<br>
xlog_cil_destroy(log);<br>
<br>
/*<br>
- * always need to ensure that the extra buffer does not point to memory<br>
- * owned by another log buffer before we free it.<br>
+ * Cycle all the iclogbuf locks to make sure all log IO completion<br>
+ * is done before we tear down these buffers.<br>
*/<br>
+ iclog = log->l_iclog;<br>
+ for (i = 0; i < log->l_iclog_bufs; i++) {<br>
+ xfs_buf_lock(iclog->ic_bp);<br>
+ xfs_buf_unlock(iclog->ic_bp);<br>
+ iclog = iclog->ic_next;<br>
+ }<br>
+<br>
+ /*<br>
+ * Always need to ensure that the extra buffer does not point to memory<br>
+ * owned by another log buffer before we free it. Also, cycle the lock<br>
+ * first to ensure we've completed IO on it.<br>
+ */<br>
+ xfs_buf_lock(log->l_xbuf);<br>
+ xfs_buf_unlock(log->l_xbuf);<br>
xfs_buf_set_empty(log->l_xbuf, BTOBB(log->l_iclog_size));<br>
xfs_buf_free(log->l_xbuf);<br>
<br>
iclog = log->l_iclog;<br>
- for (i=0; i<log->l_iclog_bufs; i++) {<br>
+ for (i = 0; i < log->l_iclog_bufs; i++) {<br>
xfs_buf_free(iclog->ic_bp);<br>
next_iclog = iclog->ic_next;<br>
kmem_free(iclog);<br>
<span class="HOEnZb"><font color="#888888">--<br>
1.9.0<br>
<br>
</font></span></blockquote></div><br></div>