Review: freezing sometimes leaves the log dirty

To: xfs-dev@xxxxxxx
Subject: Review: freezing sometimes leaves the log dirty
From: David Chinner <dgc@xxxxxxx>
Date: Wed, 31 Jan 2007 09:03:26 +1100
Cc: xfs@xxxxxxxxxxx
Sender: xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.4.2.1i
When we freeze the filesystem on a system that is under
heavy load, the freeze can complete its flushes while there
are still transactions active. Hence the freeze completes
with a dirty log and dirty metadata buffers still in memory.

The Linux freeze path is a tangled mess - I had to go back
to the IRIX code to work out exactly what we should be doing,
and only then could I see why the Linux code was failing,
because of the convoluted paths the Linux code takes through
the generic layers.

In short, when we freeze the writes, we should not be
quiescing the filesystem at this point. All we should
be doing is a blocking data sync because we haven't shut down
the transaction subsystem yet. We also need to wait
for all direct I/O writes to complete.
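
As a rough, standalone illustration of the flag choice described above (not
the kernel code itself): the values of SYNC_QUIESCE and SYNC_DIO_WAIT are
taken from the patch below, while the values of the other flags and the
helper name pick_sync_flags() are assumptions made up for this sketch.

```c
#include <assert.h>

/* SYNC_QUIESCE and SYNC_DIO_WAIT use the values shown in the patch.
 * The other three values are ASSUMED here purely for illustration. */
#define SYNC_DELWRI     0x0004  /* assumed value */
#define SYNC_WAIT       0x0008  /* assumed value */
#define SYNC_FSDATA     0x0020  /* assumed value */
#define SYNC_QUIESCE    0x0100
#define SYNC_DIO_WAIT   0x0200

/*
 * Hypothetical helper: pick the sync flags for a sync request.
 * freezing_writes models sb->s_frozen == SB_FREEZE_WRITE.
 */
static int pick_sync_flags(int freezing_writes, int wait)
{
    int flags;

    if (freezing_writes) {
        /* Stage one of a freeze: blocking data sync plus a wait for
         * direct I/O. Transactions can still run, so no quiesce yet. */
        flags = SYNC_FSDATA | SYNC_DELWRI | SYNC_WAIT | SYNC_DIO_WAIT;
        assert(!(flags & SYNC_QUIESCE));
    } else {
        /* Ordinary sync: only block if the caller asked us to. */
        flags = SYNC_FSDATA | (wait ? SYNC_WAIT : 0);
    }
    return flags;
}
```

The point of the sketch is only that the freeze case asks for a waiting data
sync and a direct I/O drain, and never for a quiesce.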

Once the data sync is complete, we can return to the generic
code for it to freeze new transactions. Then we can wait for
all active transactions to complete before we quiesce the
filesystem, which flushes out all the dirty metadata buffers.

At this point we have a clean filesystem and an empty log,
so we can safely write the unmount record followed by a
dummy record. The dummy record dirties the log to ensure
unlinked list processing on remount if we crash or shut down
the machine while the filesystem is frozen.
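
The ordering above can be sketched as a toy model in plain C. Everything in
it (struct toy_fs, its fields and the toy_* functions) is hypothetical and
exists only to show the sequence: data sync, block new transactions, drain
active ones, quiesce metadata, then write the unmount and dummy records.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy in-memory stand-in for a mounted filesystem (hypothetical). */
struct toy_fs {
    int  dirty_data;       /* dirty data pages and in-flight direct I/O */
    int  active_trans;     /* transactions currently running */
    int  dirty_metadata;   /* dirty metadata buffers in memory */
    bool trans_blocked;    /* new transactions are being refused */
    bool unmount_written;  /* unmount record is in the log */
    bool dummy_written;    /* dummy record written after the unmount record */
};

/* Stage one: blocking data sync only. Metadata is left alone because
 * transactions can still run and would just dirty it again. */
static void toy_freeze_data(struct toy_fs *fs)
{
    fs->dirty_data = 0;
}

/* What the generic layer does between the stages: stop new transactions. */
static void toy_block_transactions(struct toy_fs *fs)
{
    fs->trans_blocked = true;
}

/* One running transaction commits and leaves dirty metadata behind. */
static void toy_commit_one(struct toy_fs *fs)
{
    if (fs->active_trans > 0) {
        fs->active_trans--;
        fs->dirty_metadata++;
    }
}

/* Stage two: drain transactions, quiesce, then write the log records.
 * Returns -1 if called before new transactions were blocked. */
static int toy_freeze_metadata(struct toy_fs *fs)
{
    if (!fs->trans_blocked)
        return -1;

    /* Stands in for polling until the active transaction count is zero. */
    while (fs->active_trans > 0)
        toy_commit_one(fs);

    fs->dirty_metadata = 0;          /* quiesce: flush metadata buffers */
    assert(fs->active_trans == 0);   /* nothing may have started meanwhile */

    fs->unmount_written = true;      /* log is clean at this point */
    fs->dummy_written = true;        /* re-dirty the log for remount recovery */
    return 0;
}
```

In this model, calling toy_freeze_metadata() before toy_block_transactions()
is refused, which mirrors why the quiesce cannot safely happen in the first
stage.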

Comments?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_super.c |   14 +++++++++++---
 fs/xfs/linux-2.6/xfs_vfs.h   |    1 +
 fs/xfs/xfs_vfsops.c          |   26 ++++++++++++++++++++++----
 3 files changed, 34 insertions(+), 7 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c     2007-01-08 14:32:40.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c  2007-01-08 22:46:12.520522391 +1100
@@ -730,9 +730,17 @@ xfs_fs_sync_super(
        int                     error;
        int                     flags;
 
-       if (unlikely(sb->s_frozen == SB_FREEZE_WRITE))
-               flags = SYNC_QUIESCE;
-       else
+       if (unlikely(sb->s_frozen == SB_FREEZE_WRITE)) {
+               /*
+                * First stage of freeze - no more writers will make progress
+                * now we are here, so we flush delwri and delalloc buffers
+                * here, then wait for all I/O to complete.  Data is frozen at
+                * that point. Metadata is not frozen, transactions can still
+                * occur here so don't bother flushing the buftarg (i.e.
+                * SYNC_QUIESCE) because it'll just get dirty again.
+                */
+               flags = SYNC_FSDATA | SYNC_DELWRI | SYNC_WAIT | SYNC_DIO_WAIT;
+       } else
                flags = SYNC_FSDATA | (wait ? SYNC_WAIT : 0);
 
        error = bhv_vfs_sync(vfsp, flags, NULL);
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vfs.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_vfs.h       2006-12-22 10:53:22.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vfs.h    2007-01-08 22:27:26.366619320 +1100
@@ -92,6 +92,7 @@ typedef enum {
 #define SYNC_REFCACHE          0x0040  /* prune some of the nfs ref cache */
 #define SYNC_REMOUNT           0x0080  /* remount readonly, no dummy LRs */
 #define SYNC_QUIESCE           0x0100  /* quiesce fileystem for a snapshot */
+#define SYNC_DIO_WAIT          0x0200  /* wait for direct I/O to complete */
 
 #define SHUTDOWN_META_IO_ERROR 0x0001  /* write attempt to metadata failed */
 #define SHUTDOWN_LOG_IO_ERROR  0x0002  /* write attempt to the log failed */
Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c      2007-01-08 20:06:55.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c   2007-01-08 23:27:54.696637946 +1100
@@ -881,6 +881,10 @@ xfs_statvfs(
  *                    this by simply making sure the log gets flushed
  *                    if SYNC_BDFLUSH is set, and by actually writing it
  *                    out otherwise.
+ *     SYNC_DIO_WAIT - The caller wants us to wait for all direct I/Os
+ *                    as well to ensure all data I/O completes before we
+ *                    return. Forms the drain side of the write barrier needed
+ *                    to safely quiesce the filesystem.
  *
  */
 /*ARGSUSED*/
@@ -892,10 +896,7 @@ xfs_sync(
 {
        xfs_mount_t     *mp = XFS_BHVTOM(bdp);
 
-       if (unlikely(flags == SYNC_QUIESCE))
-               return xfs_quiesce_fs(mp);
-       else
-               return xfs_syncsub(mp, flags, NULL);
+       return xfs_syncsub(mp, flags, NULL);
 }
 
 /*
@@ -1181,6 +1182,12 @@ xfs_sync_inodes(
                        }
 
                }
+               /*
+                * When freezing, we also need to wait for direct I/O to
+                * complete so that all data modification is finished here.
+                */
+               if (flags & SYNC_DIO_WAIT)
+                       vn_iowait(vp);
 
                if (flags & SYNC_BDFLUSH) {
                        if ((flags & SYNC_ATTR) &&
@@ -1959,15 +1966,26 @@ xfs_showargs(
        return 0;
 }
 
+/*
+ * Second stage of a freeze. The data is already frozen, now we have to take
+ * care of the metadata. New transactions are already blocked, so we need to
+ * wait for any remaining transactions to drain out before proceeding.
+ */
 STATIC void
 xfs_freeze(
        bhv_desc_t      *bdp)
 {
        xfs_mount_t     *mp = XFS_BHVTOM(bdp);
 
+       /* wait for all modifications to complete */
        while (atomic_read(&mp->m_active_trans) > 0)
                delay(100);
 
+       /* flush inodes and push all remaining buffers out to disk */
+       xfs_quiesce_fs(mp);
+
+       BUG_ON(atomic_read(&mp->m_active_trans) > 0);
+
        /* Push the superblock and write an unmount record */
        xfs_log_unmount_write(mp);
        xfs_unmountfs_writesb(mp);

