
Re: Extreme fragmentation when backing up via NFS

To: karn@xxxxxxxx
Subject: Re: Extreme fragmentation when backing up via NFS
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Fri, 14 Jan 2011 15:51:26 +1100
Cc: xfs@xxxxxxxxxxx
In-reply-to: <4D2EFB08.90502@xxxxxxxxxxxx>
References: <4D0C9A4F.4040108@xxxxxxxxxxx> <20101220001024.GH5193@dastard> <4D0EC5C5.2070407@xxxxxxxxxxx> <20101220045126.GK5193@dastard> <20101220105547.4f9e7218@xxxxxxxxxxxxxx> <4D2E3B38.6010506@xxxxxxxxxxx> <4D2EFB08.90502@xxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Thu, Jan 13, 2011 at 05:15:52AM -0800, Phil Karn wrote:
> I have been backing up my main Linux server onto a secondary machine via
> NFS. I use xfsdump like this:
> xfsdump -l 9 -f /machine/backups/fs.9.xfsdump /
> Over on the server machine, xfs_bmap shows an *extreme* amount of
> fragmentation in the backup file. 20,000+ extents are not uncommon, with
> many extents consisting of a single allocation block (8x 512B sectors or
> 4KB).
> I do notice while the backup file is being written that holes often
> appear in the extent map towards the end of the file. I theorize that
> somehow the individual writes are going to the file system out of order,
> and this causes both the temporary holes and the extreme fragmentation.
> I'm able to work around the fragmentation manually by looking at the
> estimate from xfsdump of the size of the backup and then using the
> fallocate command locally on the file server to allocate more than that
> amount of space to the backup file. When the backup is done, I look at
> xfsdump's report of the actual size of the backup file and use the
> truncate command locally on the server to trim off the excess.
> Is fragmentation on XFS via NFS a known problem?

Yes, and it's caused by the way the NFS server uses the VFS. These
commits, which have just hit mainline in the 2.6.38-rc1 merge window,
should mostly fix the problem:

6e85756 xfs: don't truncate prealloc from frequently accessed inodes
055388a xfs: dynamic speculative EOF preallocation

It would be good to know if they
really do fix your problem or not, because you are suffering from
exactly the problem they are supposed to fix. I've copied the
commit messages below so I don't have to spend time explaining the
problem or the fix. :)


Dave Chinner

commit 6e857567dbbfe14dd6cc3f7414671b047b1ff5c7
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Dec 23 12:02:31 2010 +1100

    xfs: don't truncate prealloc from frequently accessed inodes
    A long standing problem for streaming writes through the NFS server
    has been that the NFS server opens and closes file descriptors on an
    inode for every write. The result of this behaviour is that the
    ->release() function is called on every close and that results in
    XFS truncating speculative preallocation beyond the EOF.  This has
    an adverse effect on file layout when multiple files are being
    written at the same time - they interleave their extents and can
    result in severe fragmentation.
    To avoid this problem, keep track of ->release calls made on a dirty
    inode. For most cases, an inode is only going to be opened once for
    writing and then closed again during its lifetime in cache. Hence
    if there are multiple ->release calls when the inode is dirty, there
    is a good chance that the inode is being accessed by the NFS server.
    Hence set a flag the first time ->release is called while there are
    delalloc blocks still outstanding on the inode.

    If this flag is set when ->release is next called, then do not
    truncate away the speculative preallocation - leave it there so that
    subsequent writes do not need to reallocate the delalloc space. This
    will prevent interleaving of extents of different inodes written
    concurrently to the same AG.
    If we get this wrong, it is not a big deal as we truncate
    speculative allocation beyond EOF anyway in xfs_inactive() when the
    inode is thrown out of the cache.
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
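
The ->release heuristic described in the commit message above can be
sketched as a toy model. This is only an illustration under stated
assumptions, not the kernel code: the Inode class, its field names and
the write()/release() helpers are all hypothetical, and the real logic
lives in C inside XFS.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Toy stand-in for an in-cache inode; every field name is hypothetical."""
    size: int = 0                # bytes inside EOF
    prealloc: int = 0            # speculative space reserved beyond EOF, in bytes
    has_delalloc: bool = False   # delayed-allocation blocks still outstanding
    dirty_release: bool = False  # set on the first release with delalloc pending


def write(inode, nbytes, prealloc_size):
    """Buffered write: grow the file and reserve speculative space past EOF."""
    inode.size += nbytes
    inode.has_delalloc = True
    inode.prealloc = prealloc_size


def release(inode):
    """Model of ->release. Returns True if the preallocation was truncated."""
    if inode.dirty_release:
        # A dirty release was already seen, so this inode is probably being
        # written through the NFS server: leave the preallocation in place.
        return False
    truncated = inode.prealloc > 0
    inode.prealloc = 0
    if inode.has_delalloc:
        # First release while delalloc blocks are outstanding: remember it.
        inode.dirty_release = True
    return truncated


# NFS-like pattern: every write is followed by a close.
ino = Inode()
results = []
for _ in range(3):
    write(ino, 4096, 65536)
    results.append(release(ino))
print(results)  # [True, False, False]
```

Only the first close throws the preallocation away; later closes keep
it, so subsequent writes can land in the space already reserved.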

commit 055388a3188f56676c21e92962fc366ac8b5cb72
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Jan 4 11:35:03 2011 +1100

    xfs: dynamic speculative EOF preallocation
    Currently the size of the speculative preallocation during delayed
    allocation is fixed by either the allocsize mount option or a
    default size. We are seeing a lot of cases where we need to
    recommend using the allocsize mount option to prevent fragmentation
    when buffered writes land in the same AG.
    Rather than using a fixed preallocation size by default (up to 64k),
    make it dynamic by basing it on the current inode size. That way the
    EOF preallocation will increase as the file size increases.  Hence
    for streaming writes we are much more likely to get large
    preallocations exactly when we need them to reduce fragmentation.
    For default settings, the size of the initial extents is determined
    by the number of parallel writers and the amount of memory in the
    machine. For 4GB RAM and 4 concurrent 32GB file writes:
     EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET
       0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      
       1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      
       2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    
       3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    
       4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    
       5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 
       6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 
       7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 
    and for 16 concurrent 16GB file writes:
     EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               
       0: [0..262143]:          2490472..2752615      0 (2490472..2752615)      
       1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)      
       2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)    
       3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    
       4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    
       5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  
       6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  
       7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 
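
To read the extent maps above as sizes, the FILE-OFFSET ranges can be
converted from 512-byte units into MiB. The following is a minimal
sketch; the helper name extent_sizes_mib and the regular expression are
assumptions made for illustration, and the sample input is the first
four extents of the 4-writer case.

```python
import re

SECTOR_BYTES = 512  # xfs_bmap reports offsets in 512-byte units

# Matches "N: [start..end]: <first token of the rest of the line>"
EXTENT_RE = re.compile(r"^\s*\d+:\s*\[(\d+)\.\.(\d+)\]:\s*(\S+)")


def extent_sizes_mib(bmap_text):
    """Return the length of each allocated extent in MiB, in file order."""
    sizes = []
    for line in bmap_text.splitlines():
        m = EXTENT_RE.match(line)
        if m is None or m.group(3) == "hole":
            continue  # skip the header line and any holes
        start, end = int(m.group(1)), int(m.group(2))
        sizes.append((end - start + 1) * SECTOR_BYTES / 2**20)
    return sizes


sample = """\
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET
   0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)
   1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)
   2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)
   3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)
"""
print(extent_sizes_mib(sample))  # [512.0, 512.0, 1024.0, 2048.0]
```

The sizes double as the file grows, which is the dynamic behaviour the
commit message describes.
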
    Because it is hard to take back speculative preallocation, cases
    where there are large slow growing log files on a nearly full
    filesystem may cause premature ENOSPC. Hence as the filesystem nears
    full, the maximum dynamic prealloc size is reduced according to this
    table (based on 4k block size):
    freespace       max prealloc size
      >5%             full extent (8GB)
      4-5%             2GB (8GB >> 2)
      3-4%             1GB (8GB >> 3)
      2-3%           512MB (8GB >> 4)
      1-2%           256MB (8GB >> 5)
      <1%            128MB (8GB >> 6)
    This should reduce the amount of space held in speculative
    preallocation for such cases.
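
The free-space table above amounts to a right shift of the 8GB full
extent size. A minimal sketch follows, assuming a 4k block size as in
the table; the function name max_prealloc_bytes is hypothetical, and the
treatment of exact boundaries (for example exactly 5% free) is an
assumption, since the table does not say which row they fall in.

```python
GIB = 2**30
FULL_EXTENT = 8 * GIB  # "full extent" row of the table, for a 4k block size


def max_prealloc_bytes(free_pct):
    """Return the maximum dynamic prealloc size for a free-space percentage."""
    if free_pct > 5:
        shift = 0  # plenty of space: no reduction
    elif free_pct >= 4:
        shift = 2  # 4-5%  -> 2GB (boundary handling is an assumption)
    elif free_pct >= 3:
        shift = 3  # 3-4%  -> 1GB
    elif free_pct >= 2:
        shift = 4  # 2-3%  -> 512MB
    elif free_pct >= 1:
        shift = 5  # 1-2%  -> 256MB
    else:
        shift = 6  # <1%   -> 128MB
    return FULL_EXTENT >> shift


for pct in (10, 4.5, 3.5, 2.5, 1.5, 0.5):
    print(pct, max_prealloc_bytes(pct) // 2**20, "MiB")
```

Note that the table steps straight from no shift to a shift of 2, so
the cap drops from 8GB to 2GB as soon as free space falls to 5%.
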
    The allocsize mount option turns off the dynamic behaviour and fixes
    the prealloc size to whatever the mount option specifies. i.e. the
    behaviour is unchanged.
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
