
Re: [PATCH 2/4] [RFC] xfs: limit speculative prealloc size on sparse fil

To: Brian Foster <bfoster@xxxxxxxxxx>
Subject: Re: [PATCH 2/4] [RFC] xfs: limit speculative prealloc size on sparse files
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 23 Jan 2013 08:34:40 +1100
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <50FEE2AC.9050005@xxxxxxxxxx>
References: <1358772835-21436-1-git-send-email-david@xxxxxxxxxxxxx> <1358772835-21436-3-git-send-email-david@xxxxxxxxxxxxx> <50FEE2AC.9050005@xxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, Jan 22, 2013 at 02:04:12PM -0500, Brian Foster wrote:
> On 01/21/2013 07:53 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > This is an RFC that follows up from a conversation Eric and I had on
> > IRC.  The idea is to prevent EOF speculative preallocation from
> > triggering larger allocations on IO patterns of
> > truncate-to-zero-seek-write-seek-write-..., which results in
> > non-sparse files for large files. This, unfortunately, is the way cp
> > behaves when copying sparse files, and it results in sub-optimal
> > destination file layouts.
> > 
> > This code looks at the current extent over the
> > new EOF location, and if it is a hole it turns off preallocation
> > altogether. To keep the next write from doing a large prealloc, it
> > takes the size of subsequent preallocations from the current size of
> > the existing EOF extent. IOWs, if you leave a hole in the file, it
> > resets preallocation behaviour to the same as if it was a zero size
> > file.
> > 
> > I haven't fully tested this, so I'm not sure if it works exactly
> > like I think it should, but I wanted to get this out there to get
> > more eyes on it...
> > 
> On a quick test, I didn't quite get the behavior documented below. Is it
> possible your test file had the initial extent preallocated from an xfs
> module with the current preallocation scheme?

No, I didn't run the test on an unmodified kernel. It is possible
that I didn't remove or truncate the test file between identical
tests or tests with different offsets, though.

<reruns test on a freshly mkfs'd fs>

I get the same result as what I posted. Note that I am using a CRC enabled
kernel and filesystem here, and it's 17TB in size, but that shouldn't affect
the preallocation algorithm...

$ sudo mkfs.xfs -f -l size=131072b,sunit=8 -m crc=1 /dev/vdc
meta-data=/dev/vdc               isize=512    agcount=17, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
         =                       crc=1
data     =                       bsize=4096   blocks=4563402735, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o nobarrier,logbsize=256k /dev/vdc /mnt/scratch
$ sudo xfs_io -f -c "pwrite 0 31m" -c "pwrite 33m 1m" -c "pwrite 128m 1m" -c "fiemap -v" /mnt/scratch/blah
wrote 32505856/32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.036 GiB/sec and 271501.8816 ops/sec)
wrote 1048576/1048576 bytes at offset 34603008
1 MiB, 256 ops; 0.0000 sec (738.007 MiB/sec and 188929.8893 ops/sec)
wrote 1048576/1048576 bytes at offset 134217728
1 MiB, 256 ops; 0.0000 sec (55.772 MiB/sec and 14277.7468 ops/sec)
   0: [0..65535]:        128..65663       65536   0x0
   1: [65536..67583]:    hole              2048
   2: [67584..133119]:   67712..133247    65536   0x0
   3: [133120..262143]:  hole            129024
   4: [262144..393215]:  262272..393343  131072   0x1

> What I see is that sequential writes to a file disable preallocation
> completely (so the first extent in the test below is 31m instead of
> 32m). Digging a bit further, it seemed to be due to start_fsb always
> being a hole. I hacked that a bit to read the extent of the block
> immediately previous to the write offset (instead of the inode size), e.g.:
>       start_fsb = XFS_B_TO_FSBT(mp, offset);
>       if (start_fsb)
>               start_fsb--;
> ... and I seem to get expected behavior, at least in the simple xfs_io test.

I'll have a look at it if I get time before LCA, otherwise it will
be a couple of weeks before I get back to it.


Dave Chinner
