On Tue, Jan 22, 2013 at 02:04:12PM -0500, Brian Foster wrote:
> On 01/21/2013 07:53 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > This is an RFC that follows up on a conversation Eric and I had on
> > IRC. The idea is to prevent EOF speculative preallocation from
> > triggering larger allocations on IO patterns of
> > truncate-to-zero-seek-write-seek-write-..., which results in
> > non-sparse files for large files. This, unfortunately, is the way cp
> > behaves when copying sparse files, and it results in sub-optimal
> > destination file layouts.
> > What this code does is that it looks at the current extent over the
> > new EOF location, and if it is a hole it turns off preallocation
> > altogether. To prevent the next write from doing a large prealloc, it
> > takes the size of subsequent preallocations from the current size of
> > the existing EOF extent. IOWs, if you leave a hole in the file, it
> > resets preallocation behaviour to the same as if it was a zero size
> > file.
> > I haven't fully tested this, so I'm not sure if it works exactly
> > like I think it should, but I wanted to get this out there to get
> > more eyes on it...
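As a rough illustration of the heuristic described above (a sketch under stated assumptions, not the patch itself): if the extent over the new EOF is a hole, preallocation is switched off; otherwise the next preallocation takes its size from the existing EOF extent. The struct `eof_extent` and the function `next_prealloc_blocks` are hypothetical names invented for this example and do not exist in the XFS code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the extent found at the new EOF. */
struct eof_extent {
    bool     is_hole;     /* true if nothing is allocated at the EOF */
    uint64_t len_blocks;  /* length of the extent in filesystem blocks */
};

/*
 * Return the size, in blocks, of the next speculative preallocation.
 * 0 means "no preallocation". Assumption: a missing extent is treated
 * the same as a hole.
 */
uint64_t next_prealloc_blocks(const struct eof_extent *ext)
{
    if (ext == NULL || ext->is_hole)
        return 0;               /* hole at EOF: turn preallocation off */
    return ext->len_blocks;     /* size follows the existing EOF extent */
}
```

With this shape, a file that has just had a hole punched into it by a seek starts again from small preallocations, as a zero-size file would.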
> On a quick test, I didn't quite get the behavior documented below. Is it
> possible your test file had the initial extent preallocated from an xfs
> module with the current preallocation scheme?
No, I didn't run the test on an unmodified kernel. It is possible
that I didn't remove it or truncate it between identical tests or
tests with different offsets, though.
<reruns test on a freshly mkfs'd fs>
I get the same result as what I posted. Note that I am using a CRC enabled
kernel and filesystem here, and it's 17TB in size, but that shouldn't affect
the preallocation algorithm...
$ sudo mkfs.xfs -f -l size=131072b,sunit=8 -m crc=1 /dev/vdc
meta-data=/dev/vdc               isize=512    agcount=17, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=4563402735, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o nobarrier,logbsize=256k /dev/vdc /mnt/scratch
$ sudo xfs_io -f -c "pwrite 0 31m" -c "pwrite 33m 1m" -c "pwrite 128m 1m" -c "fiemap -v" /mnt/scratch/blah
wrote 32505856/32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.036 GiB/sec and 271501.8816 ops/sec)
wrote 1048576/1048576 bytes at offset 34603008
1 MiB, 256 ops; 0.0000 sec (738.007 MiB/sec and 188929.8893 ops/sec)
wrote 1048576/1048576 bytes at offset 134217728
1 MiB, 256 ops; 0.0000 sec (55.772 MiB/sec and 14277.7468 ops/sec)
 EXT: FILE-OFFSET         BLOCK-RANGE         TOTAL FLAGS
   0: [0..65535]:         128..65663          65536   0x0
   1: [65536..67583]:     hole                 2048
   2: [67584..133119]:    67712..133247       65536   0x0
   3: [133120..262143]:   hole               129024
   4: [262144..393215]:   262272..393343     131072   0x1
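For reading the fiemap output: the offsets and totals are in 512-byte units, which is consistent with the write at byte offset 34603008 showing up at 67584. A minimal conversion sketch follows; the helper name `fiemap_units_to_mib` is made up for this example.

```c
#include <stdint.h>

/* xfs_io's fiemap output counts in 512-byte units. */
#define FIEMAP_UNIT_BYTES 512ULL
#define MIB (1024ULL * 1024ULL)

/* Convert a count of 512-byte units to whole MiB, rounding down. */
uint64_t fiemap_units_to_mib(uint64_t units)
{
    return (units * FIEMAP_UNIT_BYTES) / MIB;
}
```

Applied to the table: extent 0 is 65536 units, or 32 MiB, for a 31 MiB write; extent 2 starts at 33 MiB and is also 32 MiB long; extent 4 starts at 128 MiB and is 64 MiB long.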
> What I see is that sequential writes to a file disable preallocation
> completely (so the first extent in the test below is 31m instead of
> 32m). Digging a bit further, it seemed to be due to start_fsb always
> being a hole. I hacked that a bit to read the extent of the block
> immediately previous to the write offset (instead of the inode size), e.g.:
> start_fsb = XFS_B_TO_FSBT(mp, offset);
> if (start_fsb)
> ... and I seem to get expected behavior, at least in the simple xfs_io test.
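To make the lookup change quoted above concrete, here is a small user-space sketch. It assumes 4096-byte blocks (bsize=4096 in the mkfs output), so a truncating byte-to-block conversion is a right shift by 12. `b_to_fsbt` is only a stand-in for XFS_B_TO_FSBT, and `lookup_block_for_write` is a hypothetical name; the decrement is one reading of "the block immediately previous to the write offset" and is not taken from the elided part of the snippet.

```c
#include <stdint.h>

/* Assumption: 4096-byte filesystem blocks, i.e. a block log of 12. */
#define BLOCK_LOG 12

/* Byte offset to block number, rounding down (stand-in for XFS_B_TO_FSBT). */
uint64_t b_to_fsbt(uint64_t offset)
{
    return offset >> BLOCK_LOG;
}

/*
 * Block whose extent would be examined for a write at 'offset':
 * the block just before the write, or block 0 if the write starts
 * inside the first block.
 */
uint64_t lookup_block_for_write(uint64_t offset)
{
    uint64_t start_fsb = b_to_fsbt(offset);

    if (start_fsb)
        start_fsb--;    /* step back to the previous block */
    return start_fsb;
}
```

For the 33m write in the test, the offset maps to block 8448, so the lookup would land on block 8447, which lies in the hole between 32 MiB and 33 MiB.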
I'll have a look at it if I get time before LCA, otherwise it will
be a couple of weeks before I get back to it.