Re: XFS fragmentation on file append

To: Keyur Govande <keyurgovande@xxxxxxxxx>
Subject: Re: XFS fragmentation on file append
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 8 Apr 2014 11:50:12 +1000
Cc: linux-fsdevel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAJhmKHmwvCBx=arJ6m2ZhuQVq=Jj-XbBGvEPjTPY5a1QtWRTCQ@xxxxxxxxxxxxxx>
References: <CAJhmKHmwvCBx=arJ6m2ZhuQVq=Jj-XbBGvEPjTPY5a1QtWRTCQ@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
[cc the XFS mailing list <xfs@xxxxxxxxxxx>]

On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
> Hello,
> I'm currently investigating a MySQL performance degradation on XFS due
> to file fragmentation.
> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
> running on a 12 core box.
> xfs_info shows:
> meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=576599552, imaxpct=5
>          =                       sunit=16     swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=281552, version=2
>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
> The partition is 2TB in size and 40% full to simulate production.
> Here's a test program that appends 512KB like MySQL does (write and
> then fsync). To exacerbate the issue, it loops a bunch of times:
> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
> When run, this creates ~9500 extents most of length 1024.

1024 of what? Most likely it is 1024 basic blocks (512 bytes each),
which is 512KB, the size of your writes.

Could you post the output of the xfs_bmap commands you are using to
get this information?

> cat'ing the
> file to /dev/null after dropping the caches reads at an average of 75
> MBps, way less than the hardware is capable of.

What you are doing is "open-seekend-write-fsync-close".  You haven't
told the filesystem you are doing append writes (O_APPEND, or the
append inode flag) so it can't optimise for them.

You are also cleaning the file (the fsync) before closing it, so you
are defeating the current heuristics that XFS uses to determine
whether to remove speculative preallocation on close() - if the inode
is dirty at close(), then the preallocation won't be removed. Hence
speculative preallocation does nothing for your IO pattern (i.e. the
allocsize mount option is completely useless). Remove the fsync and
you'll see your fragmentation problem go away completely.

> When I add a posix_fallocate before calling pwrite() as shown here
> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
> fragments an order of magnitude less (~30 extents), and cat'ing to
> /dev/null proceeds at ~1GBps.

That should make no difference on XFS as you are only preallocating
the 512KB region beyond EOF that you are about to write into and
hence both delayed allocation and preallocation have the same
allocation target (the current EOF block). Hence in both cases the
allocation patterns should be identical if the free space extents
they are being allocated out of are identical.

Did you remove the previous test files and sync the filesystem
between test runs so that the available freespace was identical for
the different test runs? If you didn't then the filesystem allocated
the files out of different free space extents and hence you'll get
different allocation patterns...

> The same behavior is seen even when the allocsize option is removed
> and the partition remounted.

See above.

> This is somewhat unexpected and I'm working on a patch to add
> fallocate to MySQL, wanted to check in here if I'm missing anything
> obvious here.

fallocate() of 512KB sized regions will not prevent fragmentation
into 512KB sized extents with the write pattern you are using.

If you use the inode APPEND attribute for your log files, this lets
the filesystem optimise its block management for append IO. In the
case of XFS, it then will not remove preallocation beyond EOF when
the fd is closed because the next write will be at EOF where the
speculative preallocation already exists. Then allocsize=128m will
actually work for your log files...
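
The inode APPEND attribute is the one chattr +a sets. A minimal
sketch of setting it from C follows; the helper name set_append_attr
is hypothetical. Setting the flag normally requires
CAP_LINUX_IMMUTABLE (in practice, root), and while it is set the file
can only be opened for writing with O_APPEND, so treat this as
something to adapt rather than a drop-in change.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Hypothetical helper: set the append-only inode attribute on path,
 * the same flag "chattr +a" sets. Returns 0 on success, -1 on error
 * with errno preserved from the failing call. */
int set_append_attr(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int flags = 0;                      /* these ioctls take an int */
    int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    if (rc == 0) {
        flags |= FS_APPEND_FL;          /* keep existing flags, add append */
        rc = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }

    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```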

Alternatively, set an extent size hint on the log files to define
the minimum sized allocation (e.g. 32MB) and this will limit
fragmentation without you having to modify the MySQL code at all...
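
An extent size hint is normally set with xfs_io (its extsize
command). The sketch below does the equivalent through the
FS_IOC_FSSETXATTR ioctl. It assumes kernel headers that provide
struct fsxattr and FS_XFLAG_EXTSIZE in <linux/fs.h> (older systems
expose the same interface through the xfsprogs headers), and the
helper name set_extsize_hint is hypothetical. The hint is given in
bytes and should be a multiple of the filesystem block size; it
generally has to be set before the file has any extents allocated.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Hypothetical helper: set an extent size hint of `bytes` on path.
 * Returns 0 on success, -1 on error with errno preserved. */
int set_extsize_hint(const char *path, uint32_t bytes)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct fsxattr fsx;
    int rc = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);   /* read current attrs */
    if (rc == 0) {
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;        /* enable the hint */
        fsx.fsx_extsize = bytes;                   /* hint size in bytes */
        rc = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
    }

    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

For the 32MB example above, the call would be
set_extsize_hint(path, 32u * 1024 * 1024) on a freshly created log
file.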


Dave Chinner
