Request for information on bloated writes using Swift
Dilip Simha
nmdilipsimha at gmail.com
Wed Feb 3 10:15:34 CST 2016
Thank you, Eric,
I am sorry, I missed reading your message before replying.
You got my question right.
Regards,
Dilip
On Wed, Feb 3, 2016 at 8:10 AM, Dilip Simha <nmdilipsimha at gmail.com> wrote:
> On Wed, Feb 3, 2016 at 12:30 AM, Dave Chinner <david at fromorbit.com> wrote:
>
>> On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
>> > Hi Dave,
>> >
>> > On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david at fromorbit.com>
>> wrote:
>> >
>> > > On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
>> > > > Hi Eric,
>> > > >
>> > > > Thank you for your quick reply.
>> > > >
>> > > > Using xfs_io as per your suggestion, I am able to reproduce the issue.
>> > > > However, I need to falloc for 256K and write for 257K to see this issue.
>> > > >
>> > > > # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
>> > > > # stat /srv/node/r1/t1.txt | grep Blocks
>> > > > Size: 263168 Blocks: 1536 IO Block: 4096 regular file
>> > >
>> > > Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
>> > >
>> > > When you write *past the preallocated area* and do delayed
>> > > allocation, the speculative preallocation beyond EOF is double the
>> > > size of the extent at EOF, i.e. 512k, leading to 768k being
>> > > allocated to the file (1536 blocks, exactly).
>> > >
>> >
>> > Thank you for the details.
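>> > If I read that correctly: the extent at EOF is 256K, so the speculative
>> > preallocation is 2 x 256K = 512K beyond EOF, and 256K + 512K = 768K,
>> > which is 1536 of stat's 512-byte blocks - matching what I see.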
>> > This is exactly where I am a bit perplexed. Since the reclamation logic
>> > skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
>> > allocation logic allot more blocks on such an inode?
>>
>> To store the data you wrote outside the preallocated region, of
>> course.
>>
>> > My understanding is that the fallocate caller only requested 256K worth
>> > of blocks to be available sequentially if possible.
>>
>> fallocate only guarantees the blocks are allocated - it does not
>> guarantee anything about the layout of the blocks.
>>
>> > On any subsequent write beyond the EOF, the caller is completely
>> > unaware of the underlying file-system storing that data adjacent
>> > to the first 256K data. Since XFS is speculatively allocating
>> > additional space (512K) adjacent to the first 256K data, I would
>> > expect XFS to either treat these two allocations distinctly and
>> > NOT mark XFS_DIFLAG_PREALLOC on the additional 512K of data (minus
>> > the 1K of additional data actually used), OR remove the
>> > XFS_DIFLAG_PREALLOC flag on the entire inode.
>>
>> Oh, if only it were that simple. It's way more complex than I have
>> time to explain here.
>>
>> Fundamentally, XFS_DIFLAG_PREALLOC is used to indicate that
>> persistent preallocation has been done on the file, and so if that
>> has happened we need to turn off optimistic removal of blocks
>> anywhere in the file because we can't tell what blocks had
>> persistent preallocation done on them after the fact. That's the
>> way it's been since unwritten extents were added to XFS back in
>> 1998, and I don't really see the need for it to change right now.
>>
>
> I completely understand the reasoning behind this reclamation logic and I
> also agree with it.
> But my question is with the allocation logic. I don't understand why XFS
> allocates more blocks than necessary when this flag is set and when it
> knows that it's not going to clean up the additional space.
>
> A simple example would be:
> 1: Open File in Write mode.
> 2: Fallocate 256K
> 3: Write 256K
> 4: Close File
>
> Stat shows that XFS allocated 512 blocks as expected.
>
> 5: Open file in append mode.
> 6: Write 256 bytes.
> 7: Close file.
>
> The expectation is that the number of blocks allocated is either 512+1 or
> 512+8, depending on the block size.
> However, XFS uses speculative preallocation to allocate 512K (as per your
> explanation) just to write 256 bytes, and hence the overall disk usage goes
> up to 1536 blocks.
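>
> For reference, a rough xfs_io equivalent of that sequence (just a sketch;
> the file name is only illustrative, the block counts are stat's 512-byte
> blocks, and the final count assumes the doubling behaviour you described):
>
> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" /srv/node/r1/t2.txt
> # stat /srv/node/r1/t2.txt | grep Blocks      <- 512 blocks, as expected
> # xfs_io -c "pwrite 256k 256" /srv/node/r1/t2.txt
> # stat /srv/node/r1/t2.txt | grep Blocks      <- 1536 blocks
>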
> Now, who is responsible for clearing up the additional allocated blocks?
> Clearly the application has no idea about the over-allocation.
>
> I agree that if an application uses fallocate and delayed allocation on
> the same file in the same IO, then it's a badly structured application. But
> in this case we have two different IOs on the same file. The first IO did
> not expect an append and hence issued an fallocate. So that looks good to
> me.
>
> Your thoughts on this?
>
> Regards,
> Dilip
>
>
>> If an application wants to mix fallocate and delayed allocation
>> writes to the same file in the same IO, then that's an application
>> bug. It's going to cause bad IO patterns and file fragmentation and
>> have other side effects (as you've noticed), and there's nothing the
>> filesystem can do about it. fallocate() requires expertise to use in
>> a beneficial manner - most developers do not have the required
>> expertise (and don't have enough expertise to realise this) and so
>> usually make things worse rather than better by using fallocate.
>>
>> > Also, is there any way I can check for this flag?
>> > The FLAGS column, as observed from xfs_bmap, doesn't show any flags set.
>> > Am I not looking at the right flags?
>>
>> xfs_io -c stat <file>
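>>
>> For example (the file name is just the one from your reproduction, and the
>> grep only narrows the output to the relevant line):
>>
>> # xfs_io -c stat /srv/node/r1/t1.txt | grep fsxattr.xflags
>>
>> The preallocation flag, if set, should show up there as bit 0x2, printed
>> as 'p' in the bracketed flag string.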
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david at fromorbit.com
>>
>
>