<div dir="ltr">Hi Dave,<div><br></div><div>Thanks much for the suggestions. Your suggestion of not mixing preallocated and non-preallocated writes on the same file makes sense to me.</div><div><br></div><div>Regards,</div><div>Dilip</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 3, 2016 at 3:28 PM, Dave Chinner <span dir="ltr"><<a href="mailto:david@fromorbit.com" target="_blank">david@fromorbit.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Wed, Feb 03, 2016 at 02:43:27PM -0800, Dilip Simha wrote:<br>
> On Wed, Feb 3, 2016 at 1:51 PM, Dave Chinner <<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>> wrote:<br>
><br>
> > On Wed, Feb 03, 2016 at 09:02:40AM -0600, Eric Sandeen wrote:<br>
> > ><br>
> > ><br>
> > > On 2/3/16 2:30 AM, Dave Chinner wrote:<br>
> > > > On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:<br>
> > > >> Hi Dave,<br>
> > > >><br>
> > > >> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>><br>
> > wrote:<br>
> > > >><br>
> > > >>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:<br>
> > > >>>> Hi Eric,<br>
> > > >>>><br>
> > > >>>> Thank you for your quick reply.<br>
> > > >>>><br>
> > > >>>> Using xfs_io as per your suggestion, I am able to reproduce the<br>
> > issue.<br>
> > > >>>> However, I need to falloc for 256K and write for 257K to see this<br>
> > issue.<br>
> > > >>>><br>
> > > >>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k"<br>
> > /srv/node/r1/t1.txt<br>
> > > >>>> # stat /srv/node/r1/t1.txt | grep Blocks<br>
> > > >>>> Size: 263168 Blocks: 1536 IO Block: 4096 regular file<br>
> > > >>><br>
> > > >>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.<br>
> > > >>><br>
> > > >>> When you write *past the preallocated area* and do delayed<br>
> > > >>> allocation, the speculative preallocation beyond EOF is double the<br>
> > > >>> size of the extent at EOF. i.e. 512k, leading to 768k being<br>
> > > >>> allocated to the file (1536 blocks, exactly).<br>
> > > >>><br>
> > > >><br>
> > > >> Thank you for the details.<br>
> > > >> This is exactly where I am a bit perplexed. Since the reclamation<br>
> > logic<br>
> > > >> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the<br>
> > > >> allocation logic allot more blocks on such an inode?<br>
> > > ><br>
> > > > To store the data you wrote outside the preallocated region, of<br>
> > > > course.<br>
> > ><br>
> > > I think what Dilip meant was, why does it do preallocation, not<br>
> > > why does it allocate blocks for the data. That part is obvious<br>
> > > of course. ;)<br>
> > ><br>
> > > IOWS, if XFS_DIFLAG_PREALLOC prevents speculative preallocation<br>
> > > from being reclaimed, why is speculative preallocation added to files<br>
> > > with that flag set?<br>
> > ><br>
> > > Seems like a fair question, even if Swift's use of preallocation is<br>
> > > ill-advised.<br>
> > ><br>
> > > I don't have all the speculative preallocation heuristics in my<br>
> > > head like you do, Dave, but if I have it right, it's:<br>
> > ><br>
> > > 1) preallocate 256k<br>
> > > 2) inode gets XFS_DIFLAG_PREALLOC<br>
> > > 3) write 257k<br>
> > > 4) inode gets speculative preallocation added due to write past EOF<br>
> > > 5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC<br>
> > ><br>
> > > that seems suboptimal.<br>
> ><br>
> > So do things the other way around:<br>
> ><br>
> > 1) write 257k<br>
> > 2) preallocate 256k beyond EOF and speculative prealloc region<br>
> > 3) inode gets XFS_DIFLAG_PREALLOC<br>
> > 4) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC<br>
> ><br>
> > This is correct behaviour.<br>
> ><br>
><br>
> I am sorry, but I don't agree with this. How can a user application know<br>
> about step 2?<br>
<br>
</div></div>Step 2 is fallocate(FALLOC_FL_KEEP_SIZE) over a range well beyond EOF,<br>
e.g. in preparation for a bunch of sparse writes that are about to<br>
take place. So userspace will most definitely know about it. It's the<br>
kernel that now doesn't have a clue what to do about the speculative<br>
preallocation it already has, because the application is mixing its<br>
IO models.<br>
<br>
Fundamentally, if you mix writes across persistent preallocation and<br>
adjacent holes, you are going to get a mess no matter what<br>
filesystem you do this to. If you don't like the way XFS handles it,<br>
either fix the application to not do this, or use the mount option<br>
to turn off speculative preallocation.<br>
<br>
Just like we say "don't mix direct IO and buffered IO on the same<br>
file", it's a really good idea not to mix preallocated and<br>
non-preallocated writes to the same file.<br>
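For example (a hedged sketch; the path and sizes just mirror the earlier reproduction and are otherwise illustrative), sizing the fallocate to cover the whole write keeps the file in a single IO model:

```shell
# Preallocate the full range the application will write, so the
# write never extends past the preallocated region (path and
# sizes are illustrative; requires an XFS filesystem):
xfs_io -f -c "falloc 0 257k" -c "pwrite 0 257k" /srv/node/r1/t1.txt

# Blocks should now cover only 257KiB rounded up to the fs block
# size, with no extra speculative preallocation beyond EOF.
stat /srv/node/r1/t1.txt | grep Blocks
```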
<span class=""><br>
> > But if we decide that we don't do speculative prealloc when<br>
> > XFS_DIFLAG_PREALLOC is set, then workloads that mis-use fallocate<br>
> > (like swift), or use fallocate to fill sparse holes in files are<br>
> > going to fragment the hell out of their files when they extend<br>
> > them.<br>
> ><br>
><br>
> I don't understand why this would be the case. If XFS doesn't do<br>
> speculative preallocation, then the 256-byte write after the end of EOF<br>
> will simply result in pushing the EOF ahead. So I see no harm if XFS<br>
> doesn't do speculative preallocation when XFS_DIFLAG_PREALLOC is set.<br>
<br>
</span>I see *potential harm* in changing a long-standing default<br>
behaviour.<br>
<span class=""><br>
> > In reality, if swift is really just writing 1k past the prealloc'd<br>
> > range it creates, then that is clearly an application bug. Further,<br>
> > if swift is only ever preallocating the first 256k of each file it<br>
> > writes, regardless of size, then that is also an application bug.<br>
><br>
> It's not a bug. Assume a use-case like appending to a file. Would you say<br>
> append is a buggy operation?<br>
<br>
</span>If the app is using preallocation to reduce append workload file<br>
fragmentation, and then doesn't use preallocation once it is used up,<br>
then the app is definitely buggy because it's not being consistent in<br>
its IO behaviour. The app should always use fallocate() to control<br>
file layout, or it should never use fallocate() and leave the<br>
filesystem to optimise the layout as it sees best.<br>
<br>
In my experience, the filesystem will almost always do a better job<br>
of optimising allocation for best throughput and minimum seeks than<br>
applications using fallocate().<br>
<br>
IOWs, the default behaviour of XFS has been around for more than 15<br>
years and is sane for the majority of applications out there. Hence<br>
the solution here is to either fix the application that is doing<br>
stupid things with fallocate(), or use the allocsize mount option<br>
to minimise the impact of the stupid thing the buggy application is<br>
doing.<br>
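As a hedged sketch of that allocsize route (the device and mount point are placeholders), a fixed preallocation size caps how much speculation the filesystem adds at EOF:

```shell
# Pin speculative EOF preallocation to a fixed 64KiB instead of
# the dynamic default (device and mount point are placeholders):
mount -t xfs -o allocsize=64k /dev/sdb1 /srv/node/r1
```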
<div class="HOEnZb"><div class="h5"><br>
Cheers,<br>
<br>
Dave.<br>
--<br>
Dave Chinner<br>
<a href="mailto:david@fromorbit.com">david@fromorbit.com</a><br>
</div></div></blockquote></div><br></div>