
Re: unexpected high fragmentation, any ideas?

To: Steve Lord <lord@xxxxxxx>, Russell Cattelan <cattelan@xxxxxxx>, Chris Wedgwood <cw@xxxxxxxx>
Subject: Re: unexpected high fragmentation, any ideas?
From: Marc Lehmann <schmorp@xxxxxxxxxx>
Date: Sun, 3 Apr 2005 23:52:30 +0200
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <20050403190327.GA13211@xxxxxxxxxxxxxxxxxxxxx> <20050403173940.GA12673@xxxxxxxxxxxxxxxxxxxxx> <42500F94.1070603@xxxxxxxxxxx> <4250000C.5070709@xxxxxxx>
References: <20050403135415.GB24559@xxxxxxxxxx> <20050403173940.GA12673@xxxxxxxxxxxxxxxxxxxxx> <20050403004653.GA981@xxxxxxxxxx> <20050403050542.GB5727@xxxxxxxxxxxxxxxxxxxxx> <20050403135805.GC24559@xxxxxxxxxx> <42500F94.1070603@xxxxxxxxxxx> <20050403004653.GA981@xxxxxxxxxx> <20050403050542.GB5727@xxxxxxxxxxxxxxxxxxxxx> <20050403135805.GC24559@xxxxxxxxxx> <4250000C.5070709@xxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
Hi, I am answering these mails in one go.

First of all, thanks for the many suggestions, there are lots of ideas to
try out. BTW, this is not a real _big_ problem, as the new disk I now use
XFS on is just fast enough to record/transcode/read two streams at the
same time without skips, so my quest for less fragmentation is mainly an
academic one (I'd also like to let the machine do other things at the same
time, which would be helped by less fragmentation). XFS so far performs
quite adequately.

On Sun, Apr 03, 2005 at 09:39:08AM -0500, Steve Lord <lord@xxxxxxx> wrote:
> Do you know if the app is using buffered or direct I/O ? This is
> an O_SYNC on the writes?

Buffered I/O. It just opens, writes with large block sizes, and closes, no

> How much memory do you have in the box, and what kernel are you
> running.

640MB, pristine 2.6.11

free usually shows around 200MB free without buffers.

> I just tried two streams  of dd using 1M byte writes of 2G
> files - on my laptop with 512M of memory and a single partition
> which has my whole system on it, so the layout is non optimal.

The difference is that my "dd's" are very slow, 500-1500kb/s, over a few

> One file got 49 extents, the other got 25. Combined throughput was
> 25 M/sec which is not too bad for my laptop.

It certainly would be good for mine :->

> on its own ended up with 13 extents - which shows the state of the
> free space on my disk.

The conditions should be much much better on my disk, there are 127 files
_total_ on that disk, and almost all of them are larger than 800MB. The disk
contents were recently copied from a JFS disk and xfs_fsr was run, so there
was no external fragmentation, and I'd think it is very unlikely that there
are many (relatively) small fragments.

> If I switch to O_SYNC for writes it all goes
> to pot very quickly, the files checkerboard with each other, but
> the extents do come out in order.

Very interesting, I wonder why they checkerboard at all - I'd hoped XFS
was smart enough not to do this, as even ext3 is, to a certain extent, and
AFAIK, ext3 doesn't do allocate-on-flush.

> It may be the fact that your data is actually coming in slower than
> this that is causing part of the problem. If the streams are getting
> synced out before they have a chance to build a large chunk of in
> memory data that might do it.

Yes, very likely.

> If you have enough control, try making each incoming stream live in
> its own directory, that will make the allocator start them off in
> different allocation groups by default.

Well... involves quite a bit of hackery in the app I use (mythtv, btw.),
but I might be able to control the transcoding. I'll try to implement

> Are you replaying or transcoding from the same streams as are being
> recorded or from different ones?

It's possible, but rare, that I watch a stream while it is recording, so
streams that are being written are not usually being read at the same
time, but there is read activity while there is write activity, usually at a
speed similar to that of the writers.

> The reverse ordering you notice is a little odd, it is definitely
> supposed to prefer going the other way.


On Sun, Apr 03, 2005 at 10:45:24AM -0500, Russell Cattelan <cattelan@xxxxxxx> wrote:
> You are right about one thing: multiple writers will cause file interleaving.


> It's interesting that delayed allocation is not helping as well as it should.
> XFS should be able to cluster delayed allocate pages together and thus
> ask the allocator for a larger contiguous block in one go.

Hmm... that could be what is going on, actually. If *only* blocks that
are in-memory will be clustered together then it's easily explainable,
as the kernel will flush data blocks every 50 seconds minimum, which
corresponds to about 30MB of data that, in best circumstances, can be
clustered together.
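
As a rough check of that estimate, here is a small sketch of the arithmetic.
It assumes the 50 second interval and the 500-1500 kB/s stream rates mentioned
in this mail, counts 1 kB as 1000 bytes, and uses a function name that is made
up for illustration:

```python
def clusterable_bytes(rate_kb_per_s: int, flush_interval_s: int) -> int:
    """Upper bound on the dirty data one writer can build up between flushes."""
    return rate_kb_per_s * 1000 * flush_interval_s


def describe(rate_kb_per_s: int, flush_interval_s: int = 50) -> str:
    """Format the bound in MB for a given stream rate."""
    mb = clusterable_bytes(rate_kb_per_s, flush_interval_s) / 1e6
    return f"{rate_kb_per_s} kB/s over {flush_interval_s} s: {mb:.0f} MB"
```

At 600 kB/s this gives the 30MB figure above; the slowest and fastest streams
would bound the clusterable chunk at about 25MB and 75MB respectively.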

> I think the problem you are running into is that with a slow writing app
> pdflush is pushing pages out to disk too quickly.


> A way to test that is to increase the pdflush interval, don't remember which
> proc value you need to change for that, dirty_writeback_centisecs I think.

Just did that (increased dirty_writeback_centisecs from 500 to 1800) and
will check on whether this affects fragmentation (the machine might not
have enough memory for this to work effectively at all times, but it could
have a noticeable effect).
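
For anyone wanting to script this, a minimal sketch follows. It assumes the
tunables live under /proc/sys/vm as on a stock Linux kernel and that the
process may write there (normally root only). The helper names are invented
for this example, and the directory is a parameter so the code can be tried
against a scratch directory first:

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")  # assumed location of the writeback tunables


def read_tunable(name: str, vm_dir: Path = VM_DIR) -> int:
    """Return the current integer value of a vm tunable."""
    return int((vm_dir / name).read_text().split()[0])


def write_tunable(name: str, value: int, vm_dir: Path = VM_DIR) -> int:
    """Set a vm tunable and return the value it had before."""
    old = read_tunable(name, vm_dir)
    (vm_dir / name).write_text(f"{value}\n")
    return old
```

Calling write_tunable("dirty_writeback_centisecs", 1800) would make the change
described above and hand back the old value so it can be restored later.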

I also increased dirty_background_ratio and dirty_ratio to 30 and 60,

On Sun, Apr 03, 2005 at 10:39:40AM -0700, Chris Wedgwood <cw@xxxxxxxx> wrote:
> > yes, same directory. imagine 2-4 slow "dd of=/xfs/video.dat"'s
> the allocator tries to place files in the same ag (near their parent
> directory)

Ok, that explains the fragmentation.

> > ext3 looks much better as it (seemingly) tries to allocate the files
> > in different block groups when multiple files are being written.
> we could put a sysctl or something mount option --- im not sure on the
> whole it's useful

Well, it would be useful to have a filesystem that works fine with
streaming media applications, but it's not typical usage, indeed, and
might not be worth the effort.

(and xfs_fsr remedies that to a great degree. btw., how would the realtime
subvolume allocator handle that, if it were ported?)

What could be useful would be a mount option to force the block allocator
to treat files in the same directory like files in different directories.

> the current behaviour is pretty nice for smaller files in the same
> directory like source-trees for example as they are closer together
> reducing seeks making some IO patterns better

Indeed. However, even when I mkfs.xfs a partition and then untar some
files onto it, all the files have 2 extents (i.e. only one, fast, writer),
which I also find a bit peculiar.

(again, xfs_fsr makes all complaints about that go away completely. I
think an online-defragmenter is the way to go, even a simple one like
xfs_fsr, especially when volumes start to become full and fragmentation is
unavoidable even with the best allocator).

> > xfs_fsr, OTOH, does a perfect job - all files are single-extent
> > files after it ran, even when I run it while there are three other
> > writers!
> fsr preallocates space in as large-chunks as it can... it's not always
> going to be a single extent but with enough free space you will
> usually get that

It would be interesting to teach mythtv to tell that to XFS, i.e.
preallocate all files to, say 3GB when recording and transcoding. As long
as enough "3GB slots" are free, this should work nicely. Too bad this
can't be done with other filesystems :->
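
A minimal sketch of that idea, under some loudly stated assumptions: it uses
Python's os.posix_fallocate (available on Linux) as a portable stand-in, NOT
the XFS-specific reservation interface discussed in this thread, and the
function names are made up. Note that posix_fallocate also extends the file
size, so the file has to be cut back to the real length when recording ends:

```python
import os


def preallocate(path: str, size_bytes: int) -> int:
    """Create path, reserve size_bytes for it, and return an open descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Ask the filesystem for the whole range up front, so it can pick
        # large free regions instead of growing the file piecemeal.
        os.posix_fallocate(fd, 0, size_bytes)
    except OSError:
        os.close(fd)
        raise
    return fd


def finish(fd: int, bytes_written: int) -> None:
    """Drop the unused tail of the reservation and close the file."""
    os.ftruncate(fd, bytes_written)
    os.close(fd)
```

A recorder would call preallocate(path, 3 * 1024**3) when it starts a stream
and finish(fd, actual_length) when the stream ends.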

> > I'd run xfs_fsr continuously, but the i/o bandwidth lost is immense,
> > and xfs_fsr tends to copy gigabytes of a file and then detects that
> > the file is being modified, which somewhat precludes its use on a
> > busy filesystem.
> two things you can try
>   (1) create the files in different directories (you can move them
>       once created to the same place if you like)

I'll try to implement that for the transcoders.
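
Roughly what I have in mind, as a sketch only: each output file is created in
its own freshly made subdirectory and then renamed into the shared directory.
The ".staging-" naming and both function names are invented here, and whether
this really spreads the files over allocation groups is exactly what needs
testing:

```python
import os


def create_in_own_dir(final_dir: str, name: str):
    """Create name inside a private subdirectory of final_dir."""
    staging = os.path.join(final_dir, f".staging-{name}")
    os.makedirs(staging, exist_ok=True)
    tmp_path = os.path.join(staging, name)
    return open(tmp_path, "wb"), tmp_path


def move_into_place(tmp_path: str, final_dir: str) -> str:
    """Rename the file into final_dir and remove its staging directory."""
    final_path = os.path.join(final_dir, os.path.basename(tmp_path))
    # A rename within one filesystem only changes the directory entry,
    # the data blocks stay where they were allocated.
    os.rename(tmp_path, final_path)
    os.rmdir(os.path.dirname(tmp_path))
    return final_path
```

Since the open file object stays valid across the rename, the move can happen
right after creation or only once the writer is done.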

>   (2) preallocate space --- this is XFS specific though.  this will
>       end up giving you more-or-less the same result as what you end
>       up with after running xfs_fsr

Would probably be easier to hack into mythtv, will think about that

> are you able to try either of those?  it would be nice to know if (1)

I don't know when I can get around to trying that, but I'll certainly want
to try (1).

> makes a difference as that behaviour is something that could be
> tweaked/changed as a mount option maybe

Would it be easier to just disable it in the kernel? Patching kernel code
would certainly be easier for me than implementing it inside mythtv. I'd
look myself, but do you have any pointers to which file to look at? That
would be great, thanks!

On Sun, Apr 03, 2005 at 12:03:27PM -0700, Chris Wedgwood <cw@xxxxxxxx> wrote:
> for really slow writes i found a large biosize helped.  i've had this

Is this something inside XFS, or is this just setting the st_blksize stat
data field? If the latter, then it's unlikely to help, as I configured
mythtv to use fairly large writes already (minimum 512KB, usually around
2MB). But thanks for the tip, it might help some other XFS filesystems I
have (although it isn't a problem there).

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@xxxxxxxx
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
