Thanks again for your replies.
On Thu, Sep 16, 2010 at 10:18:37AM +1000, Dave Chinner wrote:
> On Wed, Sep 15, 2010 at 10:26:33AM -0500, Shawn Bohrer wrote:
> > Hello,
> > A little while ago I asked about ways to solve the occasional spikes in
> > latency that I see when writing to a shared memory mapped file.
> > http://oss.sgi.com/pipermail/xfs/2010-July/046311.html
> > With Dave's suggestions I enabled lazy-count=1 which did help a little:
> > # xfs_info /home/
> > meta-data=/dev/sda5 isize=256 agcount=32, agsize=8472969 blks
> > = sectsz=512 attr=1
> > data = bsize=4096 blocks=271135008, imaxpct=25
> > = sunit=0 swidth=0 blks
> > naming =version 2 bsize=4096 ascii-ci=0
> > log =internal bsize=4096 blocks=32768, version=1
> > = sectsz=512 sunit=0 blks, lazy-count=1
> > realtime =none extsz=4096 blocks=0, rtextents=0
> > I'm also mounting the partition with "noatime,nobarrier,logbufs=8".
> You really should add logbsize=262144 there - it won't prevent the
> latencies, but with less log writes the incidence should decrease
I initially tested with logbsize=256k but did not notice any
difference. I then later realized that I had been doing a:
mount -o remount,logbsize=256k /home
You may also notice from above that I had version 1 logs and the max
logbsize for version 1 logs is 32k. Apparently the "mount -o remount"
silently ignores the option. If I instead unmount and remount it
complains and refuses to mount.
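Since a remount can drop an option without any error, it is worth checking what is actually in effect. The following is a minimal Python sketch that reads /proc/mounts-style text and reports whether an option is present. The helper names and the sample line are hypothetical, and it assumes the filesystem lists logbsize among its mount options when the option has taken effect.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last matching entry, which is the one on top.
            options = fields[3].split(",")
    return options


def has_option(options, name):
    """True if an option called name (bare or as name=value) is present."""
    return any(opt == name or opt.startswith(name + "=")
               for opt in (options or []))


# Hypothetical sample line, not output captured from the real machine.
SAMPLE = "/dev/sda5 /home xfs rw,noatime,nobarrier,logbufs=8 0 0\n"
print("logbsize in effect:", has_option(mount_options(SAMPLE, "/home"), "logbsize"))
```

On a live system the text would come from reading /proc/mounts instead of SAMPLE.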
Yesterday I converted to version 2 logs and ran tests for an hour with
both logbsize=32k and logbsize=256k and I still don't see any
noticeable difference. This of course assumes I tested correctly this time.
> > The other change I made which helped the most was to use fallocate()
> > to grow the file instead of lseek() and a write().
> Ok, so now you are doing unwritten extent conversion at IO
> completion, which is where this new latency issue has come from.
I'm a complete newbie when it comes to filesystems. For my education
would you mind elaborating a little more here, or pointing to
something I could read?
An extent describes a contiguous section of data on disk, correct?
So when the page fault occurs and modifies the block the extent is
modified in memory to record the change? What does "conversion" mean
in this context? And as silly as it sounds what exactly does IO
completion refer to? Is this when the data is written to disk?
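To make the comparison concrete, here is a minimal Python sketch of the two ways of growing a file discussed above. It is illustrative only: the real application is not shown in this thread, the function names are hypothetical, and os.posix_fallocate() stands in for the fallocate() call (the two are related but not identical).

```python
import os
import tempfile


def grow_with_fallocate(fd, new_size):
    """Reserve space from offset 0 up to new_size without writing data.

    On XFS this is expected to leave the new range as unwritten extents,
    which is the case discussed in this thread.
    """
    os.posix_fallocate(fd, 0, new_size)


def grow_with_seek_and_write(fd, new_size):
    """Extend the file by writing one byte at the new last offset."""
    os.lseek(fd, new_size - 1, os.SEEK_SET)
    os.write(fd, b"\0")


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        grow_with_fallocate(f.fileno(), 1 << 20)
        print("size after fallocate:", os.fstat(f.fileno()).st_size)
```

Both functions leave the file at the requested size. What differs is how the filesystem allocates and records the new range, which this sketch does not measure.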
> Ok, the transaction was blocked on a buffer that had its IO
> completion queued to the xfslogd. But this happened some
> 320ms after the above page fault occurred, and 340ms after the
> xfsconvertd got stuck waiting for it. In other words, it looks
> like it took at least 340ms for the buffer IO to complete after
> it was issued.
OK, so xfslogd is writing the log, which frees up log buffers.
Meanwhile xfsconvertd is waiting on a free buffer so it can write more
to the log, correct? So setting logbufs=8 gives me 8 log buffers and
logbsize controls how big each of those buffers is, correct? When are
these buffers filled and freed? Are they filled when the process
actually performs the write in memory, or at writeback time? Likewise
it seems that they are freed at writeback time, correct? Also, do the
buffers only get freed when they fill completely, or are they also
flushed when partially full?
I get why you suggest increasing logbsize but I'm curious why I don't
see any difference. Could it be because I always end up running out
of log buffers during writeback even at 256k so some of the I/O gets
stalled anyway? Maybe increasing the logbsize increases the threshold
of the amount of data I can writeback before I see a spike?
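As a rough sanity check on the sizes involved, the sketch below multiplies the number of log buffers by the size of each one. It is only arithmetic under the assumption that logbufs and logbsize simply multiply, and the helper names are hypothetical. It says nothing about when the buffers are filled or flushed.

```python
def parse_size(text):
    """Parse a size such as '32k' or '262144' into bytes (k means 1024)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def total_log_buffer_bytes(logbufs, logbsize):
    """Total in-memory log buffer space for the given mount options."""
    return logbufs * parse_size(logbsize)


for size in ("32k", "256k"):
    print("logbufs=8,logbsize=%s -> %d bytes" % (size, total_log_buffer_bytes(8, size)))
```

With 8 buffers that works out to 256 KiB of buffer space at logbsize=32k and 2 MiB at logbsize=256k.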
> And so the delay your app saw was ~320ms. Basically, it blocked
> waiting for an IO to complete. I don't think there is anything we can
> really do from a filesystem point of view to avoid that - we cannot
> avoid metadata buffer writeback indefinitely.
One more bit of information which may be relevant here is that since I
see these latencies during writeback I've increased
vm.dirty_writeback_centisecs from the default 500 to 30000. I'm OK
with losing 5 minutes of data in the event of a crash, and at our
data rates we still stay well below the vm.dirty_background_ratio.
This does improve the spikes (I only see them every 5 min) but
intuitively this seems like it might actually make the magnitude of the
delays larger since there is more to write back. Strangely from my
point of view it doesn't seem to increase the magnitude of the spikes,
so I'm not entirely sure how it really fits into the big picture. Of
course with higher data rates it does take longer to write out the
data so the duration where the spikes can occur does increase.
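The arithmetic behind that intuition can be sketched as follows. The data rate used here is a made-up placeholder, not a measurement from this system, and the result is only an upper bound that assumes nothing else triggers writeback during the interval.

```python
def writeback_interval_seconds(centisecs):
    """Convert a vm.dirty_writeback_centisecs value to seconds."""
    return centisecs / 100.0


def dirty_bytes_per_interval(rate_bytes_per_sec, centisecs):
    """Upper bound on data dirtied between two periodic writeback runs."""
    return rate_bytes_per_sec * writeback_interval_seconds(centisecs)


HYPOTHETICAL_RATE = 1024 * 1024  # 1 MiB/s, an assumed figure for illustration

for centisecs in (500, 30000):
    mib = dirty_bytes_per_interval(HYPOTHETICAL_RATE, centisecs) / (1024 * 1024)
    print("%d centisecs = %.0f s, up to %.0f MiB per interval"
          % (centisecs, writeback_interval_seconds(centisecs), mib))
```

At the assumed rate, going from 500 to 30000 centisecs raises the per-interval bound from 5 MiB to 300 MiB, which is why more data per writeback run would be the expected result.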
> Fundamentally you are seeing the reason why filesystems cannot
> easily guarantee maximum bound latencies - if we have to wait for IO
> for anything, then the latency is effectively uncontrollable. XFS
> does as much as possible to avoid such latencies for data IO, but
> even then it's not always possible. Even using the RT device in XFS
> won't avoid these latencies - it's caused by latencies in metadata
> modification, not data....
> Effectively, the only way you can minimise this is to design your
> storage layout for minimal IO latency under writes (e.g. use mirrors
> instead of RAID5, etc) or use faster drives. Also using the deadline
> scheduler (if you aren't already) might help....
I have tested with the deadline and noop schedulers and they actually
make the spikes noticeably worse than cfq. I'm not sure why that would
be, but one thought is that I am writing lots of small sequential
chunks, which from a whole-disk perspective creates a lot of
random IO. Perhaps cfq is just better at merging these requests.