
Re: 12x performance drop on md/linux+sw raid1 due to barriers [xfs]

To: xfs@xxxxxxxxxxx
Subject: Re: 12x performance drop on md/linux+sw raid1 due to barriers [xfs]
From: Martin Steigerwald <Martin@xxxxxxxxxxxx>
Date: Sun, 14 Dec 2008 19:12:51 +0100
Cc: Linux RAID <linux-raid@xxxxxxxxxxxxxxx>
In-reply-to: <18757.4606.966139.10342@xxxxxxxxxxxxxxxxxx>
References: <alpine.DEB.1.10.0812060928030.14215@xxxxxxxxxxxxxxxx> <1229225480.16555.152.camel@localhost> <18757.4606.966139.10342@xxxxxxxxxxxxxxxxxx> (sfid-20081214_183524_928808_CA8411E0)
User-agent: KMail/1.9.9
On Sunday, 14 December 2008, Peter Grandi wrote:
> First of all, why are you people sending TWO copies to the XFS
> mailing list? (to both linux-xfs@xxxxxxxxxxx and xfs@xxxxxxxxxxx).

I just kept the CC, since keeping it seems to be the custom on the xfs 
mailing list. I have stripped it this time.

> >>> At the moment it appears to me that disabling write cache
> >>> may often give more performance than using barriers. And
> >>> this doesn't match my expectation of write barriers as a
> >>> feature that enhances performance.
> >>
> >> Why do you have that expectation?  I've never seen barriers
> >> advertised as enhancing performance.  :)
>
> This entire discussion is based on the usual misleading and
> pointless avoidance of the substance, in particular because of
> a stupid, shallow disregard for the particular nature of the
> "benchmark" used.
>
> Barriers can be used to create atomic storage transactions for
> metadata or data. For data, they mean that 'fsync' does what it
> is expected to do. It is up to the application to issue 'fsync'
> as often or as rarely as appropriate.
>
> For metadata, it is the file system code itself that uses
> barriers to do something like 'fsync' for metadata updates, and
> enforce POSIX or whatever guarantees.
>
> The "benchmark" used involves 290MB of data in around 26k files
> and directories, that is, the average file size is around 11KB.
>
> That means that an inode is created and flushed to disk every
> 11KB written; a metadata write barrier happens every 11KB.
>
> A synchronization every 11KB is a very high rate, and it will
> (unless the disk host adapter or the disk controller is clever
> or has battery-backed memory for queues) involve a lot of
> waiting for the barrier to complete, and presumably break the
> smooth flow of data to the disk with pauses.

But, as far as I understood it, the filesystem doesn't have to wait for 
barriers to complete; it can happily continue issuing IO requests. A 
barrier only means that any request issued before it has to land before 
it, and any request issued after it has to land after it. It doesn't 
mean that the barrier has to complete immediately and that the 
filesystem has to wait for it.

At least that has always been the whole point of barriers for me. If 
that's not the case, I have misunderstood their purpose about as 
thoroughly as possible.
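
If that reading is right, the semantics can be sketched as a toy 
user-space model (purely illustrative - these are not kernel APIs, just 
a simulation of "order across the barrier, reorder freely within a run, 
never make the issuer wait"):

```python
from collections import deque

# Toy model of barrier semantics: submission never blocks; the "device"
# may reorder requests freely, but never across a barrier.
def issue(queue, request):
    queue.append(request)              # returns immediately, no waiting

def device_dispatch(queue):
    """Complete all queued requests, reordering only within barrier runs."""
    completed, run = [], []
    while queue:
        req = queue.popleft()
        if req == "BARRIER":
            completed.extend(sorted(run))  # any order inside a run is legal
            run = []
        else:
            run.append(req)
    completed.extend(sorted(run))
    return completed

q = deque()
for r in ["w2", "w1", "BARRIER", "w3"]:
    issue(q, r)                        # the filesystem keeps issuing happily

done = device_dispatch(q)
# Everything issued before the barrier lands before everything after it.
assert done.index("w3") > done.index("w1")
assert done.index("w3") > done.index("w2")
```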

> Also, whether or not the host adapter or the controller write
> cache is disabled, 290MB will fit entirely inside most recent
> hosts' RAM, and even adding 'sync' at the end will not help
> much towards a meaningful comparison.

Okay, so dropping the caches would be required. I got that in the meantime.

> > My initial thoughts were that write barriers would enhance
> > performance, in that, you could have write cache on.
>
> Well, that all depends on whether the write caches (in the host
> adapter or the controller) are persistent and how frequently
> barriers are issued.
>
> If the write caches are not persistent (at least for a while),
> the hard disk controller or the host adapter cannot have more
> than one barrier completion request in flight at a time, and if
> a barrier completion is requested every 11KB that will be pretty
> constraining.

Hmmm, I didn't know that. How come? But the IO scheduler should be able 
to handle more than one barrier request at a time, shouldn't it? And 
even then, how can writing 11 KB at a time be slower than completing 
every single IO request synchronously - i.e. with the write cache *off*?

> Barriers are much more useful when the host adapter or the disk
> controller can cache multiple transactions and then execute them
> in the order in which barriers have been issued, so that the
> host can pipeline transactions down to the last stage in the
> chain, instead of operating the last stages synchronously or
> semi-synchronously.
>
> But talking about barriers in the context of metadata, and for a
> "benchmark" which has a metadata barrier every 11KB, and without
> knowing whether the storage subsystem can queue multiple barrier
> operations, seems pretty crass and meaningless, if not
> misleading. A waste of time at best.

Hmmm, as far as I understood it, the IO scheduler would handle barrier 
requests itself if the device was not capable of queuing and ordering 
requests.

The only thing that occurs to me now is that with barriers off, the 
scheduler has more freedom to reorder requests, and that might matter 
for this metadata-intensive workload. With barriers it can only reorder 
within 11 KB worth of requests; without them it could reorder as much as 
it wants. But even then the filesystem would have to make sure that 
metadata changes land in the journal first and only then in place - and 
that would involve a sync, if no barrier request was possible.

So I still don't get why even the metadata-intensive workload of tar -xf 
linux-2.6.27.tar.bz2 - or better, bzip2 -d the archive first - should be 
slower with barriers and write cache on than with no barriers and write 
cache off.
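
The "journal first, then in place, with a sync in between" ordering can 
be sketched from user space roughly like this (a hedged illustration 
with hypothetical file names and helper; a real journaling filesystem 
enforces this inside the kernel with barriers or cache flushes, not with 
per-update fsync):

```python
import os

def journaled_update(journal_path, data_path, record, payload):
    # Step 1: the journal record must be durable first ...
    with open(journal_path, "ab") as j:
        j.write(record)
        j.flush()
        os.fsync(j.fileno())   # the explicit sync that stands in for a barrier
    # Step 2: ... only then may the in-place write proceed.
    with open(data_path, "ab") as d:
        d.write(payload)
        d.flush()
        os.fsync(d.fileno())
```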

> > So its really more of an expectation that wc+barriers on,
> > performs better than wc+barriers off :)
>
> This is of course a misstatement: perhaps you intended to write
> that ''wc on + barriers on'' would perform better than ''wc off +
> barriers off''.
>
> As to this apparent anomaly, I am only mildly surprised, as
> there are plenty of similar anomalies (why ever should one need a
> very large block device readahead to get decent performance from
> MD block devices?), due to ill-conceived schemes in all
> sorts of stages of the storage chain, from the sometimes
> comically misguided misdesigns in the Linux block cache or
> elevators or storage drivers, to the often even worse
> "optimizations" embedded in the firmware of host adapters and
> hard disk controllers.

Well, then that is something that could potentially be fixed!

> Consider for example (and also as a hint towards less futile and
> meaningless "benchmarks") the 'no-fsync' option of 'star', the
> reasons for its existence and for the Linux related advice:
>
>   http://gd.tuwien.ac.at/utils/schilling/man/star.html
>
>     «-no-fsync
>           Do not call  fsync(2)  for  each  file  that  has  been
>           extracted  from  the archive. Using -no-fsync may speed
>           up extraction on operating systems with slow  file  I/O
>           (such  as  Linux),  but includes the risk that star may
>           not be able to detect extraction  problems  that  occur
>           after  the  call to close(2).»
>
> Now ask yourself if you know whether GNU tar does 'fsync' or not
> (a rather interesting detail, and the reasons why may also be
> interesting...).

Talking about less futile benchmarks while citing the manpage of a tool 
from an author who is known as a Solaris advocate appears a bit futile 
in itself to me - especially since the author tends to chime in on any 
discussion mentioning his name and, at least in my experience, is very 
difficult to talk with in a constructive manner.

For me it is important to find out whether there is reason to look in 
more detail at how efficiently write barriers work on Linux. For that, 
as I mentioned already [1], testing just this simple workload would not 
be enough - and neither would testing only on XFS.

I think this is neither useless nor futile. The simplified benchmark IMHO 
has shown something that deserves further investigation. Nothing more, 
nothing less.
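
For comparison, the per-file fsync that star's -no-fsync option disables 
could be approximated with a short script (a sketch only - the function 
name and paths are hypothetical, and whether GNU tar syncs at all is 
exactly the open question above):

```python
import os
import tarfile

def extract_with_fsync(archive, dest, fsync=True):
    """Extract a tar archive, optionally fsync'ing each regular file."""
    with tarfile.open(archive) as tar:
        for member in tar:
            tar.extract(member, path=dest)
            if fsync and member.isfile():
                # One flush per file: this per-file wait is the cost
                # that star's -no-fsync option avoids.
                fd = os.open(os.path.join(dest, member.name), os.O_RDONLY)
                try:
                    os.fsync(fd)
                finally:
                    os.close(fd)
```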

[1] http://oss.sgi.com/archives/xfs/2008-12/msg00244.html

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

