[ ... ]
>>>>> We are using Amazon EC2 instances.
>>>>> [ ... ] one of the worst possible platforms for XFS.
>>>> I don't agree with you there. If the workload works best on
>>>> XFS, it doesn't matter what the underlying storage device
>>>> is. e.g. if it's a fsync heavy workload, it will still
>>>> perform better on XFS on EC2 than btrfs on EC2...
>> There are special cases, but «fsync heavy» is a bit of bad
> It's actually a really good example of where XFS will be
> better than other filesystems.
But this is better at being less bad. Because we are talking here
about «fsync heavy» workloads on a VM, and these should not be
run on a VM if performance matters. That's why I wrote about a
«bad example» on which to discuss XFS for a VM.
But even with «fsync heavy» workloads in general your argument is
not exactly appropriate:
> Why? Because XFS does less log IO due to aggregation of log
> writes during concurrent fsyncs.
But «fsync heavy» does not necessarily mean «concurrent fsyncs»,
for me it typically means logging or database apps where every
'write' is 'fsync'ed, even if there is a single thread. But let's
imagine for a moment we were talking about the special case where
«fsync heavy» involves a high degree of concurrency.
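To make concrete what I mean by the single-threaded case, here is a minimal sketch in Python of a logger that commits every record. The function name and the file layout are made up for illustration; only 'os.write' and 'os.fsync' are real calls. With one thread there is never a second 'fsync' in flight, so there is nothing to aggregate with, and the per-record latency is whatever the filesystem and the storage layer impose.

```python
import os
import tempfile
import time


def append_durably(path, records):
    """Append each record and fsync after every single write (one thread).

    Returns the list of per-record write+fsync latencies in seconds.
    """
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for record in records:
            start = time.perf_counter()
            os.write(fd, record)
            # The caller blocks here until the kernel reports the flush done.
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        lat = append_durably(os.path.join(tmpdir, "journal.log"),
                             [b"entry\n"] * 100)
        print("mean write+fsync latency: %.3f ms"
              % (1000 * sum(lat) / len(lat)))
```

The numbers this prints depend entirely on the machine and the storage underneath, which is rather the point of the discussion.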
> The more latency there is on a log write, the more aggregation
> that occurs.
This seems to describe hardcoding in XFS a decision to trade
worse latency for better throughput, understandable as XFS was
after all quite clearly aimed at high throughput (or isochronous
throughput), rather than low latency (except for metadata, and
that has been "fixed" with 'delaylog').
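The tradeoff can be illustrated with a toy model, which is emphatically not the XFS log code: I assume requests arrive at a fixed interval, that only one log write is in flight at a time, and that every request that has arrived by the time the next log write starts is folded into it. All the names below are hypothetical and time is in arbitrary integer units.

```python
def simulate_group_commit(arrival_interval, log_write_latency, n_requests):
    """Toy model of commit aggregation under a single in-flight log write.

    Returns (number of log writes, mean requests per log write,
             mean time a request waits until its log write completes).
    """
    arrivals = [i * arrival_interval for i in range(n_requests)]
    log_writes = 0
    total_wait = 0
    device_free_at = 0
    i = 0
    while i < n_requests:
        # Next write starts once the device is idle and a request is pending.
        start = max(device_free_at, arrivals[i])
        done = start + log_write_latency
        # Every request that has arrived by 'start' joins this log write.
        j = i
        while j < n_requests and arrivals[j] <= start:
            total_wait += done - arrivals[j]
            j += 1
        log_writes += 1
        device_free_at = done
        i = j
    return log_writes, n_requests / log_writes, total_wait / n_requests


if __name__ == "__main__":
    for latency in (1, 4, 16):
        writes, batch, wait = simulate_group_commit(1, latency, 1000)
        print("latency=%2d  log writes=%4d  mean batch=%.2f  mean wait=%.2f"
              % (latency, writes, batch, wait))
```

Under these assumptions a slower log write does produce fewer and larger log writes, but each individual request also waits longer for its commit, which is the latency side of the bargain.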
Unless you mean that if the latency is low, then aggregation does
not take place, but then it is hard for me to see how that can be
*predicted*. I am assuming that in the above you refer to:
the XFS transaction subsystem is that most transactions are
asynchronous. That is, they don't commit to disk until either a
log buffer is filled (a log buffer can hold multiple transactions)
or a synchronous operation forces the log buffers holding the
transactions to disk. This means that XFS is doing aggregation of
transactions in memory - batching them, if you like - to minimise
the impact of the log IO on transaction throughput.
The delaylog mount option also improves sustained metadata
modification performance by reducing the number of changes to the
log. It achieves this by aggregating individual changes in memory
before writing them to the log: frequently modified metadata is
written to the log periodically instead of on every modification.
This option increases the memory usage of tracking dirty metadata
and increases the potential lost operations when a crash occurs,
but can improve metadata modification speed and scalability by an
order of magnitude or more. Use of this option does not reduce
data or metadata integrity when fsync, fdatasync or sync are used
to ensure data and metadata is written to disk.
BTW curious note in the latter:
However, under fsync-heavy workloads, small log buffers can be
noticeably faster than large buffers with a large stripe unit
> On a platform where the IO subsystem is going to give you
> unpredictable IO latencies, that's exactly what you want.
This then is the argument that on platforms with bad latency that
decision still works well because then you might as well go
But if someone really aims to run some kind of «fsync heavy»
workload on a high-latency and highly-variable latency VM,
usually their aim is to *minimize* the additional latency the
filesystem imposes, because «fsync heavy» workloads tend to be
transactional, and persisting data without delay is part of their
> Sure, it was designed to optimise spinning rust performance,
> but that same design is also optimal for virtual devices with
> unpredictable IO latency...
Ahhhh, now the «bad example» has become a worse one :-).
The argument you are making here is one for crass layering
violation: that the filesystem code should embed storage-layer
specific optimizations within it, and then one might get lucky
with other storage layers of similar profile. Tsk tsk :-). At
least it is not as breathtakingly inane as putting plug/unplug
block io subsystem.
But even on spinning rust, and on a real host, and even forgiving
the layering violation, I question the aim to get better
throughput at the expense of worse latency for «fsync heavy»
loads, and even for the type of workloads for which this tradeoff
Because *my* argument is that how often 'fsync' "happens" should
be a decision by the application programmer; if they want higher
throughput at the cost of higher latency, they should issue it
less frequently, as 'fsync' should be executed with as low a
latency as possible.
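As a sketch of what I mean by leaving that decision to the application, assume a hypothetical helper with a made-up 'fsync_every' knob: the programmer who wants throughput sets it higher and accepts that up to that many records can be lost in a crash, while the one who wants every record committed leaves it at 1 and expects each 'fsync' to return as quickly as the storage allows.

```python
import os
import tempfile


def append_records(path, records, fsync_every=1):
    """Append records, issuing one fsync per 'fsync_every' records.

    Returns the number of fsync calls issued. A final fsync covers any
    records left over after the last full batch.
    """
    if fsync_every < 1:
        raise ValueError("fsync_every must be at least 1")
    fsyncs = 0
    pending = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for record in records:
            os.write(fd, record)
            pending += 1
            if pending >= fsync_every:
                os.fsync(fd)  # application-chosen commit point
                fsyncs += 1
                pending = 0
        if pending:
            os.fsync(fd)  # commit the trailing partial batch
            fsyncs += 1
    finally:
        os.close(fd)
    return fsyncs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        for every in (1, 4):
            path = os.path.join(tmpdir, "log-%d" % every)
            count = append_records(path, [b"entry\n"] * 10, fsync_every=every)
            print("fsync_every=%d -> %d fsync calls" % (every, count))
```

Here the batching is visible in the application code and chosen per workload, instead of being decided inside the filesystem on the application's behalf.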
Your underlying argument for XFS and its handling of «fsync
heavy» workloads (and it is the same argument for 'delaylog' I
guess) seems to me that applications issue 'fsync' too often, and
thus we can briefly hold them back to bunch them up, and people
like the extra throughput more than they dislike the extra latency.
Which reminds me of a discussion I had some time ago with some
misguided person who argued that 'fsync' and Linux barriers only
require ordering constraints, and don't imply any actual writing
to persistent storage, or within any specific timeframe, where
instead I was persuaded that their main purpose (no matter what
POSIX says :->) is to commit to persistent storage as quickly as
It looks like XFS has gone more the way of something like
his position, because admittedly in practice keeping commits a
bit looser does deliver better throughput (hints of O_PONIES
But again, that's not what should be happening. Perhaps POSIX
should have provided :-) two barrier operations, a purely
ordering one, and a commit-now one. And application writers would
use them at the right times. And ponies for everybody :-).
>> In general file system designs are not at all independent of
>> the expected storage platform, and some designs are far better
>> than others for specific storage platforms, and vice versa.
> Sure, but filesystems also have inherent capabilities that are
> independent of the underlying storage.
But the example you make is not a «capability», it is the
hardcoded assumption that it is better to trade worse latency for
better throughput, which only makes sense for workloads that
don't want tight latency, or else storage layers that don't
> In these cases, the underlying storage really doesn't matter if
> the filesystem can't do what the application needs. Allocation
> parallelism, CPU parallelism, minimal concurrent fsync latency,
But you seemed to be describing above that XFS is good at "maximal
concurrent fsync throughput" by disregarding «minimal concurrent
fsync latency» (as in «less log IO due to aggregation of log
writes during concurrent fsyncs. The more latency there is on a
log write, the more aggregation»).
> etc are all characteristics of filesystems that are independent
> of the underlying storage.
Ahhhh, but this is a totally different argument from embedding
specific latency/throughput tradeoffs in the storage layer.
This is an argument that a well designed filesystem that does not
have bottlenecks on any aspect of the performance envelope is a
good general purpose one. Well, you can try to design one :-).
XFS comes close, like JFS and OCFS2, but it does have, as you
have pointed out above, workload-specific (which can turn into
storage-friendly) tradeoffs. And since Red Hat's acquisition of
GlusterFS I guess (or at least I hope) that XFS will be even more
central to their strategy.
BTW as to that, did a brief search and found this amusing
article, yet another proof that reality surpasses imagination:
Ah I was totally unaware of the AWS Compute Cluster service.
> If you need those characteristics in your remotely hosted VMs,
> then XFS is what you want regardless of how much storage
> capability you buy for those VMs....
Possibly, but also from a practical viewpoint that is again a
moderately bizarre argument, because workloads requiring high
levels of «Allocation parallelism, CPU parallelism, minimal
concurrent fsync latency» beg to be run on an Altix, or similar,
not on a bunch of random EC2 shared hosts running Xen VMs.