
Re: raid10n2/xfs setup guidance on write-cache/barrier

To: Linux RAID <linux-raid@xxxxxxxxxxxxxxx>, Linux fs XFS <xfs@xxxxxxxxxxx>
Subject: Re: raid10n2/xfs setup guidance on write-cache/barrier
From: pg@xxxxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Sat, 17 Mar 2012 15:35:06 +0000
In-reply-to: <CAA8mOyC-xCcNxQFz-M1_TfnxcGBmpUdQ57_vYNVF11jhWK3SSA@xxxxxxxxxxxxxx>
References: <CAA8mOyDKrWg0QUEHxcD4ocXXD42nJu0TG+sXjC4j2RsigHTcmw@xxxxxxxxxxxxxx> <4F61803A.60009@xxxxxxxxxxxxxxxxx> <CAA8mOyCzs36YD_QUMq25HQf8zuq1=tmSTPjYdoFJwy2Oq9sLmw@xxxxxxxxxxxxxx> <20321.63389.586851.689070@xxxxxxxxxxxxxxxxxx> <CAA8mOyCUCyzGCuNtn4txFuBtbk22M6LRpShNVH_Szb5=_F1PBw@xxxxxxxxxxxxxx> <20322.29849.917554.794740@xxxxxxxxxxxxxxxxxx> <CAA8mOyC-xCcNxQFz-M1_TfnxcGBmpUdQ57_vYNVF11jhWK3SSA@xxxxxxxxxxxxxx>
[ ... ]

> I've read a recommendation to start the partition on the 1MB
> mark. Does this make sense?

As a general principle it is a good idea, and it has almost no
cost. Indeed, recent versions of some partitioning tools do that
by default.

I often recommend aligning partitions to 1GiB, also because I
like to have 1GiB or so of empty space at the very beginning and
end of a drive.
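
The alignment arithmetic is simple enough to sketch. The snippet
below is only an illustration, assuming 512-byte logical sectors
(an assumption, check your drive); the names `align_up` and
`is_aligned` are made up for this example and belong to no
partitioning tool.

```python
SECTOR_BYTES = 512  # assumed logical sector size; verify for your drive
MIB = 1024 ** 2
GIB = 1024 ** 3


def align_up(sector: int, alignment_bytes: int,
             sector_bytes: int = SECTOR_BYTES) -> int:
    """Return the first sector at or after `sector` on an alignment boundary."""
    if alignment_bytes % sector_bytes != 0:
        raise ValueError("alignment must be a whole number of sectors")
    sectors_per_unit = alignment_bytes // sector_bytes
    # Ceiling division, then scale back to a sector number.
    return -(-sector // sectors_per_unit) * sectors_per_unit


def is_aligned(sector: int, alignment_bytes: int,
               sector_bytes: int = SECTOR_BYTES) -> bool:
    """True if the byte offset of `sector` is a multiple of the alignment."""
    return (sector * sector_bytes) % alignment_bytes == 0


if __name__ == "__main__":
    # The traditional DOS-style start at sector 63 moves to the 1MiB mark.
    print(align_up(63, MIB))
    # Aligning to 1GiB also leaves roughly 1GiB free at the start.
    print(align_up(1, GIB))
```

With 512-byte sectors the 1MiB mark is sector 2048 and the 1GiB
mark is sector 2097152.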

> I'd like to read about the NFS blog entry but the link you
> included results in a 404.  I forgot to mention in my last
> reply.

Oops I forgot a bit of the URL:

Note that currently I suggest different values from:

 «vm/dirty_ratio                  =4
  vm/dirty_background_ratio       =2»


  * 4% of memory "dirty" today is often a gigantic amount.
    I had provided an elegant patch to specify the same in
    absolute terms in
    but now the official way is the "_bytes" alternative.

  * 2% as the level at which writing becomes uncached is too
    low, and the system becomes unresponsive when that level is
    crossed. Sure, raising it is risky, but, regretfully, I think
    that maintaining responsiveness is usually better than
    limiting outstanding background writes.
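
To see why a percentage can be "gigantic", here is a small sketch
that converts a ratio into bytes and formats the "_bytes" settings.
It approximates the memory the kernel applies the ratio to by total
RAM, which is a simplification. The 400MiB and 100MiB figures are
placeholders chosen for illustration, not recommended values, and
the function names are invented for this example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def ratio_to_bytes(total_ram_bytes: int, percent: int) -> int:
    """Approximate how many bytes a vm/dirty_*ratio percentage allows."""
    return total_ram_bytes * percent // 100


def sysctl_lines(dirty_bytes: int, background_bytes: int) -> str:
    """Format absolute limits in the same style as the ratio settings."""
    if background_bytes >= dirty_bytes:
        raise ValueError("background limit should be below the hard limit")
    return (f"vm/dirty_bytes = {dirty_bytes}\n"
            f"vm/dirty_background_bytes = {background_bytes}\n")


if __name__ == "__main__":
    # On a hypothetical 100GiB machine, 4% is 4GiB of dirty pages.
    print(ratio_to_bytes(100 * GIB, 4) / GIB, "GiB")
    # Placeholder absolute limits, for illustration only.
    print(sysctl_lines(400 * MIB, 100 * MIB), end="")
```

Setting one of the "_bytes" variables makes the kernel ignore the
corresponding "_ratio" variable, so only one form of each pair
needs to be specified.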

> Based on what I understood from your thoughts above, if an
> applications issues a flush/fsync and it does not complete due
> to some catastrophic crash, xfs on its own can not roll back
> to the prev version of the file in case of unfinished write
> operation. disabling the device caches wouldn't help either
> right?

If your goal is to make sure incomplete updates don't get
persisted, disabling device caches might help with that, in a
very perverse way (if the whole partial update is still in the
device cache, it just vanishes). Forget that of course :-).

The main message is that filesystems in UNIX-like systems should
not provide atomic transactions, just the means to do them at
the applications level, because they are both difficult and very

The secondary message is that some applications and the firmware
of some host adapters and drives don't do the right thing, and
if you really want to make sure about atomic transactions it is
an expensive and difficult system integration challenge.
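
As an illustration of "the means to do them at the applications
level", the usual idiom for replacing a file atomically is to write
a temporary file, fsync it, rename it over the old name, and fsync
the directory. A minimal sketch follows, assuming a POSIX system
and that every layer below honors fsync (the very thing that cannot
be taken for granted). The name `atomic_write` is invented for this
example.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # for the rename to be atomic. mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data reaches storage before the rename
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

After a crash, the file holds either the complete old version or
the complete new one, provided the storage stack does the right
thing with the flushes.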

> [ ... ] only filesystems that do COW can do this at the
> expense of performance? (btrfs and zfs, please hurry and grow
> up!)

Filesystems that do COW sort-of do *global* "rolling" updates,
that is filetree-level snapshots, but that's a side effect of a
choice made for other reasons (consistency more than currency).

> [ ... ] If you were in my place with the resource constraints,
> you'd go with: xfs with barriers on top of mdraid10 with
> device cache ON and setting vm/dirty_bytes, [ ... ]

Yes, that seems a reasonable overall tradeoff, because XFS is
implemented to provide well defined (and documented) semantics,
to check whether the underlying storage layer actually does
barriers, and to perform decently even if "delayed" writing is
not that delayed.
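
A quick sanity check that nobody has turned barriers off by mount
option can be sketched as below. It only parses text in the
/proc/mounts format and looks for the XFS "nobarrier" option (as
far as I know, kernels much newer than the ones discussed here no
longer accept it). It says nothing about whether the devices
actually honor flushes. The function name is invented for this
example.

```python
from typing import List


def xfs_mounts_without_barriers(mounts_text: str) -> List[str]:
    """Return mount points of XFS filesystems mounted with 'nobarrier'."""
    offenders = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "xfs" and "nobarrier" in options.split(","):
            offenders.append(mount_point)
    return offenders


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        for mount_point in xfs_mounts_without_barriers(mounts.read()):
            print("barriers disabled on", mount_point)
```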
