Re: zero size file after power failure with kernel

To: Linux XFS <xfs@xxxxxxxxxxx>
Subject: Re: zero size file after power failure with kernel
From: pg_xf2@xxxxxxxxxxxxxx (Peter Grandi)
Date: Tue, 1 Sep 2009 10:32:46 +0000
In-reply-to: <200909010918.37886@xxxxxx>
References: <200908292102.21710@xxxxxx> <4A99A80C.9010307@xxxxxxxxxxx> <19100.22644.149019.555685@xxxxxxxxxxxxxxxxxx> <200909010918.37886@xxxxxx>
[ ... ]

>> Then 'mount' with '-o sync' [ ... ]

> Yes. I could also simply switch back to reiserfs, where I
> never had this kind of issue, despite lots of crashes etc.

Other people have a very different impression. Like 'ext3',
ReiserFS does ordered writes, but those don't necessarily help
because of the colossal amount of buffering that happens anyhow.

> [ ... ] maybe someone taps into the problem and can improve
> it.

It is foremost an application problem, and then a block layer
problem. The first is unsolvable ("user space sucks") in our
lifetimes, and the second depends on the goodwill of the
proprietors of the relevant kernel subsystem.

As to application design, XFS is targeted at heavily parallel
workloads on large storage arrays; its design takes advantage of
what API semantics permit to improve that use case, and relies
on applications making use of those API semantics properly.
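As a minimal sketch of what "making use of those API semantics
properly" can look like for a small file such as a config: write a
temporary file, 'fsync' it, rename it over the old one, then 'fsync'
the directory. The function name and the temporary-file prefix are
made up for illustration, and this is one common pattern, not
something taken from any particular application.

```python
import os
import tempfile


def write_file_durably(path, data):
    """Replace `path` with `data` (bytes) via write, fsync, rename.

    Hypothetical helper: after a crash the file should hold either the
    old or the new content, assuming the storage honours flushes.
    Note that mkstemp creates the new file with mode 0600.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # userspace buffer -> kernel page cache
            os.fsync(tmp.fileno())  # page cache -> storage device
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so that the rename itself is on disk too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The two 'fsync' calls are the point: without them the rename can reach
the disk before the data does, which is how a zero size file appears
after a power failure.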

If that and having good scalable performance at the same time
requires having dual power supplies, redundant storage paths,
and battery backup, that is the typical platform on which XFS is
deployed.

> There was a similar problem with the change from ext3 to ext4,
> with a big discussion. Ext4 has been improved,

Actually it has been made worse, to compensate for bad
application and block layer behaviour.

Red Hat with 'ext4' have been trying to imply that an in-place
upgrade to an 'ext3' compatible filesystem can support every
possible point on the spectrum. Well, it turned out that they
cannot. So there have been motions towards supporting XFS in
5.4, to have a dual-filesystem strategy, which is what a large
number of their important enterprise customers do anyhow.

> [ ... ] In ext4 they reorganized the way metaupdates are done,
> maybe that can help xfs too.

But that makes performance worse in the large/parallel case.

> [ ... ] It seems kmail writes its config every 7 minutes, so
> it is vulnerable for 3 seconds then.

That won't help that much. Apps and the block layer are really
designed for older, gentler times. And never mind the clueless,
moronic "optimization" of Linux block layer plugging/unplugging.

Currently a single disk can write 100MB/s, and memory sizes on many
_laptops_ are 4GB with potentially 1-2GB or 10-20s of writes
cached. On a server one can have RAIDs that can write at/s.

If applications and the block layer are misbehaving, and '-o
sync' is not used, even if one flushes cache every second, there
can still be dozens of MB (on a laptop) to some GB (on a server)
that get lost in that one second. The filesystem can try hard to
ensure that metadata gets written nearly immediately, ensuring
'fsck'-consistency, but it cannot do that for data in any
sensible way unless the application and the block layer do the
right thing, so data persistency is at best elusive.
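The arithmetic behind those figures can be checked with a small
sketch. The function names are invented for illustration, decimal
units are assumed, and the write rate is the 100MB/s single-disk
figure given above.

```python
MB = 10**6
GB = 10**9


def seconds_to_drain(cached_bytes, write_rate_bytes_per_s):
    """Time needed to write out a backlog of cached (dirty) data."""
    return cached_bytes / write_rate_bytes_per_s


def bytes_at_risk(write_rate_bytes_per_s, flush_interval_s):
    """Upper bound on data written since the last flush, if the writer
    keeps the device saturated for the whole interval."""
    return write_rate_bytes_per_s * flush_interval_s


disk_rate = 100 * MB  # single disk, as quoted in the text

for cached in (1 * GB, 2 * GB):
    secs = seconds_to_drain(cached, disk_rate)
    print(f"{cached // GB}GB cached -> {secs:.0f}s of writes")

lost = bytes_at_risk(disk_rate, flush_interval_s=1)
print(f"1s flush interval -> up to {lost // MB}MB unwritten")
```

So 1-2GB of cached writes is 10-20s of backlog on such a disk, and
even a one second flush interval leaves up to 100MB exposed if the
disk is kept busy throughout.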

> I've set vm.dirty_expire_centisecs = 1000 now to improve the
> situation a bit.

It does not help that the applications and the block layer are
not only misdesigned, but were also designed for a time when
data rates were a lot lower, so outstanding updates were bounded
a lot lower.

There are workarounds, and by careful patching and changing
default settings one can palliate the worst situations; but for
example 10 seconds of 'dirty_expire_centisecs' seems way too
long (IIRC you have a fairly large memory and RAID) and other
settings matter more.

I have written quite a bit in my blog about these issues, and
you may find this particular entry rather relevant:


In general on a fast machine I would use:

  vm/dirty_ratio                  =4
  vm/dirty_background_ratio       =2
  vm/dirty_expire_centisecs       =400
  vm/dirty_writeback_centisecs    =200

or half of every one. Short flushing times also ensure more
continuous flushing (without huge periodic gulps), which can
significantly improve *write* performance for streaming
applications (XFS etc. delayed allocation is designed to improve
read performance despite the lack of preallocation).
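For comparing a machine against the values above, a rough sketch
follows. It assumes the usual Linux layout where each setting is a
file under /proc/sys; the helper names and the 'proc_root' parameter
are invented for illustration, and the "halved" variant is just the
"half of every one" option.

```python
import os

# Values suggested in the text for a fast machine.
SUGGESTED = {
    "vm/dirty_ratio": 4,
    "vm/dirty_background_ratio": 2,
    "vm/dirty_expire_centisecs": 400,
    "vm/dirty_writeback_centisecs": 200,
}


def halved(settings):
    """The 'half of every one' variant, never going below 1."""
    return {key: max(1, value // 2) for key, value in settings.items()}


def read_current(proc_root="/proc/sys"):
    """Read the current values; None where a file is missing or unreadable."""
    current = {}
    for key in SUGGESTED:
        try:
            with open(os.path.join(proc_root, key)) as f:
                current[key] = int(f.read().strip())
        except (OSError, ValueError):
            current[key] = None
    return current


def sysctl_conf_lines(settings):
    """Render settings as 'name = value' lines with dotted names."""
    return [f"{key.replace('/', '.')} = {value}"
            for key, value in settings.items()]


if __name__ == "__main__":
    current = read_current()
    for key, wanted in SUGGESTED.items():
        print(f"{key}: current={current[key]} suggested={wanted}")
    print("\n".join(sysctl_conf_lines(SUGGESTED)))
```

The script only reads and prints; applying the values is left to
whatever mechanism the distribution uses for 'sysctl' settings.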

This cannot be done on laptops, where short flushing times are
bad for power consumption, but at least they are battery backed,
and hopefully SSDs will save us anyhow.
