Questions about XFS
Ric Wheeler
rwheeler at redhat.com
Tue Jun 11 12:19:53 CDT 2013
On 06/11/2013 12:12 PM, Steve Bergman wrote:
> In #5 I was specifically talking about ext4. After the 2009 brouhaha
> over zero-length files in ext4 with delayed allocation turned on, Ted
> merged some patches into vanilla kernel 2.6.30 which mitigated the
> problem by recognizing certain common idioms and automatically
> forcing an fsync. I'd heard that the XFS team modeled a set of XFS
> patches on them.
>
> Regarding #4, I have 12 years' experience with my workloads on ext3 and
> 3 years on ext4 and know what I have observed. As a practical matter,
> there are large differences between filesystem behaviors which aren't
> up for debate since I know my workloads' behavior in the real world
> far better than anyone else possibly could. (In fact, I'm not sure how
> anyone else could presume to know how my workloads and filesystems
> interact.) But if I understand correctly, ext4 at default settings
> journals metadata and commits it every 5s, while flushing data every
> 30s. Ext3 journals metadata, and commits it every 5 seconds, while
> effectively flushing data, *immediately before the metadata*, every 5
> seconds, so the window in which data and metadata are not in sync is
> vanishingly small. Are you saying that with XFS there is no periodic
> flushing mechanism at all? And that unless there's an
> fsync/fdatasync/sync or the memory needs to be reclaimed, that it can
> sit in the page cache forever?
I think that you are still missing the bigger point.

Periodic fsync() - done magically under the covers by the file system - does not
provide any useful data integrity for any serious application.

Let's take a simple example - a database app that does, say, 30 transactions/sec.
In your example, you are extremely likely to lose up to just shy of 5 seconds of
"committed" data - way over 100 transactions! That can be *really* serious
amounts of data and translate into large financial loss.

In a second example, let's say you are copying data to disk (say, a movie) at a
rate of 50 MB/second. When the power cut hits at just the wrong time, you will
have lost a large chunk of that data that has been "written" to disk (over 200MB).
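To put rough numbers on the two examples above, here is a small sketch of the arithmetic. It assumes the 5-second commit interval discussed in this thread and treats the loss as a simple upper bound (rate times interval); the function and variable names are made up for illustration.

```python
# Worst-case loss when data only reaches disk via a periodic commit.
# The rates and the 5-second interval are the figures used in this thread;
# the helper name is hypothetical.

def worst_case_loss(rate_per_second, commit_interval_seconds):
    """Upper bound on what was 'written' but not yet flushed at power loss."""
    return rate_per_second * commit_interval_seconds

COMMIT_INTERVAL = 5  # seconds between periodic commits

lost_transactions = worst_case_loss(30, COMMIT_INTERVAL)  # 30 transactions/sec
lost_megabytes = worst_case_loss(50, COMMIT_INTERVAL)     # 50 MB/sec

print(lost_transactions, "transactions at risk")
print(lost_megabytes, "MB at risk")
```

The real loss is just under these bounds, since the crash can land at any point inside the interval.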
You won't get any serious file system or storage person to go out on a limb on
this kind of "it mostly kind of works" type of scenario. It just does not cut it
in the enterprise world.
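For contrast with relying on periodic flushing, here is a minimal sketch of what an application can do itself when it rewrites a file such as a config file: write a temporary file, fsync() it, rename it over the old one, then fsync() the directory. The function name durable_replace is hypothetical, the code assumes a POSIX system, and it only helps if the storage stack below honors cache flushes.

```python
import os
import tempfile

def durable_replace(path, data):
    """Replace `path` with `data` (bytes) so a crash leaves the old or the new file.

    Illustrative sketch only: the temp file gets mkstemp's default 0600 mode,
    so a real tool would also copy ownership and permissions.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # move Python's buffer into the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push the data to storage
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)      # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself survives a crash
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling durable_replace("/etc/example.conf", b"...") (the path is only an example) returns once the new contents and the rename have both been handed to stable storage, rather than whenever the next periodic flush happens to run.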
Hope this is helpful :)
Ric
>
> One thing is puzzling me. Everyone is telling me that I must ensure
> that fsync/fdatasync is used, even in environments where the concept
> doesn't exist. So I've gone to find good examples of how it is used.
> Since RHEL6 has been shipping with ext4 as the default for over 2.5
> years, I figured it would be a great place to find examples. However,
> I've been unable to find examples of fsync or fdatasync being used,
> when using "strace -o file.out -f" on various system programs which
> one would very much expect to use it. We talked about some Python
> config utilities the other day. But now I've moved on to C and C++
> code. e.g. "cupsd" copy/truncate/writes the config file
> "/etc/cups/printers.conf" quite frequently, all day long. But there is
> no sign whatsoever of any fsync or fdatasync when I grep the strace
> output file for those strings case insensitively. (And indeed, a
> complex printers.conf file turned up zero-length on one of my RHEL6.4
> boxes last week.)
>
> So I figured that when rpm installs a new vmlinuz, builds a new
> initramfs and puts it into place, and modifies grub.conf, that surely
> proper sync'ing must be done in this particularly critical case. But
> while I do see rpm fsync/fdatasync'ing its own database files, it never
> seems to fsync/fdatasync the critical system files it just installed
> and/or modified. Surely, after over 2 1/2 years of Red Hat shipping
> RHEL6 to customers, I must be mistaken in some way. Could you point me
> to an example in RHEL6.4 where I can see clearly how fsync is being
> properly used? In the meantime, I'll keep looking.
>
>
> Thanks,
> Steve
>
>
>
> On Tue, Jun 11, 2013 at 8:59 AM, Ric Wheeler <rwheeler at redhat.com> wrote:
>> On 06/11/2013 05:56 AM, Steve Bergman wrote:
>>> 4. From the time I write() a bit of data, what's the maximum time before
>>> the
>>> data is actually committed to disk?
>>>
>>> 5. Ext4 provides some automatic fsync'ing to avoid the zero-length file
>>> issue for some common cases via the auto_da_alloc feature added in kernel
>>> 2.6.30. Does XFS have similar behavior?
>>
>> I think that here you are talking more about ext3 than ext4.
>>
>> The answer to both of these - even for ext4 or ext3 - is that unless your
>> application and storage are all properly configured, you are effectively at
>> risk indefinitely. Chris Mason did a study years ago where he was able to
>> demonstrate that dirty data could get pinned in a disk cache effectively
>> indefinitely. Only an fsync() would push that out.
>>
>> Applications need to use the data integrity hooks in order to have a
>> reliable promise that application data is crash safe. Jeff Moyer wrote up a
>> really nice overview of this for lwn which you can find here:
>>
>> http://lwn.net/Articles/457667
>>
>> That said, if you have applications that do not do any of this, you can roll
>> the dice and use a file system like ext3 that will periodically push data
>> out of the page cache for you.
>>
>> Note that without the barrier mount option, that is not sufficient to push
>> data to platter; it just moves it down the line to the next potentially
>> volatile cache :) Even then, 4 out of every 5 seconds, your application
>> will be certain to lose data if the box crashes while it is writing data.
>> Lots of applications don't actually use the file system much (or write
>> much), so ext3's sync behaviour helped mask poorly written applications
>> pretty effectively for quite a while.
>>
>> There really is no short cut to doing the job right - your applications need
>> to use the correct calls and we all need to configure the file and storage
>> stack correctly.
>>
>> Thanks!
>>
>> Ric