GNU 'tar', Schilling's 'tar', write-cache/barrier

To: Linux fs XFS <xfs@xxxxxxxxxxx>
Subject: GNU 'tar', Schilling's 'tar', write-cache/barrier
From: pg_xf2@xxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Sat, 24 Mar 2012 16:27:19 +0000
In-reply-to: <20333.8944.573177.821944@xxxxxxxxxxxxxxxxxx>
References: <CAA8mOyDKrWg0QUEHxcD4ocXXD42nJu0TG+sXjC4j2RsigHTcmw@xxxxxxxxxxxxxx> <4F6624A3.5010206@xxxxxxxxxxxxxxxxx> <20331.39194.377610.888636@xxxxxxxxxxxxxxxxxx> <201203232348.09158.Martin@xxxxxxxxxxxx> <20333.8944.573177.821944@xxxxxxxxxxxxxxxxxx>
>> [ ... ] there has been quite some other metadata related
>> performance improvements. Thus IMHO reducing the recent
>> improvements in metadata performance is underselling XFS and
>> overselling delaylog. [ ... ]

> That's a good way of putting it, and I am pleased that I finally
> get a reasonable comment on this story, and one that agrees with
> one of my previous points in this thread: [ ... ]
[ ... ]
> http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
>  «* Ext4 can be up 20-50x times than XFS when data is also being
>     written as well (e.g. untarring kernel tarballs).
>   * This is XFS @ 2009-2010.
>   * Unless you have seriously fast storage, XFS just won't
>     perform well on metadata modification heavy workloads.»

> It is never mentioned that 'ext4' is 20-50x faster on metadata
> modification workloads because it implements much weaker
> semantics than «XFS @ 2009-2010», and that 'delaylog' matches
> 'ext4' because it implements similarly weaker semantics, by
> reducing the frequency of commits, as the XFS FAQ briefly
> summarizes: [ ... ]

As to this, I have realized that there is a very big detail that
I had taken as implicit, but that at this point should perhaps
be made explicit, regarding the deliberately misleading propaganda
that «Ext4 can be up 20-50x times than XFS when data is also
being written as well (e.g. untarring kernel tarballs).»:

  Almost all «untarring kernel tarballs» "benchmarks" are done
  with GNU 'tar', and it does not 'fsync'.

This matters because XFS has done the "right thing" with 'fsync'
for a long time, and if the application does 'fsync' then 'ext4'
and XFS, with or without 'delaylog', are mostly equivalent.

Conversely Schilling's 'tar' does 'fsync' and as a result it is
often considered (by the gullible crowd to which the presentation
propaganda referred to above is addressed) to have less
"performance" than GNU 'tar'.
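The difference between the two extraction styles comes down to a
single call per file. The following is only a minimal sketch of
that difference, not code from either 'tar'; the function name
'extract_member' and its 'durable' flag are hypothetical and
exist purely for illustration:

```python
import os

def extract_member(path, data, durable):
    """Write one archive member to `path`.

    With durable=True the data is fsync'ed before close, which is the
    pattern the 'star' trace shows; with durable=False the file is
    just written and closed, as in the GNU 'tar' trace.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        if durable:
            # Ask the kernel to commit this file to stable storage
            # before the call returns.
            os.fsync(fd)
    finally:
        os.close(fd)
```

Calling 'extract_member(p, data, durable=False)' returns as soon
as the data is in the page cache, so any timing of it measures
mostly memory speed.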

To illustrate I have made a tiny test '.tar' file containing a
directory with two files, and this is what happens with
Schilling's 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write star xf d.tar
  open("d.tar", O_RDONLY)                 = 7
  read(7, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
512) = 512
  Process 8201 attached
  [ ... ]
  [pid  8200] lstat("d/", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid  8200] lstat("d/", 0x7fff174d9330) = -1 ENOENT (No such file or directory)
  [pid  8200] access("d", F_OK)           = -1 ENOENT (No such file or directory)
  [pid  8200] mkdir("d", 0700)            = 0
  [pid  8200] lstat("d/", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
  [pid  8200] lstat("d/f1", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid  8200] open("d/f1", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  [pid  8200] write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  [pid  8200] fsync(4 <unfinished ...>
  [pid  8201] <... write resumed> )       = 1
  [pid  8201] read(7, "", 10240)          = 0
  Process 8201 detached
  <... fsync resumed> )                   = 0
  --- SIGCHLD (Child exited) @ 0 (0) ---
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  lstat("d/f2", 0x7fff174d9490)           = -1 ENOENT (No such file or directory)
  open("d/f2", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  fsync(4)                                = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d", {{1332588242, 0}, {1332588242, 0}}) = 0
  write(2, "star: 1 blocks + 0 bytes (total "..., 58star: 1 blocks + 0 bytes 
(total of 10240 bytes = 10.00k).
  ) = 58

Compare with GNU 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write tar xf d.tar
  [ ... ]
  open("d.tar", O_RDONLY)                 = 3
  read(3, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
10240) = 10240
  [ ... ]
  mkdir("d", 0700)                        = -1 EEXIST (File exists)
  stat("d", {st_mode=S_IFDIR|0700, st_size=24, ...}) = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f1")                          = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  close(4)                                = 0
  utimensat(AT_FDCWD, "d/f1", {{1332589368, 193330071}, {1332588240, 0}}, 0) = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f2")                          = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  close(4)                                = 0
  utimensat(AT_FDCWD, "d/f2", {{1332589368, 193330071}, {1332588257, 0}}, 0) = 0
  close(3)                                = 0
  utimensat(AT_FDCWD, "d", {{1332589368, 193330071}, {1332588242, 0}}, 0) = 0
  close(1)                                = 0
  close(2)                                = 0

In effect running GNU 'tar x ...' is the same as running
'eatmydata tar x ...'; and indeed, as its documentation says,
'eatmydata' is designed to achieve higher "performance" by
turning programs that behave like Schilling's 'tar' into programs
that behave like GNU 'tar'.
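To check which of the two behaviours a given program has, one can
count the system calls in a saved 'strace' log. This is only a
rough sketch: the helper name 'count_syscalls' is hypothetical,
and it assumes the log uses the line layout of the traces above
(an optional '[pid N]' prefix followed by the call name):

```python
import re
from collections import Counter

# Optional "[pid  NNNN] " prefix, then the syscall name and "(".
# "<... fsync resumed>" continuation lines do not match, so a call
# split by strace into "unfinished"/"resumed" is counted only once.
_CALL = re.compile(r"^\s*(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def count_syscalls(trace_text):
    """Return a Counter of syscall names started in an strace log."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = _CALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A log in which 'count_syscalls(text)["fsync"]' is zero while
'open' and 'write' are not indicates the GNU 'tar' style of
extraction.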

When GNU 'tar' is used as a "benchmark" for 'delaylog' and there
are no 'fsync's, the longer the interval between commits (and
thus the implicit unsafety) the higher the "performance", or at
least that's the argument I think propagandists and buffoons may
be using.

That's one important reason why I mentioned 'eatmydata' as one
performance enhancing technique in a group with 'nobarrier' and
'delaylog'; and why I was amused by this buffoonery:

 «So you're comparing delaylog's volatile buffer architecture to
  software that *intentionally and transparently disables fsync*?»

Because when the 'delaylog' propagandists write that:

  «Ext4 can be up 20-50x times than XFS when data is also being
   written as well (e.g. untarring kernel tarballs).»

it is they who are comparing "performance" using GNU 'tar', which
intentionally and transparently does not use 'fsync' at all.

To illustrate here are some "benchmarks", which hopefully should
be revealing as to the merit of the posturings of some of the
buffoons or propagandists that have been discontributing to this
discussion (note that there are somewhat subtle details both as
to the setup and the results):

--------------------------------------------------------------
#  uname -a
Linux base.ty.sabi.co.uk 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw 0 0
/dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
/dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 500000000
vm.dirty_expire_centisecs = 2000
vm.dirty_writeback_centisecs = 1000
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m1.027s
user    0m0.105s
sys     0m0.922s
Dirty:          419700 kB
Writeback:           0 kB

real    0m5.163s
user    0m0.000s
sys     0m0.473s
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    0m1.204s
user    0m0.139s
sys     0m1.270s
Dirty:          419456 kB
Writeback:           0 kB

real    0m5.012s
user    0m0.000s
sys     0m0.458s
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    23m29.346s
user    0m0.327s
sys     0m2.280s
Dirty:             108 kB
Writeback:           0 kB

real    0m0.236s
user    0m0.000s
sys     0m0.199s
--------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m46.554s
user    0m0.107s
sys     0m1.271s
Dirty:          415168 kB
Writeback:           0 kB

real    1m54.913s
user    0m0.000s
sys     0m0.325s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    60m15.723s
user    0m0.442s
sys     0m7.009s
Dirty:               4 kB
Writeback:           0 kB

real    0m0.222s
user    0m0.000s
sys     0m0.194s
----------------------------------------------------------------

From the above my conclusion is that «XFS @ 2009-2010» has half
the performance of 'ext4' on this workload, and that «Ext4 can be
up 20-50x times than XFS when data is also being written as well
(e.g. untarring kernel tarballs).» only when both data and
metadata are written to RAM by 'ext4'.
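The ratios can be worked out directly from the 'real' times in
the runs above. This is just a sketch of the arithmetic; the
helper names 'seconds' and 'total' are made up for illustration,
and it adds the time of the final 'sync' to each extraction so
that the comparison is of time until the data is on disk:

```python
import re

def seconds(t):
    """Convert a `time` value such as '23m29.346s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError(f"unrecognized time value: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# 'real' times copied from the runs above: (extraction, final sync).
runs = {
    ("ext4", "tar"): ("0m1.027s", "0m5.163s"),
    ("ext4", "star"): ("23m29.346s", "0m0.236s"),
    ("xfs", "tar"): ("0m46.554s", "1m54.913s"),
    ("xfs", "star"): ("60m15.723s", "0m0.222s"),
}

def total(fs, tool):
    """Extraction time plus the time of the sync that follows it."""
    extract, sync = runs[(fs, tool)]
    return seconds(extract) + seconds(sync)

for tool in ("tar", "star"):
    ratio = total("xfs", tool) / total("ext4", tool)
    print(f"{tool}: XFS/ext4 time including sync = {ratio:.1f}x")
```

With these figures the ratio comes to about 26x for GNU 'tar'
and about 2.6x for 'star' with 'fsync'.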

One can spend a lot of time changing parameters, as in using
'delaylog' or 'nobarrier' etc.
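For anyone who wants to repeat this kind of comparison under
different parameters, the measurement pattern used in the runs
above (time the extraction, look at 'Dirty', then time the
'sync' separately) can be sketched roughly as follows. This is
not what was run above: it uses Python's 'tarfile' module, which
is assumed here not to call 'fsync', and the names 'meminfo_kb'
and 'timed_extract' are hypothetical:

```python
import os
import tarfile
import time

def meminfo_kb(text, key):
    """Return the kB value of `key` (e.g. 'Dirty') from /proc/meminfo text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            return int(rest.split()[0])
    raise KeyError(key)

def timed_extract(archive, dest):
    """Extract `archive` into `dest`; return (extract_s, sync_s).

    The second figure is the time taken by a system-wide sync after
    the extraction, like the final 'time sync' in the transcripts.
    """
    t0 = time.perf_counter()
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    t1 = time.perf_counter()
    os.sync()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

Reporting only the first of the two returned figures is what
makes an extraction without 'fsync' look so fast.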

Here are some comparisons that I find interesting, made with my
favourite, rather "tighter" flusher parameters:

----------------------------------------------------------------
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw 0 0
/dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
/dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 100000
vm.dirty_expire_centisecs = 200
vm.dirty_writeback_centisecs = 100
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m6.776s
user    0m0.107s
sys     0m1.260s
Dirty:            1776 kB
Writeback:           0 kB

real    0m0.231s
user    0m0.000s
sys     0m0.197s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    2m25.805s
user    0m0.135s
sys     0m1.812s
Dirty:            2372 kB
Writeback:          84 kB

real    0m1.683s
user    0m0.000s
sys     0m0.196s
----------------------------------------------------------------

That's a bit of a surprise, because the times to completion for
both with 'eatmydata tar', when the flusher parameters allowed
writing entirely to memory, were the same. It looks as if XFS,
when flushing, still does a fair number of implicit metadata
commits, as switching off barriers shows:

----------------------------------------------------------------
#  mount -o remount,barrier=0 /dev/sdd8 /tmp/ext4
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m7.388s
user    0m0.127s
sys     0m1.235s
Dirty:             508 kB
Writeback:           0 kB

real    0m0.243s
user    0m0.000s
sys     0m0.199s
----------------------------------------------------------------
#  mount -o remount,nobarrier /dev/sdd3 /tmp/xfs
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m31.047s
user    0m0.124s
sys     0m1.880s
Dirty:            2324 kB
Writeback:          24 kB

real    0m0.269s
user    0m0.000s
sys     0m0.195s
----------------------------------------------------------------

It seems likely that 'ext4' runs headlong without commits on
either metadata or data ('ext4' and 'ext3' in effect have a
rather loose 'delaylog'). XFS however seems to be at a bit of a
disadvantage, as with 'nobarrier' and 'eatmydata tar' the time to
completion should be the same. The partition for XFS is on inner
tracks, but that does not make that much of a difference.

Also compare with 'ext4' using 'eatmydata tar' with no barriers
and using 'star' with no barrier and also 'data=writeback':

----------------------------------------------------------------
  base#  umount /tmp/ext4; mount -t ext4 -o defaults,barrier=0,data=writeback /dev/sdd3 /tmp/ext4
  base#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m6.158s
user    0m0.123s
sys     0m1.233s
Dirty:            1704 kB
Writeback:           0 kB

real    0m0.247s
user    0m0.001s
sys     0m0.194s
----------------------------------------------------------------
  base#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    0m32.101s
user    0m0.196s
sys     0m1.718s
Dirty:              24 kB
Writeback:          48 kB

real    0m0.217s
user    0m0.000s
sys     0m0.193s
----------------------------------------------------------------

Finally here is on XFS, with 'delaylog', on a system with a
3.x kernel and a rather fast (especially on small random writes)
SSD drive (and my usual tighter flusher parameters):

----------------------------------------------------------------
#  uname -a
Linux.ty.sabi.co.UK 3.0.0-15-generic #26~lucid1-Ubuntu SMP Wed Jan 25 15:37:10 UTC 2012 x86_64 GNU/Linux
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl -a 2>/dev/null | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw,relatime,size=1024000k 0 0
/dev/sda6 /tmp/xfs xfs rw,noatime,nodiratime,attr2,delaylog,discard,inode64,logbsize=256k,sunit=16,swidth=8192,noquota 0 0
/dev/sda3 /tmp/ext4 ext4 rw,nodiratime,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered,discard 0 0
fs.xfs.age_buffer_centisecs = 1500
fs.xfs.filestream_centisecs = 3000
fs.xfs.xfsbufd_centisecs = 100
fs.xfs.xfssyncd_centisecs = 3000
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 100000000
vm.dirty_expire_centisecs = 200
vm.dirty_writeback_centisecs = 100
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m5.148s
user    0m0.300s
sys     0m2.876s
Dirty:             50052 kB
Writeback:             0 kB
WritebackTmp:          0 kB

real    0m0.784s
user    0m0.000s
sys     0m0.100s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    6m21.946s
user    0m0.808s
sys     0m11.321s
Dirty:                 0 kB
Writeback:             0 kB
WritebackTmp:          0 kB

real    0m0.097s
user    0m0.000s
sys     0m0.044s
----------------------------------------------------------------

The effect of 'delaylog' is pretty obvious there.

The numbers above with their wide variation depending on changes
in the level of safety requested amply demonstrate that it takes
the skills of a propagandist or a buffoon to boast about the
"performance" of 'delaylog' and comparisons with 'ext4' without
prominently mentioning the big safety tradeoffs involved.
