GNU 'tar', Schilling's 'tar', write-cache/barrier

Peter Grandi pg_xf2 at xf2.for.sabi.co.UK
Sat Mar 24 11:27:19 CDT 2012


>> [ ... ] there has been quite some other metadata related
>> performance improvements. Thus IMHO reducing the recent
>> improvements in metadata performance is underselling XFS and
>> overselling delaylog. [ ... ]

> That's a good way of putting it, and I am pleased that I finally
> get a reasonable comment on this story, and one that agrees with
> one of my previous points in this thread: [ ... ]
[ ... ]
> http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
>  «* Ext4 can be up 20-50x times than XFS when data is also being
>     written as well (e.g. untarring kernel tarballs).
>   * This is XFS @ 2009-2010.
>   * Unless you have seriously fast storage, XFS just won't
>     perform well on metadata modification heavy workloads.»

> It is never mentioned that 'ext4' is 20-50x faster on metadata
> modification workloads because it implements much weaker
> semantics than «XFS @ 2009-2010», and that 'delaylog' matches
> 'ext4' because it implements similarly weaker semantics, by
> reducing the frequency of commits, as the XFS FAQ briefly
> summarizes: [ ... ]

As to this, I have realized that there is a very big detail that
I have taken as implicit, but that perhaps at this point should
be made explicit, regarding the deliberately misleading
propaganda that «Ext4 can be up 20-50x times than XFS when data
is also being written as well (e.g. untarring kernel tarballs).»:

  Almost all «untarring kernel tarballs» "benchmarks" are done
  with GNU 'tar', and it does not 'fsync'.

This matters because XFS has done the "right thing" with 'fsync'
for a long time, and if the application does 'fsync' then 'ext4'
and XFS, without or with 'delaylog', are mostly equivalent.

Conversely Schilling's 'tar' does 'fsync' and as a result it is
often considered (by the gullible crowd to which the presentation
propaganda referred to above is addressed) to have less
"performance" than GNU 'tar'.

To illustrate, I have made a tiny test '.tar' file containing a
directory with two files in it, and this is what happens with
Schilling's 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write star xf d.tar
  open("d.tar", O_RDONLY)                 = 7
  read(7, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
  Process 8201 attached
  [ ... ]
  [pid  8200] lstat("d/", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid  8200] lstat("d/", 0x7fff174d9330) = -1 ENOENT (No such file or directory)
  [pid  8200] access("d", F_OK)           = -1 ENOENT (No such file or directory)
  [pid  8200] mkdir("d", 0700)            = 0
  [pid  8200] lstat("d/", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
  [pid  8200] lstat("d/f1", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid  8200] open("d/f1", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  [pid  8200] write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  [pid  8200] fsync(4 <unfinished ...>
  [pid  8201] <... write resumed> )       = 1
  [pid  8201] read(7, "", 10240)          = 0
  Process 8201 detached
  <... fsync resumed> )                   = 0
  --- SIGCHLD (Child exited) @ 0 (0) ---
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  lstat("d/f2", 0x7fff174d9490)           = -1 ENOENT (No such file or directory)
  open("d/f2", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  fsync(4)                                = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d", {{1332588242, 0}, {1332588242, 0}}) = 0
  write(2, "star: 1 blocks + 0 bytes (total "..., 58star: 1 blocks + 0 bytes (total of 10240 bytes = 10.00k).
  ) = 58

Compare with GNU 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write tar xf d.tar
  [ ... ]
  open("d.tar", O_RDONLY)                 = 3
  read(3, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240
  [ ... ]
  mkdir("d", 0700)                        = -1 EEXIST (File exists)
  stat("d", {st_mode=S_IFDIR|0700, st_size=24, ...}) = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f1")                          = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  close(4)                                = 0
  utimensat(AT_FDCWD, "d/f1", {{1332589368, 193330071}, {1332588240, 0}}, 0) = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f2")                          = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  close(4)                                = 0
  utimensat(AT_FDCWD, "d/f2", {{1332589368, 193330071}, {1332588257, 0}}, 0) = 0
  close(3)                                = 0
  utimensat(AT_FDCWD, "d", {{1332589368, 193330071}, {1332588242, 0}}, 0) = 0
  close(1)                                = 0
  close(2)                                = 0
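
The two traces can also be compared mechanically by counting
system calls. What follows is only a minimal sketch in Python,
not part of either 'tar': the function name 'count_syscalls' and
the two sample strings are hypothetical, and the samples are
abridged copies of a few lines from the traces above (write
buffers and timestamps shortened).

```python
import re
from collections import Counter

# Syscall name at the start of an strace line, optionally preceded
# by the "[pid NNNN]" prefix that "strace -f" prints.
SYSCALL_RE = re.compile(r"^\s*(?:\[pid\s+\d+\]\s+)?([a-z_0-9]+)\(")


def count_syscalls(trace_text):
    """Count syscalls by name; '<... resumed>' lines are skipped."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Abridged lines from the Schilling 'tar' trace (assumption: shortened).
STAR_SAMPLE = '''\
[pid  8200] open("d/f1", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
[pid  8200] write(4, "..."..., 128) = 128
[pid  8200] fsync(4 <unfinished ...>
<... fsync resumed> )                   = 0
open("d/f2", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
write(4, "..."..., 4096) = 4096
fsync(4)                                = 0
'''

# Abridged lines from the GNU 'tar' trace (assumption: shortened).
GNU_SAMPLE = '''\
open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
write(4, "..."..., 128) = 128
close(4)                                = 0
open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
write(4, "..."..., 4096) = 4096
close(4)                                = 0
'''

if __name__ == "__main__":
    print("star:", count_syscalls(STAR_SAMPLE)["fsync"], "fsync calls")
    print("GNU :", count_syscalls(GNU_SAMPLE)["fsync"], "fsync calls")
```

On a full trace saved with 'strace -o', the same function can be
applied to the contents of the saved file.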

In effect running GNU 'tar x ...' is the same as running
'eatmydata star x ...'; and indeed, as its documentation says,
'eatmydata' is designed to achieve higher "performance" by
turning programs that behave like Schilling's 'tar' into programs
that behave like GNU 'tar'.
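
The per-file difference between the two behaviours is a single
call. As a minimal sketch in Python (this is not the code of
either 'tar'; 'extract_member' and its 'do_fsync' flag are
hypothetical names chosen for illustration), the extraction of
one archive member looks roughly like this:

```python
import os
import tempfile


def extract_member(path, data, do_fsync):
    """Write one file: open, write, optionally fsync, then close.

    do_fsync=True mimics the open/write/fsync sequence in the
    Schilling 'tar' trace; do_fsync=False mimics the
    open/write/close sequence in the GNU 'tar' trace.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        if do_fsync:
            # Ask the kernel to commit this file to stable storage
            # before the program moves on to the next member.
            os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        extract_member(os.path.join(tmp, "f1"), b"a" * 128, do_fsync=True)
        extract_member(os.path.join(tmp, "f2"), b"b" * 4096, do_fsync=False)
```

With 'do_fsync=False' the data merely sits in the page cache when
the function returns, and when it reaches the disk is left to the
kernel's writeback settings and the filesystem's commit interval.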

When GNU 'tar' is used as a "benchmark" for 'delaylog' and there
are no 'fsync's, the longer the interval between commits (and
thus the implicit unsafety) the higher the "performance", or at
least that's the argument I think propagandists and buffoons may
be using.

That's one important reason why I mentioned 'eatmydata' as one
performance enhancing technique in a group with 'nobarrier' and
'delaylog'; and why I was amused by this buffoonery:

 «So you're comparing delaylog's volatile buffer architecture to
  software that *intentionally and transparently disables fsync*?»

Because when the 'delaylog' propagandists write that:

  «Ext4 can be up 20-50x times than XFS when data is also being
   written as well (e.g. untarring kernel tarballs).»

it is they who are comparing "performance" using GNU 'tar', which
intentionally and transparently does not use 'fsync' at all.

To illustrate here are some "benchmarks", which hopefully should
be revealing as to the merit of the posturings of some of the
buffoons or propagandists that have been discontributing to this
discussion (note that there are somewhat subtle details both as
to the setup and the results):

--------------------------------------------------------------
#  uname -a
Linux base.ty.sabi.co.uk 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw 0 0
/dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
/dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 500000000
vm.dirty_expire_centisecs = 2000
vm.dirty_writeback_centisecs = 1000
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m1.027s
user    0m0.105s
sys     0m0.922s
Dirty:          419700 kB
Writeback:           0 kB

real    0m5.163s
user    0m0.000s
sys     0m0.473s
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    0m1.204s
user    0m0.139s
sys     0m1.270s
Dirty:          419456 kB
Writeback:           0 kB

real    0m5.012s
user    0m0.000s
sys     0m0.458s
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    23m29.346s
user    0m0.327s
sys     0m2.280s
Dirty:             108 kB
Writeback:           0 kB

real    0m0.236s
user    0m0.000s
sys     0m0.199s
--------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m46.554s
user    0m0.107s
sys     0m1.271s
Dirty:          415168 kB
Writeback:           0 kB

real    1m54.913s
user    0m0.000s
sys     0m0.325s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    60m15.723s
user    0m0.442s
sys     0m7.009s
Dirty:               4 kB
Writeback:           0 kB

real    0m0.222s
user    0m0.000s
sys     0m0.194s
----------------------------------------------------------------
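
Since each run above reports two 'real' times (the extraction and
the following 'sync'), it can help to add them up before
comparing runs. The following is only a small sketch in Python
for that arithmetic: the 'real' values are copied from the
transcripts above, while the labels and the names 'parse_real',
'RUNS' and 'totals' are hypothetical, chosen for illustration.

```python
import re


def parse_real(text):
    """Convert a 'time' value such as '23m29.346s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError("unrecognized time format: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


# (label, 'real' of the extraction, 'real' of the following sync),
# copied from the transcripts above.
RUNS = [
    ("ext4, GNU tar", "0m1.027s", "0m5.163s"),
    ("ext4, star -no-fsync", "0m1.204s", "0m5.012s"),
    ("ext4, star", "23m29.346s", "0m0.236s"),
    ("xfs, GNU tar", "0m46.554s", "1m54.913s"),
    ("xfs, star", "60m15.723s", "0m0.222s"),
]


def totals(runs):
    """Return {label: extraction seconds + sync seconds}."""
    return {
        label: parse_real(extract) + parse_real(sync)
        for label, extract, sync in runs
    }


if __name__ == "__main__":
    for label, seconds in totals(RUNS).items():
        print("%-22s %10.3f s" % (label, seconds))
```

The sum is the wall-clock time until the data is on disk, rather
than the time until the 'tar' process returns.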


