On 3/14/2013 9:59 AM, Dave Hall wrote:
> Linux decoy 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64
Ok, so you're already on a recent kernel with delaylog.
> ~$ grep xfs /etc/fstab
> LABEL=backup /infortrend xfs inode64,noatime,nodiratime,nobarrier 0 0
XFS uses relatime by default, so noatime/nodiratime are largely
redundant, though not part of the problem. inode64 is good, as it keeps
your files and their metadata close together. nobarrier is fine with a
functioning BBWC.
> meta-data=/dev/sdb1 isize=256 agcount=26,
> agsize=268435455 blks
> = sectsz=512 attr=2
> data = bsize=4096 blocks=6836364800, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
Standard internal log, no alignment. With delaylog, 512MB BBWC, and a
nearly pure metadata workload, this should be fine.
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/sdb1 27343372288 20432618356 6910753932 75% /infortrend
Looks good. 75% is close to tickling the free space fragmentation
dragon but you're not there yet.
> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/sdb1 5469091840 1367746380 4101345460 26% /infortrend
Plenty of free inodes.
> # xfs_db -r -c freesp /dev/sdb1
> from to extents blocks pct
> 1 1 832735 832735 0.05
> 2 3 432183 1037663 0.06
> 4 7 365573 1903965 0.11
> 8 15 352402 3891608 0.23
> 16 31 332762 7460486 0.43
> 32 63 300571 13597941 0.79
> 64 127 233778 20900655 1.21
> 128 255 152003 27448751 1.59
> 256 511 112673 40941665 2.37
> 512 1023 82262 59331126 3.43
> 1024 2047 53238 76543454 4.43
> 2048 4095 34092 97842752 5.66
> 4096 8191 22743 129915842 7.52
> 8192 16383 14453 162422155 9.40
> 16384 32767 8501 190601554 11.03
> 32768 65535 4695 210822119 12.20
> 65536 131071 2615 234787546 13.59
> 131072 262143 1354 237684818 13.76
> 262144 524287 470 160228724 9.27
> 524288 1048575 74 47384798 2.74
> 1048576 2097151 1 2097122 0.12
Your free space map isn't completely horrible given you're at 75%
capacity. Looks like most of it is in chunks 32MB and larger. Those
14.8m files have a mean size of ~1.22MB which suggests most of the files
are small, so you shouldn't be having high seek load (thus latency)
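
If you want to re-check that "32MB and larger" figure as the filesystem fills, here is a rough sketch. The function name freesp_share is my own invention, not part of xfsprogs. It assumes the five-column freesp layout shown above and 4096-byte blocks, so 8192 blocks is 32MB. For the histogram quoted here, the pct column for the rows from 8192 up sums to roughly 72%.

```shell
# Percentage of free blocks that sit in extents of at least $1 blocks.
# Reads `xfs_db -r -c freesp` output on stdin. Rows whose first field is
# not a number (the header line) are skipped.
# freesp_share is a hypothetical helper name, not an xfsprogs tool.
freesp_share() {
    awk -v min="$1" '
        $1 ~ /^[0-9]+$/ { total += $4; if ($1 >= min) big += $4 }
        END { if (total > 0) printf "%.1f\n", 100 * big / total }
    '
}

# Usage, with the device from this thread:
#   xfs_db -r -c freesp /dev/sdb1 | freesp_share 8192
```

It sums the blocks column rather than the pct column, so rounding in the printed percentages doesn't accumulate.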
> The RAID box is an Infortrend S16S-G1030 with 512MB cache and a fully
> functional battery. I couldn't find any details about the internal
> RAID implementation used by Infortrend. The array is SAS attached to
> an LSI HBA (SAS2008 PCI-Express Fusion-MPT SAS-2).
It's an older unit, definitely not the fastest in its class, but unless
the firmware is horrible the 512MB BBWC should handle this metadata
workload with aplomb. With 128GB RAM and Linux read-ahead caching you
don't need the RAID controller to be doing read caching. Go into the
SANWatch interface and make sure you're dedicating all the cache to
writes, not reads. This may or may not be configurable. Some firmware
will simply drop read cache lines dynamically when writes come in. Some
let you manually tweak the ratio. I'm not that familiar with the
Infortrend units. But again, this is a minor optimization, and I don't
think this is part of the problem.
> The system hardware is a SuperMicro quad 8-core XEON E7-4820 2.0GHz with
> 128 GB of ram, hyper-theading enabled. (This is something that I
> inherited. There is no doubt that it is overkill.)
Just a bit. 64 hardware threads, 72MB of L3 cache, and 128GB RAM for a
storage server with two storage HBAs and low throughput disk arrays.
Apparently running a Debian mirror is more compute intensive than I
> Another bit of information that you didn't ask about is the I/O
> scheduler algorithm.
Didn't get that far yet. ;)
> I just checked and found it set to 'cfq', although
> I though I had set it to 'noop' via a kernel parameter in GRUB.
As you're using a distro kernel, I recommend simply doing it in root's
crontab. That way it can't get 'lost' during kernel upgrades due to
grub update problems, etc. The scheduler can be changed on the fly so
it doesn't matter where you set it in the boot sequence.
@reboot /bin/echo noop > /sys/block/sdb/queue/scheduler
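
To confirm the setting actually took after a reboot, read the same sysfs file back. The kernel lists the available schedulers and puts the active one in square brackets. A small sketch for pulling out the bracketed name follows; active_scheduler is just a name I made up for illustration.

```shell
# Print the active I/O scheduler given the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "noop deadline [cfq]".
# The kernel marks the active scheduler with square brackets.
# active_scheduler is a hypothetical helper name.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage, with the device from this thread:
#   active_scheduler "$(cat /sys/block/sdb/queue/scheduler)"
```

If that prints anything other than noop, the crontab entry didn't run or something overrode it later in boot.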
> Also, some observations about the cp -al: In parallel to investigating
> hardware/OS/filesystem issue I have done some experiments with cp -al.
> It hurts to have 64 cores available and see cp -al running the wheels
> off just one, with a couple others slightly active with system level
This tends to happen when one runs single threaded user space code on a
> So I tried some experiments where I copied smaller segments of
> the file tree in parallel (using make -j). I haven't had the chance to
> fully play this out, but these parallel cp invocations completed very
> quickly. So it would appear that the cp command itself may bog down
> with such a large file tree. I haven't had a chance to tear apart the
> source code or do any profiling to see if there are any obvious problems
> Lastly, I will mention that I see almost 0% wa when watching top.
So it's probably safe to say at this point that XFS and IO in general
are not the problem.
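
For anyone wanting to repeat the parallel cp -al experiment without writing a Makefile, something along these lines should do. This is only a sketch, assuming GNU cp, find and xargs. It splits the work by top-level directory entry, which may or may not match how you segmented the tree, and parallel_cp_al is a made-up name.

```shell
# Hard-link every top-level entry of $1 into $2, running up to $3
# cp processes at once (default 4). Assumes GNU cp, find and xargs.
# parallel_cp_al is a hypothetical helper name.
parallel_cp_al() {
    src=$1 dst=$2 jobs=${3:-4}
    mkdir -p "$dst" || return 1
    # One cp -al per top-level entry; -t names the target directory so
    # xargs can append the source path as the last argument.
    find "$src" -mindepth 1 -maxdepth 1 -print0 |
        xargs -0 -r -P "$jobs" -n 1 cp -al -t "$dst"
}

# Usage (both paths are placeholders and must be on the same filesystem):
#   parallel_cp_al /path/to/snapshot.0 /path/to/snapshot.1 8
```

The balance is only as good as the top-level split. One huge directory next to many small ones will still leave a single cp doing most of the work.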
One thing you did not mention is how you are using rsnapshot. If you
are using it as most folks do to backup remote filesystems of other
machines over ethernet, what happens when you simply schedule multiple
rsnapshot processes concurrently, targeting each at a different remote
If you're using rsnapshot strictly locally, you should take a hard look
at xfsdump. It exists specifically for backing up XFS filesystems/files
and has been around a very long time and is very mature. It's not quite
as flexible as rsnapshot and may require more disk space, but it is
lightning fast, even though limited to a single thread on Linux. Why is
it lightning fast? Because the bulk of the work is performed in kernel
space by the XFS driver, directly manipulating the filesystem--no user
space execution or system calls. See 'man xfsdump'.
Familiarize yourself with it and perform a test dump, to a file, of a
large (~1TB) directory/tree. You'll see what we mean by lightning fast,
compared to rsnapshot and other user space methods. And you'll actually
see some IO throughput with this. ;)
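
A test dump along those lines might look like the sketch below. The helper only prints the command so you can review it before running it as root. The labels, the subtree and the dump file path are placeholders of mine, and xfsdump_cmd is a made-up name. The -L and -M labels are given on the command line so xfsdump doesn't stop to prompt for them, and -s takes a path relative to the mount point.

```shell
# Print (do not run) an xfsdump command for a level 0 dump of a single
# subtree to a file.
#   $1: mount point of the XFS filesystem
#   $2: subtree to dump, relative to the mount point
#   $3: dump file to write, ideally on a different filesystem
# xfsdump_cmd, "testdump" and "testmedia" are hypothetical names.
xfsdump_cmd() {
    printf 'xfsdump -l 0 -L %s -M %s -f %s -s %s %s\n' \
        testdump testmedia "$3" "$2" "$1"
}

# Usage (subtree and dump file are placeholders):
#   xfsdump_cmd /infortrend some/large/tree /scratch/test.dump
```

Wrap the printed command in time(1) and compare against a cp -al or rsnapshot pass over the same tree.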