
XFS/Linux Sanity check

To: xfs@xxxxxxxxxxx
Subject: XFS/Linux Sanity check
From: Paul Anderson <pha@xxxxxxxxx>
Date: Mon, 2 May 2011 11:47:48 -0400
Sender: powool@xxxxxxxxx

Our genetic sequencing research group is growing our file storage from
1PB to 2PB.

Our workload looks very much like large video processing might look -
relatively low metadata activity, very, very high sequential I/O.  The
servers will either be doing very high I/O with local I/O-bound jobs,
or serving data via NFSv4 (or possibly custom data distribution means)
to our compute grid for compute-bound jobs.  Our first PB of data is
largely on Promise RAID arrays, all of which are set up with XFS.
Generally, we're big fans of XFS for stability, high performance, and
robustness in the face of crashes.  We tried ZFS but ran into I/O
throttling issues that at the time seemed intractable (write picketing -
essentially half the hardware's maximum write rate).

We are deploying five Dell 810s (192 GiB RAM, 12 cores each), each with
three LSI 9200-8E SAS controllers and three SuperMicro 847 45-drive-bay
cabinets populated with enterprise-grade 2TB drives.

We're running Ubuntu 10.04 LTS and have tried both the stock kernel
(2.6.32-30) and 2.6.35 from linux.org.  We organize the storage as one
software (MD) RAID 0 striped across seven MD RAID 6 arrays of 18 drives
each, giving 204 TiB usable (9 of the 135 drives are unused).  XFS is
set up properly (as far as I know) with respect to stripe and chunk
sizes.  Allocation groups are 1 TiB in size, which seems sane for the
size of files we expect to work with.
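
For reference, the layout is roughly what the sketch below shows; the
512 KiB chunk size is an assumption and the device names are
placeholders, so substitute the actual geometry:

  # one of the seven RAID 6 members: 18 drives, assumed 512 KiB chunk
  mdadm --create /dev/md1 --level=6 --raid-devices=18 --chunk=512 /dev/sd[b-s]

  # RAID 0 striped across the seven RAID 6 arrays
  mdadm --create /dev/md0 --level=0 --raid-devices=7 /dev/md[1-7]

  # XFS aligned to the RAID 6 geometry: su = md chunk size, sw = number
  # of data disks per RAID 6 (18 - 2 = 16), with 1 TiB allocation groups
  mkfs.xfs -d su=512k,sw=16,agsize=1t /dev/md0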

In isolated testing, I see around 5 GiBytes/second raw (135 parallel
dd reads), and with a benchmark test of 10 simultaneous 64 GiByte dd
commands, I can see just shy of 2 GiBytes/second reading and around
1.4 GiBytes/second writing through XFS.  The benchmark is crude, but
fairly representative of our expected use.
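
The crude benchmark is essentially parallel large sequential dd runs
along these lines (the mount point and file names are illustrative):

  # 10 simultaneous 64 GiB sequential writes through XFS
  for i in $(seq 1 10); do
      dd if=/dev/zero of=/mnt/big/test$i bs=1M count=65536 &
  done
  wait

  # the corresponding sequential reads
  for i in $(seq 1 10); do
      dd if=/mnt/big/test$i of=/dev/null bs=1M &
  done
  wait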

md apparently does not support barriers, so I know we are badly exposed
in that regard.  As a test, I disabled the write cache on all drives;
performance dropped by 30% or so, but since md itself is apparently the
problem, barriers still didn't work.
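
For reference, disabling the write caches and re-checking barriers can
be done roughly as below (device names and mount point are
illustrative):

  # turn off the on-drive write cache on every data disk
  for d in /dev/sd[b-z]; do
      hdparm -W 0 "$d"
  done

  # request barriers explicitly and see whether XFS ends up disabling them
  mount -o barrier /dev/md0 /mnt/big
  dmesg | grep -i barrier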

Nonetheless, what we need, but don't have, is stability.

With 2.6.32-30, we get reliable kernel panics after 2 days of sustained
rsync to the machine (around 150-250 MiBytes/second for the entire
time - the source machines are slow).  With 2.6.35, we hit a bad
resource contention problem fairly quickly - much less than 24 hours.
In that case we start getting XFS kernel thread timeouts similar to
what I've seen posted here recently, but it isn't clear whether it is
only XFS or also the ext3 boot drives that are starved for I/O;
suspending or killing all I/O load doesn't clear the condition - only a
reboot does.
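
For hangs like this, the state worth capturing over the serial console
is probably the blocked-task backtraces, roughly along these lines (the
timeout value is just an example):

  # make sure magic sysrq is enabled
  echo 1 > /proc/sys/kernel/sysrq

  # dump backtraces of every task stuck in uninterruptible (D) state
  echo w > /proc/sysrq-trigger

  # have the kernel report long-blocked tasks on its own
  # (needs CONFIG_DETECT_HUNG_TASK)
  echo 120 > /proc/sys/kernel/hung_task_timeout_secs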

Ideally, I'd firstly like to hear informed opinions about how I can
improve this arrangement - we are mildly flexible on RAID controllers,
very flexible on versions of Linux, etc., and can try other OSes as a
last resort (the leading contender there would be "something" running
ZFS, and though I love ZFS, it really didn't seem to work well for our
needs).

Secondly, I welcome suggestions about which version of the Linux kernel
you'd prefer to hear bug reports against, as well as what kind of
output is most useful (we're getting all chassis set up with serial
consoles so we can do kgdb and capture full kernel panic output).
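
For reference, the usual serial console arrangement is roughly as
below; the tty device and baud rate depend on the chassis:

  # kernel command line (grub): mirror console output to the serial port
  console=tty0 console=ttyS0,115200n8

  # treat oopses as panics and reboot 30 seconds after a panic, so
  # nothing is silently survived and the box comes back on its own
  sysctl -w kernel.panic_on_oops=1
  sysctl -w kernel.panic=30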

Thanks in advance,

Paul Anderson
Center for Statistical Genetics
University of Michigan
