Re: Fwd: Sudden File System Corruption

To: Mike Dacre <mike.dacre@xxxxxxxxx>
Subject: Re: Fwd: Sudden File System Corruption
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Sat, 07 Dec 2013 05:12:41 -0600
Cc: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAPd9ww8+W2VX2HAfxEkVN5mL1a_+=HDAStf1126WSE33Vb=VsQ@xxxxxxxxxxxxxx>
References: <CAPd9ww_qT9J_Rt04g7+OApoBeggNOyWNwD+57DiDTuUvz-O-0g@xxxxxxxxxxxxxx> <52A03513.6030408@xxxxxxxxxxxxxxxxx> <CAPd9ww9hsOFK6pxqRY-YtLLAkkJHCuSi1BaM4n9=2XTjNVAn2Q@xxxxxxxxxxxxxx> <CAPd9ww9QzFWUnLtzkdktd+fSX9pdft+wL6cvG2MzLpSdLko1dg@xxxxxxxxxxxxxx> <52A191BA.20800@xxxxxxxxxxxxxxxxx> <CAPd9ww8+W2VX2HAfxEkVN5mL1a_+=HDAStf1126WSE33Vb=VsQ@xxxxxxxxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:24.0) Gecko/20100101 Thunderbird/24.1.1
On 12/6/2013 4:14 PM, Mike Dacre wrote:
> On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
...
> UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science                xfs
> defaults,inode64          1 0

Your RAID card has persistent write cache (BBWC) and we know it's
enabled from your tool output.  By default XFS assumes BBWC is not
present, and uses write barriers to ensure order/consistency.  Using
barriers on top of BBWC will be detrimental to write performance, for a
couple of reasons:

1.  Barriers prevent the controller from optimizing writeback patterns
2.  Barriers frequently force a flush of part or all of the write cache

Add 'nobarrier' to your mount options to avoid this problem.  It should
speed up many, if not all, write operations considerably, which will in
turn decrease seek contention amongst jobs.  Currently your write cache
isn't working nearly as well as it should, and in fact could be
operating horribly.
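
As a rough sketch of the fstab change, here is a few lines of Python
that append the option to the entry you posted.  The helper name
add_mount_option is made up for illustration, and this assumes the XFS
in your kernel still accepts the nobarrier option:

```python
# Sketch only: add_mount_option() is a hypothetical helper, not part of
# any existing tool. It rewrites the options field of one fstab entry.
def add_mount_option(fstab_line: str, option: str) -> str:
    """Return the fstab entry with `option` appended to its options field."""
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least: device, mount point, type, options")
    options = fields[3].split(",")
    if option not in options:          # do not add the option twice
        options.append(option)
    fields[3] = ",".join(options)
    return " ".join(fields)


# The /science entry quoted above, joined onto one line.
entry = ("UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs "
         "defaults,inode64 1 0")
new_entry = add_mount_option(entry, "nobarrier")
print(new_entry)
```

The new option takes effect the next time the filesystem is mounted.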

> On the slave nodes, I managed to reduce the demand on the disks by adding
> the actimeo=60 mount option.  Prior to doing this I would sometimes see the
> disk being negatively affected by enormous numbers of getattr requests.
>  Here is the fstab mount on the nodes:
> 
> 192.168.2.1:/science                      /science                nfs
> defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw  0 0

A one minute attribute cache lifetime seems a little high for a
compute cluster.  But if you've had no ill effects and it squelched the
getattr flood, that's fine.

...
> Correct, I am not consciously aligning the XFS to the RAID geometry, I
> actually didn't know that was possible.

XFS alignment is not something to worry about in this case.

...
>> So it's a small compute cluster using NFS over Infiniband for shared
>> file access to a low performance RAID6 array.  The IO resource sharing
>> is automatic.  But AFAIK there's no easy way to enforce IO quotas on
>> users or processes, if at all.  You may simply not have sufficient IO to
>> go around.  Let's ponder that.
> 
> I have tried a few things to improve IO allocation.  BetterLinux have a
> cgroup control suite that allow on-the-fly user-level IO adjustments,
> however I found them to be quite cumbersome.

This isn't going to work well because even a tiny IO stream, such as a
complex find command or ls -R, can seek the disks to death.  A single
command like these can generate thousands of seeks.  Shaping/limiting
user IO won't affect this.

...
>> Looking at the math, you currently have approximately 14*150=2100
>> seeks/sec capability with 14x 7.2k RPM data spindles.  That's less than
>> 100 seeks/sec per compute node, i.e. each node is getting about 2/3rd of
>> the performance of a single SATA disk from this array.  This simply
>> isn't sufficient for servicing a 23 node cluster, unless all workloads
>> are compute bound, and none IO/seek bound.  Given the overload/crash
>> that brought you to our attention, I'd say some of your workloads are
>> obviously IO/seek bound.  I'd say you probably need more/faster disks.
>> Or you need to identify which jobs are IO/seek heavy and schedule them
>> so they're not running concurrently.
> 
> Yes, this is a problem.  We sadly lack the resources to do much better than
> this, we have recently been adding extra storage by just chaining together
> USB3 drives with RAID and LVM... which is cumbersome and slow, but cheaper.

USB disk is generally a recipe for disaster.  Plenty of horror stories
on both this list and linux-raid regarding USB connected drives,
enclosures, etc.  I pray you don't run into those problems.
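
For reference, the seek budget quoted above can be checked with a few
lines of Python.  The 150 seeks/sec per drive is the same rough figure
used in the quoted math, not a measurement of your disks:

```python
# Back-of-envelope seek budget for the array, using the quoted figures.
DATA_SPINDLES = 14      # RAID6 data spindles
SEEKS_PER_DISK = 150    # rough estimate for one 7.2k RPM drive
COMPUTE_NODES = 23

array_seeks = DATA_SPINDLES * SEEKS_PER_DISK
per_node = array_seeks / COMPUTE_NODES
fraction_of_one_disk = per_node / SEEKS_PER_DISK

print(f"array total:       {array_seeks} seeks/sec")
print(f"per compute node:  {per_node:.0f} seeks/sec")
print(f"share of one disk: {fraction_of_one_disk:.2f}")
```

With all 23 nodes busy, each gets roughly 60% of what a single SATA
disk could deliver.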

> My current solution is to be on the alert for high IO jobs, and to move
> them to a specific torque queue that limits the number of concurrent jobs.
>  This works, but I have not found a way to do it automatically.
>  Thankfully, with a 12 member lab, it is actually not terribly complex to
> handle, but I would definitely prefer a more comprehensive solution.  I
> don't doubt that the huge IO and seek demands we put on these disks will
> cause more problems in the future.

Your LSI 9260 controller supports using SSDs for read/write flash cache.
LSI charges $279 for it.  It's called CacheCade Pro:

http://www.lsi.com/products/raid-controllers/pages/megaraid-cachecade-pro-software.aspx


Connect two good quality fast SSDs to the controller, such as:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820147192

Mirror the two SSDs to prevent cached writes from being lost if a single
SSD fails.  You now have a ~90K IOPS, 128GB, 500MB/s low latency
read/write cache in front of your RAID6 array.  This should go a long
way toward eliminating your bottlenecks.  You can accomplish this for
~$550 assuming you have two backplane drive slots free for the SSDs.  If
not, you add one of these for $279:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816117207

This is an Intel 24 port SAS expander, the same device as in your drive
backplane.  SAS expanders can be daisy chained many deep.  You can drop
it into a PCIe x4 or greater slot from which it only draws power--no
data pins are connected.  Or if no slots are available you can mount it
to the side wall of your rack server chassis and power it via the 4 pin
Molex plug.  This requires a drill, brass or plastic standoffs, and DIY
skills.  I use this option as it provides a solid mount for un/plugging
the SAS cables, and being side mounted neither it nor the cables
interfere with airflow.

You'll plug the 9260-4i into one port of the Intel expander.  You'll
need another SFF-8087 cable for this:

http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015

You will plug your drive backplane cable into another of the 6 SFF-8087
ports on the Intel.  Into a 3rd port you will plug an SFF-8087 breakout
cable to give you 4 individual drive connections.  You will plug two of
these into your two SSDs.

http://www.newegg.com/Product/Product.aspx?Item=N82E16816116097

If you have no internal 2.5/3.5" drive brackets free for the SSDs and
you'd prefer not to drill (more) holes in the chassis to directly mount
them or a new cage for them, simply use some heavy duty Velcro squares.
2" squares are fine.

Worst case scenario you're looking at less than $1000 to cure your IO
bottlenecks, or at the very least mitigate them to a minor annoyance
instead of a show stopper.  And if you free up some money for some
external JBODs and drives in the future, you can route 2 of the unused
SFF-8087 connectors of the Intel Expander out the back panel to attach
expander JBOD enclosures, using one of these and 2 more of the 8087
cables up above:

http://www.ebay.com/itm/8-Port-SAS-SATA-6G-Dual-SFF-8088-mini-SAS-to-SFF-8087-PCIe-Adapter-w-LP-Bracket-/390508767029

I'm sure someone makes a 3 port model but 10 minutes of searching didn't
turn one up.  These panel adapters are application specific.  Most are
made to be mounted in a disk enclosure where the HBA/RAID card is on the
outside of the chassis, on the other end of the 8088 cable.  This two
port model is designed to be inside a server chassis, where the HBA
connects to the internal 8087 ports.  Think Ethernet x-over cable.

The 9260-4i supports up to 128 drives.  This Intel expander and a panel
connector allow you to get there with external JBODs.  The only caveat
being that you're limited to "only" 4.8 GB/s to/from all the disks.
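
One way to arrive at that 4.8 GB/s figure, sketched in Python.  The
assumptions here are mine: the 9260-4i has a single x4 wide port, each
lane runs at the 6 Gb/s SAS 2.0 line rate with 8b/10b encoding, and the
number counts both directions together:

```python
# Rough SAS link bandwidth estimate; all constants are assumptions
# stated in the text above, not measurements.
LANES = 4                  # one x4 SFF-8087 wide port on the "-4i"
LINE_RATE_MBIT = 6000      # 6 Gb/s per lane, expressed in megabits/sec

# 8b/10b encoding carries 8 payload bits in every 10 bits on the wire.
payload_mbit = LINE_RATE_MBIT * 8 // 10     # payload megabits/sec per lane
per_lane_mbyte = payload_mbit // 8          # megabytes/sec per lane

one_way_mbyte = LANES * per_lane_mbyte      # one direction, all lanes
both_ways_mbyte = 2 * one_way_mbyte         # to and from the disks combined

print(f"per lane:        {per_lane_mbyte} MB/s")
print(f"one direction:   {one_way_mbyte / 1000} GB/s")
print(f"both directions: {both_ways_mbyte / 1000} GB/s")
```

So in any one direction the ceiling is about half that, shared across
every disk behind the expander.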

-- 
Stan
