
Re: Fwd: Sudden File System Corruption

To: stan@xxxxxxxxxxxxxxxxx
Subject: Re: Fwd: Sudden File System Corruption
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Sat, 7 Dec 2013 10:36:41 -0800
Cc: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
On Sat, Dec 7, 2013 at 3:12 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
On 12/6/2013 4:14 PM, Mike Dacre wrote:
> On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
...
> UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science         xfs
> defaults,inode64      1 0

Your RAID card has persistent write cache (BBWC) and we know it's
enabled from your tool output. By default XFS assumes BBWC is not
present, and uses write barriers to ensure order/consistency. Using
barriers on top of BBWC will be detrimental to write performance, for a
couple of reasons:

1. It prevents the controller from optimizing writeback patterns
2. It causes a portion, or all, of the write cache to be flushed frequently

Add 'nobarrier' to your mount options to avoid this problem. It should
speed up many, if not all, write operations considerably, which will in
turn decrease seek contention amongst jobs. Currently your write cache
isn't working nearly as well as it should, and in fact could be
operating horribly.
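
As an illustration of the fstab change, here is a minimal Python sketch that appends an option to the fourth (options) field of an fstab entry. The helper name add_mount_option is hypothetical and not from this thread; the entry is the one quoted above. It only builds the new line as a string and does not touch /etc/fstab.

```python
def add_mount_option(fstab_line: str, option: str) -> str:
    """Return fstab_line with `option` added to its mount options field.

    fstab fields are: device, mount point, fs type, options, dump, pass.
    Whitespace between fields is normalized to single spaces.
    """
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fstab fields")
    options = fields[3].split(",")
    if option not in options:  # avoid adding the same option twice
        options.append(option)
    fields[3] = ",".join(options)
    return " ".join(fields)


# The entry quoted earlier in this message.
original = (
    "UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs "
    "defaults,inode64 1 0"
)
updated = add_mount_option(original, "nobarrier")
print(updated)
```

The resulting options field reads defaults,inode64,nobarrier. The option is only appropriate when the controller's write cache really is battery or flash backed, as it is here.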

> On the slave nodes, I managed to reduce the demand on the disks by adding
> the actimeo=60 mount option. Prior to doing this I would sometimes see the
> disk being negatively affected by enormous numbers of getattr requests.
> Here is the fstab mount on the nodes:
>
> 192.168.2.1:/science            /science         nfs
> defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw 0 0

A one-minute attribute cache lifetime seems a little high for a
compute cluster. But if you've had no ill effects and it squelched the
getattr flood, that's good.

...
> Correct, I am not consciously aligning the XFS to the RAID geometry, I
> actually didn't know that was possible.

XFS alignment is not something to worry about in this case.

...
>> So it's a small compute cluster using NFS over Infiniband for shared
>> file access to a low performance RAID6 array. The IO resource sharing
>> is automatic. But AFAIK there's no easy way to enforce IO quotas on
>> users or processes, if at all. You may simply not have sufficient IO to
>> go around. Let's ponder that.
>
> I have tried a few things to improve IO allocation. BetterLinux has a
> cgroup control suite that allows on-the-fly user-level IO adjustments,
> however I found it to be quite cumbersome.

This isn't going to work well because a tiny IO stream can seek the
disks to death, such as a complex find command, ls -R, etc. A single
command such as these can generate thousands of seeks. Shaping/limiting
user IO won't affect this.
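
To put rough numbers on why seeks are the scarce resource, here is a small Python sketch of the seek budget discussed in this thread: 14 data spindles at about 150 seeks/sec each, shared by 23 compute nodes. The function and variable names are illustrative only, and the 150 seeks/sec figure is the thread's estimate for a 7.2k RPM drive, not a measurement.

```python
# Estimates taken from the discussion in this thread (not measured values).
DATA_SPINDLES = 14        # data drives in the RAID6 array
SEEKS_PER_SPINDLE = 150   # assumed seeks/sec for one 7.2k RPM drive
COMPUTE_NODES = 23        # nodes sharing the array over NFS


def seek_budget(spindles: int, seeks_per_spindle: int, nodes: int):
    """Return (total seeks/sec for the array, even share per node)."""
    total = spindles * seeks_per_spindle
    per_node = total / nodes
    return total, per_node


total, per_node = seek_budget(DATA_SPINDLES, SEEKS_PER_SPINDLE, COMPUTE_NODES)
print(f"array total:       {total} seeks/sec")
print(f"even share/node:   {per_node:.1f} seeks/sec")
print(f"share of one disk: {per_node / SEEKS_PER_SPINDLE:.2f}")
```

With these inputs the even share is roughly 91 seeks/sec per node, about 0.6 of a single drive, so one seek-heavy command can use up several nodes' worth of the budget on its own.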

...
>> Looking at the math, you currently have approximately 14*150=2100
>> seeks/sec capability with 14x 7.2k RPM data spindles. That's less than
>> 100 seeks/sec per compute node, i.e. each node is getting about 2/3 of
>> the performance of a single SATA disk from this array. This simply
>> isn't sufficient for servicing a 23 node cluster, unless all workloads
>> are compute bound, and none IO/seek bound. Given the overload/crash
>> that brought you to our attention, I'd say some of your workloads are
>> obviously IO/seek bound. I'd say you probably need more/faster disks.
>> Or you need to identify which jobs are IO/seek heavy and schedule them
>> so they're not running concurrently.
>
> Yes, this is a problem. We sadly lack the resources to do much better than
> this. We have recently been adding extra storage by just chaining together
> USB3 drives with RAID and LVM... which is cumbersome and slow, but cheaper.

USB disk is generally a recipe for disaster. Plenty of horror stories
on both this list and linux-raid regarding USB connected drives,
enclosures, etc. I pray you don't run into those problems.

> My current solution is to be on the alert for high IO jobs, and to move
> them to a specific torque queue that limits the number of concurrent jobs.
> This works, but I have not found a way to do it automatically.
> Thankfully, with a 12 member lab, it is actually not terribly complex to
> handle, but I would definitely prefer a more comprehensive solution. I
> don't doubt that the huge IO and seek demands we put on these disks will
> cause more problems in the future.

Your LSI 9260 controller supports using SSDs for read/write flash cache.
LSI charges $279 for it. It's called CacheCade Pro:

http://www.lsi.com/products/raid-controllers/pages/megaraid-cachecade-pro-software.aspx


Connect two good quality fast SSDs to the controller, such as:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820147192

Two SSDs, mirrored, to prevent cached writes from being lost if a single
SSD fails. You now have a ~90K IOPS, 128GB, 500MB/s low latency
read/write cache in front of your RAID6 array. This should go a long
way toward eliminating your bottlenecks. You can accomplish this for
~$550 assuming you have two backplane drive slots free for the SSDs. If
not, you add one of these for $279:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816117207

This is an Intel 24 port SAS expander, the same device as in your drive
backplane. SAS expanders can be daisy chained many deep. You can drop
it into a PCIe x4 or greater slot from which it only draws power--no
data pins are connected. Or if no slots are available you can mount it
to the side wall of your rack server chassis and power it via the 4 pin
Molex plug. This requires a drill, brass or plastic standoffs, and DIY
skills. I use this option as it provides a solid mount for un/plugging
the SAS cables, and being side mounted neither it nor the cables
interfere with airflow.

You'll plug the 9260-4i into one port of the Intel expander. You'll
need another SFF-8087 cable for this:

http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015

You will plug your drive backplane cable into another of the 6 SFF-8087
ports on the Intel. Into a 3rd port you will plug an SFF-8087 breakout
cable to give you 4 individual drive connections. You will plug two of
these into your two SSDs.

http://www.newegg.com/Product/Product.aspx?Item=N82E16816116097

If you have no internal 2.5/3.5" drive brackets free for the SSDs and
you'd prefer not to drill (more) holes in the chassis to directly mount
them or a new cage for them, simply use some heavy duty Velcro squares,
2" is fine.

Worst case scenario you're looking at less than $1000 to cure your IO
bottlenecks, or at the very least mitigate them to a minor annoyance
instead of a show stopper. And if you free up some money for some
external JBODs and drives in the future, you can route 2 of the unused
SFF-8087 connectors of the Intel Expander out the back panel to attach
expander JBOD enclosures, using one of these and 2 more of the 8087
cables up above:

http://www.ebay.com/itm/8-Port-SAS-SATA-6G-Dual-SFF-8088-mini-SAS-to-SFF-8087-PCIe-Adapter-w-LP-Bracket-/390508767029

I'm sure someone makes a 3 port model but 10 minutes of searching didn't
turn one up. These panel adapters are application specific. Most are
made to be mounted in a disk enclosure where the HBA/RAID card is on the
outside of the chassis, on the other end of the 8088 cable. This two
port model is designed to be inside a server chassis, where the HBA
connects to the internal 8087 ports. Think Ethernet x-over cable.

The 9260-4i supports up to 128 drives. This Intel expander and a panel
connector allow you to get there with external JBODs. The only caveat
being that you're limited to "only" 4.8 GB/s to/from all the disks.
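
One way to arrive at a figure like 4.8 GB/s is sketched below in Python. It assumes the controller's single 4-lane SFF-8087 port runs at 6 Gb/s per lane with 8b/10b line encoding, and that the figure counts both directions together. These assumptions and the function name are mine, not stated in the thread.

```python
LANES = 4                        # one SFF-8087 connector carries 4 SAS lanes
LINE_RATE_BITS = 6_000_000_000   # assumed 6 Gb/s signalling rate per lane


def sas_payload_bytes_per_sec(lanes: int, line_rate_bits: int) -> int:
    """Payload bytes/sec in one direction, assuming 8b/10b encoding.

    8b/10b sends 10 bits on the wire for every 8 bits of payload.
    """
    payload_bits = line_rate_bits * 8 // 10
    return lanes * payload_bits // 8


one_way = sas_payload_bytes_per_sec(LANES, LINE_RATE_BITS)
both_ways = 2 * one_way  # SAS links are full duplex

print(f"one direction:   {one_way / 1e9:.1f} GB/s")
print(f"both directions: {both_ways / 1e9:.1f} GB/s")
```

Under these assumptions the link carries 2.4 GB/s in each direction, 4.8 GB/s combined, before any protocol overhead.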

--
Stan

Hi Stan,

Thanks for the great advice, I think you are on to something there. I will look into doing this in the next week or so when I have more time. I added 'nobarrier' to my mount options.

Thanks again, I will let you know how it goes after I have upgraded.

Best,

Mike