Fwd: Sudden File System Corruption
Mike Dacre
mike.dacre at gmail.com
Sat Dec 7 12:36:41 CST 2013
On Sat, Dec 7, 2013 at 3:12 AM, Stan Hoeppner <stan at hardwarefreak.com> wrote:
> On 12/6/2013 4:14 PM, Mike Dacre wrote:
> > On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan at hardwarefreak.com
> >wrote:
> ...
> > UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs defaults,inode64 1 0
>
> Your RAID card has persistent write cache (BBWC) and we know it's
> enabled from your tool output. By default XFS assumes BBWC is not
> present, and uses write barriers to ensure order/consistency. Using
> barriers on top of BBWC will be detrimental to write performance, for a
> couple of reasons:
>
> 1. Prevents the controller from optimizing writeback patterns
> 2. A portion of the write cache, or all of it, is frequently flushed
>
> Add 'nobarrier' to your mount options to avoid this problem. It should
> speed up many, if not all, write operations considerably, which will in
> turn decrease seek contention amongst jobs. Currently your write cache
> isn't working nearly as well as it should, and in fact could be
> operating horribly.
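>
> For clarity, the resulting fstab line is just your existing entry with
> nobarrier appended:
>
>   UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs defaults,inode64,nobarrier 1 0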
>
> > On the slave nodes, I managed to reduce the demand on the disks by
> > adding the actimeo=60 mount option. Prior to doing this I would
> > sometimes see the disk being negatively affected by enormous numbers
> > of getattr requests. Here is the fstab mount on the nodes:
> >
> > 192.168.2.1:/science /science nfs defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw 0 0
>
> A one minute attribute cache lifetime seems a little high for a
> compute cluster. But if you've had no ill effects and it squelched the
> getattr flood, this is good.
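>
> If you want to confirm the getattr flood stays gone, nfsstat on a
> client shows per-op call counts; a quick check might be:
>
>   # per-op NFS RPC counts on a client; compare the getattr
>   # column before and after a representative workload
>   nfsstat -c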
>
> ...
> > Correct, I am not consciously aligning the XFS to the RAID geometry;
> > I actually didn't know that was possible.
>
> XFS alignment is not something to worry about in this case.
>
> ...
> >> So it's a small compute cluster using NFS over Infiniband for shared
> >> file access to a low performance RAID6 array. The IO resource sharing
> >> is automatic. But AFAIK there's no easy way to enforce IO quotas on
> >> users or processes, if at all. You may simply not have sufficient IO to
> >> go around. Let's ponder that.
> >
> > I have tried a few things to improve IO allocation. BetterLinux has a
> > cgroup control suite that allows on-the-fly user-level IO adjustments;
> > however, I found it quite cumbersome.
>
> This isn't going to work well because a tiny IO stream, such as a
> complex find command or ls -R, can seek the disks to death. A single
> such command can generate thousands of seeks. Shaping/limiting
> user IO won't affect this.
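>
> For reference, the plain-cgroup version of that kind of throttle looks
> roughly like this (a sketch, assuming cgroup v1 blkio and that the
> array is block device 8:0; adjust the major:minor and the PID to suit):
>
>   mkdir /sys/fs/cgroup/blkio/lowprio
>   # cap this group at 10 MB/s reads and writes on device 8:0
>   echo "8:0 10485760" > /sys/fs/cgroup/blkio/lowprio/blkio.throttle.read_bps_device
>   echo "8:0 10485760" > /sys/fs/cgroup/blkio/lowprio/blkio.throttle.write_bps_device
>   # confine a process (1234 is a placeholder PID)
>   echo 1234 > /sys/fs/cgroup/blkio/lowprio/tasks
>
> which only demonstrates the point above: a bytes/sec cap never triggers
> on a metadata crawl, yet the crawl still eats all your seeks.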
>
> ...
> >> Looking at the math, you currently have approximately 14*150=2100
> >> seeks/sec capability with 14x 7.2k RPM data spindles. That's less than
> >> 100 seeks/sec per compute node (2100/23 is roughly 91), i.e. each node
> >> is getting about 2/3 of the performance of a single SATA disk from
> >> this array. This simply
> >> isn't sufficient for servicing a 23 node cluster, unless all workloads
> >> are compute bound, and none IO/seek bound. Given the overload/crash
> >> that brought you to our attention, I'd say some of your workloads are
> >> obviously IO/seek bound. I'd say you probably need more/faster disks.
> >> Or you need to identify which jobs are IO/seek heavy and schedule them
> >> so they're not running concurrently.
> >
> > Yes, this is a problem. We sadly lack the resources to do much better
> > than this; we have recently been adding extra storage by just chaining
> > together USB3 drives with RAID and LVM... which is cumbersome and
> > slow, but cheaper.
>
> USB disks are generally a recipe for disaster. Plenty of horror stories
> on both this list and linux-raid regarding USB connected drives,
> enclosures, etc. I pray you don't run into those problems.
>
> > My current solution is to be on the alert for high IO jobs, and to
> > move them to a specific torque queue that limits the number of
> > concurrent jobs. This works, but I have not found a way to do it
> > automatically. Thankfully, with a 12-member lab, it is actually not
> > terribly complex to handle, but I would definitely prefer a more
> > comprehensive solution. I don't doubt that the huge IO and seek
> > demands we put on these disks will cause more problems in the future.
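> >
> > For reference, the queue itself is just a few qmgr calls (a sketch;
> > the name 'highio' and the cap of 2 concurrent jobs are my own
> > choices, not anything standard):
> >
> >   qmgr -c "create queue highio queue_type = execution"
> >   qmgr -c "set queue highio max_running = 2"
> >   qmgr -c "set queue highio enabled = true"
> >   qmgr -c "set queue highio started = true"
> >
> > Spotting the offenders and qmove'ing them into the queue is the part
> > I still do by hand.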
>
> Your LSI 9260 controller supports using SSDs for read/write flash cache.
> LSI charges $279 for it. It's called CacheCade Pro:
>
>
> http://www.lsi.com/products/raid-controllers/pages/megaraid-cachecade-pro-software.aspx
>
> Connect two good quality fast SSDs to the controller, such as:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820147192
>
> Two SSDs, mirrored, to prevent cached writes from being lost if a single
> SSD fails. You now have a ~90K IOPS, 128GB, 500MB/s low latency
> read/write cache in front of your RAID6 array. This should go a long
> way toward eliminating your bottlenecks. You can accomplish this for
> ~$550 assuming you have two backplane drive slots free for the SSDs. If
> not, you add one of these for $279:
>
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816117207
>
> This is an Intel 24 port SAS expander, the same device as in your drive
> backplane. SAS expanders can be daisy chained many deep. You can drop
> it into a PCIe x4 or greater slot from which it only draws power--no
> data pins are connected. Or if no slots are available you can mount it
> to the side wall of your rack server chassis and power it via the 4 pin
> Molex plug. This requires a drill, brass or plastic standoffs, and DIY
> skills. I use this option as it provides a solid mount for un/plugging
> the SAS cables, and being side mounted neither it nor the cables
> interfere with airflow.
>
> You'll plug the 9260-4i into one port of the Intel expander. You'll
> need another SFF-8087 cable for this:
>
> http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015
>
> You will plug your drive backplane cable into another of the 6 SFF-8087
> ports on the Intel. Into a 3rd port you will plug an SFF-8087 breakout
> cable to give you 4 individual drive connections. You will plug two of
> these into your two SSDs.
>
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816116097
>
> If you have no internal 2.5/3.5" drive brackets free for the SSDs and
> you'd prefer not to drill (more) holes in the chassis to directly mount
> them or a new cage for them, simply use some heavy duty Velcro squares,
> 2" is fine.
>
> Worst case scenario you're looking at less than $1000 to cure your IO
> bottlenecks, or at the very least mitigate them to a minor annoyance
> instead of a show stopper. And if you free up some money for some
> external JBODs and drives in the future, you can route 2 of the unused
> SFF-8087 connectors of the Intel Expander out the back panel to attach
> expander JBOD enclosures, using one of these and 2 more of the 8087
> cables up above:
>
>
> http://www.ebay.com/itm/8-Port-SAS-SATA-6G-Dual-SFF-8088-mini-SAS-to-SFF-8087-PCIe-Adapter-w-LP-Bracket-/390508767029
>
> I'm sure someone makes a 3 port model but 10 minutes of searching didn't
> turn one up. These panel adapters are application specific. Most are
> made to be mounted in a disk enclosure where the HBA/RAID card is on the
> outside of the chassis, on the other end of the 8088 cable. This two
> port model is designed to be inside a server chassis, where the HBA
> connects to the internal 8087 ports. Think Ethernet crossover cable.
>
> The 9260-4i supports up to 128 drives. This Intel expander and a panel
> connector allow you to get there with external JBODs. The only caveat
> is that you're limited to "only" 4.8 GB/s to/from all the disks.
>
> --
> Stan
>
Hi Stan,
Thanks for the great advice; I think you are on to something there. I will
look into doing this in the next week or so when I have more time. I added
'nobarrier' to my mount options.
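For anyone following along, the change can be applied on a live filesystem
without a reboot:

  mount -o remount,nobarrier /science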
Thanks again, I will let you know how it goes after I have upgraded.
Best,
Mike