
Fwd: Fwd: Sudden File System Corruption

To: xfs@xxxxxxxxxxx
Subject: Fwd: Fwd: Sudden File System Corruption
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Fri, 6 Dec 2013 15:15:29 -0800

---------- Forwarded message ----------
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Fri, Dec 6, 2013 at 2:14 PM
Subject: Re: Fwd: Sudden File System Corruption
To: stan@xxxxxxxxxxxxxxxxx

On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
On 12/5/2013 9:58 AM, Mike Dacre wrote:

> On Thu, Dec 5, 2013 at 12:10 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
>> On 12/4/2013 8:55 PM, Mike Dacre wrote:
>> ...
> Definitely RAID6
> 2. Strip size? (eg 512KB)
> 64KB

Ok, so 64*14 = 896KB stripe. This seems pretty sane for a 14 spindle
parity array and mixed workloads.

> 4. BBU module?
> Yes. iBBU, state optimal, 97% charged.
> 5. Is write cache enabled?
> Yes: Cached IO and Write Back with BBU are enabled.

I should have pointed you to this earlier:

but we've got most of it already. We don't have your fstab mount
options. Please provide that.

UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science         xfs   defaults,inode64      1 0

On the slave nodes, I managed to reduce the demand on the disks by adding the actimeo=60 mount option. Prior to doing this I would sometimes see the disk being negatively affected by enormous numbers of getattr requests. Here is the fstab mount on the nodes:

            /science         nfs   defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw  0 0

> This is also attached as xfs_info.txt

You're not aligning XFS to the RAID geometry (unless you're overriding
in fstab). No alignment is good though for small (<896KB) file
allocations but less than optimal for large streaming allocation writes.
But it isn't a factor in the problems you reported.

Correct, I am not consciously aligning XFS to the RAID geometry; I actually didn't know that was possible.
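
For reference, the geometry arithmetic discussed above (64KB strip, 14 data spindles) and the matching XFS alignment values can be sketched in a few lines of Python. The function name is hypothetical. `su`/`sw` are the mkfs.xfs stripe options, and `sunit`/`swidth` are the corresponding mount options, which are expressed in 512-byte sectors. This only illustrates the numbers; it is not a recommendation to realign this particular filesystem.

```python
def xfs_alignment(strip_kib, data_disks):
    """Derive the full stripe width and XFS alignment strings from RAID geometry.

    strip_kib  -- per-disk strip (chunk) size in KiB, e.g. 64
    data_disks -- number of data spindles (parity disks excluded), e.g. 14
    """
    stripe_kib = strip_kib * data_disks
    sectors_per_kib = 2  # mount-time sunit/swidth are counted in 512-byte sectors
    return {
        "stripe_kib": stripe_kib,
        "mkfs_opts": f"-d su={strip_kib}k,sw={data_disks}",
        "mount_opts": f"sunit={strip_kib * sectors_per_kib},"
                      f"swidth={stripe_kib * sectors_per_kib}",
    }


geometry = xfs_alignment(64, 14)
print(geometry["stripe_kib"])   # 896
print(geometry["mkfs_opts"])    # -d su=64k,sw=14
print(geometry["mount_opts"])   # sunit=128,swidth=1792
```
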

> Good point. These happened while trying to ls. I am not sure why I can't
> find them in the log, they printed out to the console as 'Input/Output'
> errors, simply stating that the ls command failed.

We look for SCSI IO errors preceding an XFS error as a causal indicator.
I didn't see that here. You could have run into the bug Ben described
earlier. I can't really speak to the console errors.

>> With delaylog enabled, which I believe it is in RHEL/CentOS 6, a single
>> big rm shouldn't kill the disks. But with the combination of other
>> workloads it seems you may have been seeking the disks to death.
> That is possible, workloads can get really high sometimes. I am not sure
> how to control that without significantly impacting performance - I want a
> single user to be able to use 98% IO capacity sometimes... but other times
> I want the load to be split amongst many users.

You can't control the seeking at the disks. You can only schedule
workloads together that don't compete for seeks. And if you have one
metadata or random read/write heavy workload, with this SATA RAID6
array, it will need exclusive access for the duration of execution, or
the portion that does all the random IO. Otherwise other workloads
running concurrently will crawl while competing for seek bandwidth.

> Also, each user can
> execute jobs simultaneously on 23 different computers, each accessing the
> same drive via NFS. This is a great system most of the time, but sometimes
> the workloads on the drive get really high.

So it's a small compute cluster using NFS over Infiniband for shared
file access to a low performance RAID6 array. The IO resource sharing
is automatic. But AFAIK there's no easy way to enforce IO quotas on
users or processes, if at all. You may simply not have sufficient IO to
go around. Let's ponder that.

I have tried a few things to improve IO allocation. BetterLinux has a cgroup control suite that allows on-the-fly user-level IO adjustments; however, I found it to be quite cumbersome.

I considered an ugly hack in which I would run two NFS servers, one on the network to the login node and one on the network to the other nodes, so that I could use cgroups to limit IO by process, effectively guaranteeing a 5% IO capacity window to the login node even if the compute nodes were all going crazy. I quickly came to the conclusion that I don't know enough about filesystems, NFS, or the Linux kernel to do this effectively: I would almost certainly just make an ugly mess that accomplished little more than breaking a lot of things, while not solving the problem. I still think it is a good idea in principle; I just recognize that it would need to be implemented by someone with a lot more experience than me, and that it would probably be a major undertaking.

Looking at the math, you currently have approximately 14*150=2100
seeks/sec capability with 14x 7.2k RPM data spindles. That's less than
100 seeks/sec per compute node, i.e. each node is getting about 2/3 of
the performance of a single SATA disk from this array. This simply
isn't sufficient for servicing a 23 node cluster, unless all workloads
are compute bound, and none IO/seek bound. Given the overload/crash
that brought you to our attention, I'd say some of your workloads are
obviously IO/seek bound. I'd say you probably need more/faster disks.
Or you need to identify which jobs are IO/seek heavy and schedule them
so they're not running concurrently.
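
The back-of-the-envelope seek budget above can be reproduced with a small Python sketch. The 150 seeks/sec per 7.2k RPM disk figure is the rough estimate used in this thread, not a measured value, and the function name is hypothetical.

```python
def seek_budget(data_spindles, seeks_per_disk, nodes):
    """Return (total seeks/sec, seeks/sec per node, per-node share of one disk)."""
    total = data_spindles * seeks_per_disk
    per_node = total / nodes
    fraction_of_one_disk = per_node / seeks_per_disk
    return total, per_node, fraction_of_one_disk


# 14 data spindles, ~150 seeks/sec each (assumed), shared by 23 compute nodes.
total, per_node, fraction = seek_budget(14, 150, 23)
print(f"{total} seeks/sec total")
print(f"{per_node:.1f} seeks/sec per node")
print(f"{fraction:.2f} of a single disk per node")
```

With these inputs the per-node share comes out a little over 91 seeks/sec, roughly 0.6 of one disk, which is the "about 2/3" figure quoted above.
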

Yes, this is a problem. We sadly lack the resources to do much better than this; we have recently been adding extra storage by just chaining together USB3 drives with RAID and LVM, which is cumbersome and slow, but cheaper.

My current solution is to be on the alert for high IO jobs, and to move them to a specific torque queue that limits the number of concurrent jobs. This works, but I have not found a way to do it automatically. Thankfully, with a 12 member lab, it is actually not terribly complex to handle, but I would definitely prefer a more comprehensive solution. I don't doubt that the huge IO and seek demands we put on these disks will cause more problems in the future.
>> http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
>> "As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much
>> of the parallelization in XFS."
>> echo deadline > /sys/block/sda/queue/scheduler
> Wow, this is huge, I can't believe I missed that. I have switched it to
> noop now as we use write caching. I have been trying to figure out for a
> while why I would keep getting timeouts when the NFS load was high. If you
> have any other suggestions for how I can improve performance, I would
> greatly appreciate it.

This may not fix NFS timeouts entirely but it should help. If the NFS
operations are seeking the disks to death you may still see timeouts.
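
To confirm which scheduler is actually in effect after a switch like the one quoted above, the sysfs file can be read back; the kernel marks the active scheduler with square brackets. Below is a minimal Python sketch, assuming the device is `sda` as in the quoted command and that the file has the usual space-separated form (for example `noop deadline [cfq]`). The function names are hypothetical.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split a sysfs scheduler line into (active, available) scheduler names."""
    names = text.split()
    # The active scheduler is the one wrapped in square brackets.
    active = next(
        (n.strip("[]") for n in names if n.startswith("[") and n.endswith("]")),
        None,
    )
    available = [n.strip("[]") for n in names]
    return active, available


def current_scheduler(device="sda"):
    """Read the scheduler line for a block device from sysfs (Linux only)."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_scheduler(path.read_text())


# Parsing a sample line, so the example runs without touching /sys:
print(parse_scheduler("noop deadline [cfq]"))
```

Note that a value written with echo does not survive a reboot, so it is worth re-checking after the server restarts.
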

>> This one simple command line may help pretty dramatically, immediately,
>> assuming your hardware array parameters aren't horribly wrong for your
>> workloads, and your XFS alignment correctly matches the hardware geometry.
> Great, thanks. Our workloads vary considerably as we are a biology
> research lab, sometimes we do lots of seeks, other times we are almost
> maxing out read or write speed with massively parallel processes all
> accessing the disk at the same time.

Do you use munin or something similar? Sample output:

Project page:

I have been using Ganglia, but it doesn't have good NFS monitoring as far as I can tell. I will check out Munin, thanks for the advice.

It also has an NFS module and many others. The storage oriented metrics
may be very helpful to you. You would install munin-node on the NFS
server and all compute nodes, and munin on a collector/web server. This
will allow you to cross reference client and server NFS loads. You can
then cross reference the time in your PBS logs to see which users were
running which jobs when IO spikes occur on the NFS server. You'll know
exactly which workloads, or combination thereof, are causing IO spikes.
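
The cross-referencing step described above amounts to an interval lookup: given the time of an IO spike, list the jobs that were running at that moment. Here is a minimal Python sketch. The job records and the spike time are invented for illustration, and extracting start and end times from the real PBS/Torque logs is left out because the log format is not shown in this thread.

```python
from datetime import datetime


def jobs_active_at(jobs, spike_time):
    """Return the jobs whose [start, end] interval contains spike_time."""
    return [job for job in jobs if job["start"] <= spike_time <= job["end"]]


# Hypothetical job records (id, user, start, end); not real data.
jobs = [
    {"id": "1001", "user": "alice",
     "start": datetime(2013, 12, 4, 18, 0), "end": datetime(2013, 12, 4, 19, 30)},
    {"id": "1002", "user": "bob",
     "start": datetime(2013, 12, 4, 19, 0), "end": datetime(2013, 12, 4, 20, 0)},
    {"id": "1003", "user": "carol",
     "start": datetime(2013, 12, 4, 21, 0), "end": datetime(2013, 12, 4, 22, 0)},
]

# Hypothetical spike timestamp, as might be read off a monitoring graph.
spike = datetime(2013, 12, 4, 19, 15)
for job in jobs_active_at(jobs, spike):
    print(job["id"], job["user"])
```

Running the same lookup over several spikes and counting how often each user or job type appears would point to the workloads worth moving into a restricted queue.
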


