|Subject:||Fwd: Fwd: Sudden File System Corruption|
|From:||Mike Dacre <mike.dacre@xxxxxxxxx>|
|Date:||Fri, 6 Dec 2013 15:15:29 -0800|
---------- Forwarded message ----------
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Fri, Dec 6, 2013 at 2:14 PM
Subject: Re: Fwd: Sudden File System Corruption
On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science        xfs    defaults,inode64     1 0
On the slave nodes, I managed to reduce the demand on the disks by adding the actimeo=60 mount option. Prior to doing this I would sometimes see the disk being negatively affected by enormous numbers of getattr requests. Here is the fstab mount on the nodes:
192.168.2.1:/science           /science        nfs    defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw  0 0
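One way to check whether getattr traffic still dominates after a change like actimeo=60 is to read the NFS server's own counters. The following is a minimal sketch, assuming a Linux NFS server that exposes per-procedure NFSv3 counters on the "proc3" line of /proc/net/rpc/nfsd; the function names are made up for illustration, and the field order should be verified against your kernel before trusting the numbers.

```python
# NFSv3 procedure names in the order the Linux server is assumed to report
# them on the "proc3" line of /proc/net/rpc/nfsd (verify on your kernel).
NFS3_PROCS = [
    "null", "getattr", "setattr", "lookup", "access", "readlink", "read",
    "write", "create", "mkdir", "symlink", "mknod", "remove", "rmdir",
    "rename", "link", "readdir", "readdirplus", "fsstat", "fsinfo",
    "pathconf", "commit",
]


def parse_proc3(text):
    """Return {procedure: count} from the contents of /proc/net/rpc/nfsd."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "proc3":
            # fields[1] is the number of counters that follow it.
            counts = [int(x) for x in fields[2:2 + int(fields[1])]]
            return dict(zip(NFS3_PROCS, counts))
    return {}


def getattr_share(counts):
    """Fraction of all NFSv3 calls that were GETATTR (0.0 if no calls)."""
    total = sum(counts.values())
    return counts.get("getattr", 0) / total if total else 0.0


if __name__ == "__main__":
    try:
        with open("/proc/net/rpc/nfsd") as f:
            counts = parse_proc3(f.read())
    except OSError:
        counts = {}  # not an NFS server, or the file is not readable
    print(f"getattr share: {getattr_share(counts):.1%}")
```

The counters are cumulative since the server started, so comparing two readings taken some minutes apart gives a better picture than a single reading.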
Correct, I am not consciously aligning the XFS filesystem to the RAID geometry; I actually didn't know that was possible.
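For reference, alignment is expressed to XFS as a stripe unit (the RAID chunk size) and a stripe width (the number of data disks). The sketch below derives both from an array's geometry. The RAID level, disk count and chunk size in the usage note are hypothetical, since the actual geometry of this array is not stated here, and the helper names are illustrative only. It relies on mkfs.xfs accepting `-d su=...,sw=...` and on the `sunit`/`swidth` mount options being given in 512-byte units.

```python
# Parity drives per stripe for a few common RAID levels.
PARITY_DISKS = {"raid0": 0, "raid5": 1, "raid6": 2}


def data_disks(raid_level, total_disks):
    """Number of drives in each stripe that hold data rather than parity."""
    n = total_disks - PARITY_DISKS[raid_level]
    if n < 1:
        raise ValueError("not enough disks for this RAID level")
    return n


def mkfs_alignment(raid_level, total_disks, chunk_kib):
    """Alignment argument for mkfs.xfs: su = chunk size, sw = data disks."""
    return f"-d su={chunk_kib}k,sw={data_disks(raid_level, total_disks)}"


def mount_alignment(raid_level, total_disks, chunk_kib):
    """Equivalent mount options; sunit and swidth are in 512-byte units."""
    sunit = chunk_kib * 2  # 1 KiB is two 512-byte units
    swidth = sunit * data_disks(raid_level, total_disks)
    return f"sunit={sunit},swidth={swidth}"
```

With a hypothetical 8-drive RAID6 array using a 256 KiB chunk, `mkfs_alignment("raid6", 8, 256)` gives `-d su=256k,sw=6` and `mount_alignment("raid6", 8, 256)` gives `sunit=512,swidth=3072`. Note that mkfs.xfs creates a new filesystem and destroys existing data, so on a filesystem already in use only the mount options are relevant.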
I have tried a few things to improve IO allocation. BetterLinux has a cgroup control suite that allows on-the-fly user-level IO adjustments; however, I found it to be quite cumbersome.
I considered an ugly hack in which I would run two NFS servers, one on the network to the login node, and one on the network to the other nodes, so that I could use cgroups to limit IO by process, effectively guaranteeing a 5% IO capacity window to the login node, even if the compute nodes were all going crazy. I quickly came to the conclusion that I don't know enough about filesystems, NFS, or the Linux kernel to do this effectively: I would almost certainly just make an ugly mess that accomplished little more than breaking a lot of things, while not solving the problem. I still think it is a good idea in principle, though; I just recognize that it would need to be implemented by someone with a lot more experience than me, and that it would probably be a major undertaking.
Looking at the math, you currently have approximately 14*150=2100
Yes, this is a problem. We sadly lack the resources to do much better than this. We have recently been adding extra storage by just chaining together USB3 drives with RAID and LVM, which is cumbersome and slow, but cheaper.
My current solution is to be on the alert for high IO jobs, and to move them to a specific torque queue that limits the number of concurrent jobs. This works, but I have not found a way to do it automatically. Thankfully, with a 12 member lab, it is actually not terribly complex to handle, but I would definitely prefer a more comprehensive solution. I don't doubt that the huge IO and seek demands we put on these disks will cause more problems in the future.
I have been using Ganglia, but it doesn't have good NFS monitoring as far as I can tell. I will check out Munin, thanks for the advice.
It also has an NFS module and many others. The storage oriented metrics
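As a starting point for spotting high IO jobs automatically, a compute node can sample /proc/<pid>/io twice and flag the processes whose byte counters grew fastest. This is only a rough sketch under stated assumptions: it runs on a Linux node with enough privileges to read other users' /proc/<pid>/io files, the function names and the threshold are hypothetical, and mapping a flagged PID back to a Torque job is left out. It uses the rchar and wchar fields, which count bytes passed through read and write system calls (including data served from the page cache), so it overestimates real disk or NFS traffic.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = int(value)
    return result


def sample_all():
    """Return {pid: counters} for every process whose io file is readable."""
    samples = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open(f"/proc/{name}/io") as f:
                samples[int(name)] = parse_proc_io(f.read())
        except OSError:
            continue  # process exited, or permission denied
    return samples


def io_rate(before, after, seconds):
    """Bytes per second moved through read/write syscalls between samples."""
    moved = (after["rchar"] - before["rchar"]) + (after["wchar"] - before["wchar"])
    return moved / seconds


def heavy_pids(before, after, seconds, threshold):
    """Sorted PIDs present in both samples whose rate exceeds the threshold."""
    common = before.keys() & after.keys()
    return sorted(
        pid for pid in common
        if io_rate(before[pid], after[pid], seconds) > threshold
    )
```

A typical use would be to call `sample_all()`, sleep for a fixed interval, call it again, and pass both results to `heavy_pids()` with that interval and a threshold in bytes per second chosen by experiment.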