Fwd: Fwd: Sudden File System Corruption
Mike Dacre
mike.dacre at gmail.com
Fri Dec 6 17:15:29 CST 2013
---------- Forwarded message ----------
From: Mike Dacre <mike.dacre at gmail.com>
Date: Fri, Dec 6, 2013 at 2:14 PM
Subject: Re: Fwd: Sudden File System Corruption
To: stan at hardwarefreak.com
On Fri, Dec 6, 2013 at 12:58 AM, Stan Hoeppner <stan at hardwarefreak.com> wrote:
> On 12/5/2013 9:58 AM, Mike Dacre wrote:
>
> > On Thu, Dec 5, 2013 at 12:10 AM, Stan Hoeppner <stan at hardwarefreak.com> wrote:
> >> On 12/4/2013 8:55 PM, Mike Dacre wrote:
> >> ...
> >
> > Definitely RAID6
> >
> >> 2. Strip size? (eg 512KB)
> >
> > 64KB
>
> Ok, so 64*14 = 896KB stripe. This seems pretty sane for a 14 spindle
> parity array and mixed workloads.
>
> >> 4. BBU module?
> >
> > Yes. iBBU, state optimal, 97% charged.
> >
> >> 5. Is write cache enabled?
> >
> > Yes: Cached IO and Write Back with BBU are enabled.
>
> I should have pointed you to this earlier:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> but we've got most of it already. We don't have your fstab mount
> options. Please provide that.
>
UUID=a58bf1db-0d64-4a2d-8e03-aad78dbebcbe /science xfs defaults,inode64 1 0
On the slave nodes, I managed to reduce the demand on the disks by adding
the actimeo=60 mount option. Before doing this I would sometimes see the
disk bogged down by enormous numbers of getattr requests. Here is the
fstab entry on the nodes:
192.168.2.1:/science /science nfs defaults,vers=3,nofail,actimeo=60,bg,hard,intr,rw 0 0
...
> > This is also attached as xfs_info.txt
>
> You're not aligning XFS to the RAID geometry (unless you're overriding
> in fstab). No alignment is fine for small (<896KB) file
> allocations, but it is less than optimal for large streaming allocation writes.
> But it isn't a factor in the problems you reported.
>
>
Correct, I am not consciously aligning XFS to the RAID geometry; I
actually didn't know that was possible.
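For my own notes (please correct me if I have this wrong): with a 64KB strip
and 14 data spindles, my understanding is the alignment could be set at mkfs
time or overridden at mount time roughly as below. /dev/sdX is a placeholder
and I have not actually tried this on the array.

# At mkfs time (destructive -- would mean recreating the filesystem):
mkfs.xfs -d su=64k,sw=14 /dev/sdX
# Or override at mount time, in 512-byte units (64KB strip = 128 sectors):
mount -o sunit=128,swidth=1792 /dev/sdX /science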
> ...
> > Good point. These happened while trying to ls. I am not sure why I can't
> > find them in the log; they printed out to the console as 'Input/Output'
> > errors, simply stating that the ls command failed.
>
> We look for SCSI IO errors preceding an XFS error as a causal indicator.
> I didn't see that here. You could have run into the bug Ben described
> earlier. I can't really speak to the console errors.
>
> >> With delaylog enabled, which I believe it is in RHEL/CentOS 6, a single
> >> big rm shouldn't kill the disks. But with the combination of other
> >> workloads it seems you may have been seeking the disks to death.
> >>
> > That is possible; workloads can get really high sometimes. I am not sure
> > how to control that without significantly impacting performance - I want a
> > single user to be able to use 98% of IO capacity sometimes... but other
> > times I want the load to be split amongst many users.
>
> You can't control the seeking at the disks. You can only schedule
> workloads together that don't compete for seeks. And if you have one
> metadata or random read/write heavy workload, with this SATA RAID6
> array, it will need exclusive access for the duration of execution, or
> the portion that does all the random IO. Otherwise other workloads
> running concurrently will crawl while competing for seek bandwidth.
>
> > Also, each user can
> > execute jobs simultaneously on 23 different computers, each accessing the
> > same drive via NFS. This is a great system most of the time, but
> > sometimes the workloads on the drive get really high.
>
> So it's a small compute cluster using NFS over Infiniband for shared
> file access to a low performance RAID6 array. The IO resource sharing
> is automatic. But AFAIK there's no easy way to enforce IO quotas on
> users or processes, if at all. You may simply not have sufficient IO to
> go around. Let's ponder that.
>
I have tried a few things to improve IO allocation. BetterLinux has a
cgroup control suite that allows on-the-fly, user-level IO adjustments;
however, I found it quite cumbersome.
I considered an ugly hack in which I would run two NFS servers, one on the
network to the login node and one on the network to the other nodes, so
that I could use cgroups to limit IO by process, effectively guaranteeing
a 5% IO capacity window to the login node even if the compute nodes were
all going crazy. I quickly came to the conclusion that I don't know enough
about filesystems, NFS, or the Linux kernel to do this effectively: I would
almost certainly just make an ugly mess that broke a lot of things without
solving the problem. I still think it is a good idea in principle; I just
recognize that it would need to be implemented by someone with a lot more
experience than me, and that it would probably be a major undertaking.
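For the record, the kind of per-process throttle I had in mind is sketched
below using the blkio controller. This is untested on my part; the group
name, device major:minor, limits, and $NFS_PID are all made up, and on
CentOS 6 the cgroup hierarchy usually lives under /cgroup rather than
/sys/fs/cgroup. Part of the reason I backed away is that the kernel nfsd
threads don't obviously fit this per-process model anyway.

# Create a cgroup for the hypothetical compute-node-facing NFS instance
mkdir /cgroup/blkio/nfs_compute
# Cap its reads and writes to the array at ~50 MB/s (8:0 = device major:minor)
echo "8:0 52428800" > /cgroup/blkio/nfs_compute/blkio.throttle.read_bps_device
echo "8:0 52428800" > /cgroup/blkio/nfs_compute/blkio.throttle.write_bps_device
# Move that server's processes into the group by PID
echo $NFS_PID > /cgroup/blkio/nfs_compute/tasks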
> Looking at the math, you currently have approximately 14*150=2100
> seeks/sec capability with 14x 7.2k RPM data spindles. That's less than
> 100 seeks/sec per compute node, i.e. each node is getting about 2/3rd of
> the performance of a single SATA disk from this array. This simply
> isn't sufficient for servicing a 23 node cluster, unless all workloads
> are compute bound, and none IO/seek bound. Given the overload/crash
> that brought you to our attention, I'd say some of your workloads are
> obviously IO/seek bound. I'd say you probably need more/faster disks.
> Or you need to identify which jobs are IO/seek heavy and schedule them
> so they're not running concurrently.
>
Yes, this is a problem. We sadly lack the resources to do much better than
this; we have recently been adding extra storage by just chaining together
USB3 drives with RAID and LVM... which is cumbersome and slow, but cheaper.
My current solution is to be on the alert for high-IO jobs and move them to
a specific Torque queue that limits the number of concurrent jobs. This
works, but I have not found a way to do it automatically. Thankfully, with
a 12-member lab, it is actually not terribly complex to handle, but I would
definitely prefer a more comprehensive solution. I don't doubt that the
huge IO and seek demands we put on these disks will cause more problems in
the future.
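For reference, the throttled queue itself is just a few qmgr settings; the
queue name and concurrency limit below are illustrative rather than exactly
what we run.

# Create an execution queue for IO-heavy jobs and cap how many run at once
qmgr -c "create queue highio queue_type=execution"
qmgr -c "set queue highio max_running = 4"
qmgr -c "set queue highio enabled = true"
qmgr -c "set queue highio started = true"
# Users then submit IO-heavy work with: qsub -q highio job.sh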
> ...
> >>
> http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
> >>
> >> "As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much
> >> of the parallelization in XFS."
> ...
> >> echo deadline > /sys/block/sda/queue/scheduler
> >>
> > Wow, this is huge; I can't believe I missed that. I have switched it to
> > noop now as we use write caching. I have been trying to figure out for a
> > while why I would keep getting timeouts when the NFS load was high. If
> > you have any other suggestions for how I can improve performance, I would
> > greatly appreciate it.
>
> This may not fix NFS timeouts entirely but it should help. If the NFS
> operations are seeking the disks to death you may still see timeouts.
>
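(Side note for the archives: to make the noop setting survive a reboot,
something like the lines below should work on CentOS 6. sda just follows
the earlier example; our actual array device may differ.)

# /etc/rc.local -- reapply the scheduler at boot
echo noop > /sys/block/sda/queue/scheduler
# Alternatively, set it globally via the kernel command line in grub.conf:
#   kernel ... elevator=noop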
> >> This one simple command line may help pretty dramatically, immediately,
> >> assuming your hardware array parameters aren't horribly wrong for your
> >> workloads, and your XFS alignment correctly matches the hardware
> >> geometry.
> >>
> > Great, thanks. Our workloads vary considerably as we are a biology
> > research lab; sometimes we do lots of seeks, other times we are almost
> > maxing out read or write speed with massively parallel processes all
> > accessing the disk at the same time.
>
> Do you use munin or something similar? Sample output:
>
> http://demo.munin-monitoring.org/munin-monitoring.org/demo.munin-monitoring.org/index.html#disk
>
> Project page:
> http://munin-monitoring.org/
I have been using Ganglia, but it doesn't have good NFS monitoring as far
as I can tell. I will check out Munin; thanks for the advice.
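In case it is useful to anyone else reading, my understanding is that the
node agent setup on CentOS 6 is roughly the following; the package comes
from EPEL and the plugin names may differ by Munin version.

# On the NFS server and each compute node
yum install munin-node
# Allow the munin collector host in /etc/munin/munin-node.conf, then:
chkconfig munin-node on
service munin-node start
# Enable the NFS plugins if they were not auto-detected
ln -s /usr/share/munin/plugins/nfsd /etc/munin/plugins/nfsd
ln -s /usr/share/munin/plugins/nfs_client /etc/munin/plugins/nfs_client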
> It also has an NFS module and many others. The storage oriented metrics
> may be very helpful to you. You would install munin-node on the NFS
> server and all compute nodes, and munin on a collector/web server. This
> will allow you to cross reference client and server NFS loads. You can
> then cross reference the time in your PBS logs to see which users were
> running which jobs when IO spikes occur on the NFS server. You'll know
> exactly which workloads, or combination thereof, are causing IO spikes.
>
> --
> Stan
>
-Mike