
Fwd: Sudden File System Corruption

To: xfs@xxxxxxxxxxx
Subject: Fwd: Sudden File System Corruption
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Thu, 5 Dec 2013 07:58:06 -0800
In-reply-to: <CAPd9ww9hsOFK6pxqRY-YtLLAkkJHCuSi1BaM4n9=2XTjNVAn2Q@xxxxxxxxxxxxxx>
References: <CAPd9ww_qT9J_Rt04g7+OApoBeggNOyWNwD+57DiDTuUvz-O-0g@xxxxxxxxxxxxxx> <52A03513.6030408@xxxxxxxxxxxxxxxxx> <CAPd9ww9hsOFK6pxqRY-YtLLAkkJHCuSi1BaM4n9=2XTjNVAn2Q@xxxxxxxxxxxxxx>
Hi Stan,

On Thu, Dec 5, 2013 at 12:10 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
On 12/4/2013 8:55 PM, Mike Dacre wrote:
> I have a 16 2TB drive RAID6 array powered by an LSI 9240-4i. It has an XFS.

It's a 9260-4i, not a 9240, a huge difference. I went digging through
your dmesg output because I knew the 9240 doesn't support RAID6. A few
questions. What is the LSI RAID configuration?
You are right, sorry. 9260-4i

1. Level -- confirm RAID6
Definitely RAID6

2. Strip size? (eg 512KB)

3. Stripe size? (eg 7168KB, 14*512)
Not sure how to get this

4. BBU module?
Yes. iBBU, state optimal, 97% charged.

5. ÂIs write cache enabled?

Yes: Cached IO and Write Back with BBU are enabled.

I have also attached an adapter summary (megaraid_adp_info.txt) and a virtual and physical drive summary (megaraid_drive_info.txt).
What is the XFS geometry?

5. xfs_info /dev/sda

`xfs_info /dev/sda1`
meta-data=                       isize=256    agcount=26, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=6835404288, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

This is also attached as xfs_info.txt
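
The sunit=0 and swidth=0 values in the output above mean the filesystem was created without stripe alignment. The arithmetic behind a full stripe width can be sketched as below. This is only an illustration: the 512KB strip size is the example figure from the question, not a value read from this controller, and `raid6_geometry` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: derive stripe geometry for a RAID6 array.
# Usage: raid6_geometry DRIVES STRIP_KB
raid6_geometry() {
    drives=$1
    strip_kb=$2
    data_disks=$((drives - 2))            # RAID6 uses two drives' worth of space for parity
    stripe_kb=$((strip_kb * data_disks))  # full stripe width in KB
    sunit=$((strip_kb * 2))               # XFS mount-option units are 512-byte sectors
    swidth=$((sunit * data_disks))
    echo "data_disks=$data_disks stripe_kb=$stripe_kb su=${strip_kb}k sw=$data_disks sunit=$sunit swidth=$swidth"
}

# Assumed inputs: 16 drives (from this thread), 512KB strip (example value only).
raid6_geometry 16 512
```

The su/sw pair is the form accepted by `mkfs.xfs -d su=...,sw=...`, while sunit/swidth are the 512-byte-sector form used by the XFS mount options. The real strip size has to come from the controller configuration before any of these numbers are used.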

A combination of these being wrong could very well be part of your

> IO errors when any requests were made. This happened while it was being

I didn't see any IO errors in your dmesg output. None.

Good point. These happened while trying to ls. I am not sure why I can't find them in the log; they printed out to the console as 'Input/Output' errors, simply stating that the ls command failed.
> accessed by 5 different users, one was doing a very large rm operation (rm
> *sh on thousands of files in a directory). Also, about 30 minutes before
> we had connected the globus connect endpoint to allow easy file transfers
> to SDSC.

With delaylog enabled, which I believe it is in RHEL/CentOS 6, a single
big rm shouldn't kill the disks. But with the combination of other
workloads it seems you may have been seeking the disks to death.
That is possible, workloads can get really high sometimes. I am not sure how to control that without significantly impacting performance - I want a single user to be able to use 98% IO capacity sometimes... but other times I want the load to be split amongst many users. Also, each user can execute jobs simultaneously on 23 different computers, each accessing the same drive via NFS. This is a great system most of the time, but sometimes the workloads on the drive get really high.

> In the end, I successfully repaired the filesystem with `xfs_repair -L
> /dev/sda1`. However, I am nervous that some files may have been corrupted.

I'm sure your users will let you know. I'd definitely have a look in
the directory that was targeted by the big rm operation which apparently
didn't finish when XFS shut down.

> Do any of you have any idea what could have caused this problem?

Yes. A few things. The first is this, and it's a big one:

Dec  4 18:15:28 fruster kernel: io scheduler noop registered
Dec  4 18:15:28 fruster kernel: io scheduler anticipatory registered
Dec  4 18:15:28 fruster kernel: io scheduler deadline registered
Dec  4 18:15:28 fruster kernel: io scheduler cfq registered (default)


"As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much
of the parallelization in XFS."

*Never* use the CFQ elevator with XFS, and never with a high performance
storage system. In fact, IMHO, never use CFQ period. It was horrible
even before 3.2.12. It is certain that CFQ is playing a big part in
your 120s timeouts, though it may not be solely responsible for your IO
bottleneck. Switch to deadline or noop immediately, deadline if LSI
write cache is disabled, noop if it is enabled. Execute this manually
now, and add it to a startup script and verify it is being set at
startup, as it's not permanent:

echo deadline > /sys/block/sda/queue/scheduler
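
Since the setting does not survive a reboot, it helps to check which elevator is actually active. The kernel marks the active scheduler with square brackets in the same sysfs file. A minimal sketch, assuming the device is `sda` as in this thread; the function name `active_scheduler` is only illustrative.

```shell
#!/bin/sh
# Print the active scheduler from a line such as "noop anticipatory deadline [cfq]".
# The active entry is the one the kernel wraps in square brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Assumed device name; adjust to match the array's block device.
dev=sda
sched_file="/sys/block/$dev/queue/scheduler"

# Only report when the sysfs file is present and readable on this machine.
if [ -r "$sched_file" ]; then
    echo "$dev: $(active_scheduler "$(cat "$sched_file")")"
fi
```

Running this from the same startup script, right after the echo line, gives a quick confirmation in the boot log that the change took effect.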

Wow, this is huge, I can't believe I missed that. I have switched it to noop now as we use write caching. I have been trying to figure out for a while why I would keep getting timeouts when the NFS load was high. If you have any other suggestions for how I can improve performance, I would greatly appreciate it.
This one simple command line may help pretty dramatically, immediately,
assuming your hardware array parameters aren't horribly wrong for your
workloads, and your XFS alignment correctly matches the hardware geometry.

Great, thanks. Our workloads vary considerably as we are a biology research lab: sometimes we do lots of seeks, other times we are almost maxing out read or write speed with massively parallel processes all accessing the disk at the same time.


Attachment: megaraid_adp_info.txt
Description: Text document

Attachment: megaraid_drive_info.txt
Description: Text document

Attachment: xfs_info.txt
Description: Text document
