Re: easily reproducible filesystem crash on rebuilding array

To: Emmanuel Florac <eflorac@xxxxxxxxxxxxxx>
Subject: Re: easily reproducible filesystem crash on rebuilding array
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 17 Dec 2014 06:58:15 +1100
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20141216123405.111c7ac0@xxxxxxxxxxxxxxxxxxxx>
References: <20141211123936.1f3d713d@xxxxxxxxxxxxxxxxxxxx> <20141215130715.4dfaaa8e@xxxxxxxxxxxxxxxxxxxx> <20141215132500.13210fdb@xxxxxxxxxxxxxxxxxxxx> <20141215201036.GQ24183@dastard> <20141216123405.111c7ac0@xxxxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, Dec 16, 2014 at 12:34:05PM +0100, Emmanuel Florac wrote:
> The RAID hardware is an adaptec 71685 running the latest firmware
> (32033). This is a 16-drive RAID-6 array of 4 TB HGST drives. The
> problem occurs repeatedly with any combination of 7xx5 controllers and
> 3 or 4 TB HGST drives in RAID-6 of various types, with XFS or JFS (it
> never occurs with either ext4 or reiserfs).

Do you have systems with any other type of 3/4TB drives in them?

> As I mentioned, when the disk drives cache is on the corruption is
> serious. With disk cache off, the corruption is minimal, however the
> filesystem shuts down.

That really sounds like a hardware problem - maybe with the disk
drives themselves, not necessarily the controller.

> The filesystem has been primed with a few (23) terabytes of mixed data
> with small (a few KB or less), medium, and big (a few gigabytes or
> more) files. Two simultaneous, long-running copies are made (cp -a
> somedir someotherdir), while three simultaneous, long-running read
> operations are run (md5sum -c mydir.md5 mydir), while the array is
> busy rebuilding. Disk usage (as reported by iostat -mx 5) stays solidly
> at 100%, with a continuous throughput of a few hundred megabytes per
> second. The full test runs for about 12 hours (when not failing), and
> ends up copying 6 TB or so, and md5summing 12 TB or so.
> > I'd start with upgrading the firmware on your RAID controller and
> > turning the XFS error level up to 11....
> The firmware is the latest available. How do I turn logging to 11
> please ?

# echo 11 > /proc/sys/fs/xfs/error_level
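
That takes effect immediately but resets at reboot. If you want it to
survive reboots, the same knob is reachable through sysctl as
fs.xfs.error_level, e.g.:

```shell
# Set the XFS error reporting level at runtime (same effect as the
# echo into /proc above); needs root and an XFS-enabled kernel.
sysctl -w fs.xfs.error_level=11

# To persist across reboots, add this line to /etc/sysctl.conf (or a
# file in /etc/sysctl.d/) and reload:
#   fs.xfs.error_level = 11
sysctl -p
```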


Dave Chinner
