
Re: I/O hang, possibly XFS, possibly general

To: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Subject: Re: I/O hang, possibly XFS, possibly general
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Sun, 5 Jun 2011 09:10:32 +1000
Cc: Paul Anderson <pha@xxxxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs-oss <xfs@xxxxxxxxxxx>
In-reply-to: <4DEA2106.5000900@xxxxxxxxxxxxxxxxx>
References: <BANLkTim_BCiKeqi5gY_gXAcmg7JgrgJCxQ@xxxxxxxxxxxxxx> <20110603004247.GA28043@xxxxxxxxxxxxx> <20110603013948.GX561@dastard> <BANLkTi=FjSzSZJXGofVjtiUe2ZNvki2R-Q@xxxxxxxxxxxxxx> <4DE9E97D.30500@xxxxxxxxxxxxxxxxx> <20110604103247.GG561@dastard> <4DEA2106.5000900@xxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Sat, Jun 04, 2011 at 07:11:50AM -0500, Stan Hoeppner wrote:
> On 6/4/2011 5:32 AM, Dave Chinner wrote:
> > On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote:
> >> On 6/3/2011 10:59 AM, Paul Anderson wrote:
> >>> Not sure what I can do about the log - man page says xfs_growfs
> >>> doesn't implement log moving.  I can rebuild the filesystems, but for
> >>> the one mentioned in this thread, this will take a long time.
> >>
> >> See the logdev mount option.  Using two mirrored drives was recommended,
> >> I'd go a step further and use two quality "consumer grade", i.e. MLC
> >> based, SSDs, such as:
> >>
> >> http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx
> >>
> >> Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive.
> > 
> > If you are using delayed logging, then a pair of mirrored 7200rpm
> > SAS or SATA drives would be sufficient for most workloads as the log
> > bandwidth rarely gets above 50MB/s in normal operation.
>
> Hi Dave.  I made the first reply to Paul's post, recommending he enable
> delayed logging as a possible solution to his I/O hang problem.  I
> recommended this due to his mention of super heavy metadata operations
> at the time on his all md raid60 on plain HBA setup.  Paul did not list
> delaylog when he submitted his mount options:
> inode64,largeio,logbufs=8,noatime
>
> Being the author of the delayed logging code, I had expected you to
> comment on this, either expounding on my recommendation, or shooting it
> down, and giving the reasons why.
>
> So, would delayed logging have possibly prevented his hang problem or
> no?  I always read your replies at least twice, and I don't recall you
> touching on delayed logging in this thread.  If you did and I missed it,
> my apologies.

It might, but delayed logging is not the solution to every problem,
and NFS servers are notoriously heavy on log forces due to COMMIT
operations during writes. So it's a good bet that delayed logging
won't fix the problem entirely.

> > If you have fsync heavy workloads, or are not using delayed logging,
> > then you really need to use the RAID5/6 device behind a BBWC because
> > the log is -seriously- bandwidth intensive. I can drive >500MB/s of
> > log throughput on metadata intensive workloads on 2.6.39 when not
> > using delayed logging or I'm regularly forcing the log via fsync.
> > You sure as hell don't want to be running a sustained long term
> > write load like that on consumer grade SSDs.....
>
> Given that the max log size is 2GB, IIRC, and that most recommendations
> I've seen here are against using a log that big, I figure such MLC
> drives would be fine.  AIUI, modern wear leveling will spread writes
> throughout the entire flash array before going back and over writing the
> first sector.  Published MTBF rates on most MLC drives are roughly
> equivalent to enterprise SRDs, 1+ million hours.
>
> Do you believe MLC based SSDs are simply never appropriate for anything
> but consumer use, and that only SLC devices should be used for real
> storage applications?  AIUI SLC flash cells do have about a 10:1 greater
> lifetime than MLC cells.  However, there have been a number of
> articles/posts demonstrating math which shows a current generation
> SandForce based MLC SSD, under a constant 100MB/s write stream, will run
> for 20+ years, IIRC, before sufficient live+reserved spare cells burn
> out to cause hard write errors, thus necessitating drive replacement.
> Under your 500MB/s load, assuming that's constant, the drives would
> theoretically last 4+ years.  If that 500MB/s load was only for 12 hours
> each day, the drives would last 8+ years.  I wish I had one of those
> articles bookmarked...

That's the theory, anyway. Let's call it an expected 4 year life
cycle under this workload (which is highly optimistic, IMO). Now you
have two drives in RAID1, which means one will fail in 2 years, or if
you need more drives to sustain the performance the log needs (*)
you might be looking at 4 or more drives, and that brings the expected
time between failures down to under a year. Multiply that across
5-10 servers, and that's a drive failure every month just on the log
devices.
That failure rate would make me extremely nervous - losing the log
is a -major- filesystem corruption event - and make me want to spend
more money or change the config to reduce the risk of a double
failure causing the log device to be lost. Especially if there are
hundreds of terabytes of data at risk.



(*) You have to consider that sustained workloads mean that the
drives don't get idle time to trigger background garbage collection,
which is one of the key features that current consumer level drives
rely on for maintaining performance and even wear levelling. The
"spare" area in the drives is kept small because it is assumed that
there won't be long term sustained IO so that the garbage collection
can clean up before spare area is exhausted.

Enterprise drives have a much larger relative percentage of flash in
the drive reserved as spare to avoid severe degradation in such
sustained (common enterprise) workloads.  Hence performance on
consumer MLC drives tails off much more quickly than SLC drives.

As a result, performance on consumer MLC drives may not be
sustainable, and wear leveling may not be optimal, resulting in flash
failure earlier than you expect.  You'll need more MLC drives to
maintain baseline performance, and with more drives, the chance of
failure goes up...

Dave Chinner
