
Re: I/O hang, possibly XFS, possibly general

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: I/O hang, possibly XFS, possibly general
From: Paul Anderson <pha@xxxxxxxxx>
Date: Fri, 3 Jun 2011 11:59:02 -0400
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs-oss <xfs@xxxxxxxxxxx>
In-reply-to: <20110603013948.GX561@dastard>
References: <BANLkTim_BCiKeqi5gY_gXAcmg7JgrgJCxQ@xxxxxxxxxxxxxx> <20110603004247.GA28043@xxxxxxxxxxxxx> <20110603013948.GX561@dastard>
Sender: powool@xxxxxxxxx
On Thu, Jun 2, 2011 at 9:39 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote:
>> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote:
>> > This morning, I had a symptom of an I/O throughput problem in which
>> > dirty pages appeared to be taking a long time to write to disk.
>> >
>> > The system is a large x64 192GiB dell 810 server running from
>> > kernel.org - the basic workload was data intensive - concurrent large
>> > NFS (with high metadata/low filesize), rsync/lftp (with low
>> > metadata/high file size) all working in a 200TiB XFS volume on a
>> > software MD raid0 on top of 7 software MD raid6, each w/18 drives.  I
>> > had mounted the filesystem with inode64,largeio,logbufs=8,noatime.
>> A few comments on the setup before trying to analyze what's going on in
>> detail.  I'd absolutely recommend an external log device for this setup,
>> that is, buy another two fast but small disks, or take two existing ones,
>> and use a RAID 1 for the external log device.  This will speed up
>> anything log intensive, which both the NFS and rsync workloads are.
>> Second, if you can, split the workloads into multiple volumes, since you
>> have two such different workloads, so that they don't interfere with
>> each other.
>> Third, a RAID0 on top of RAID6 volumes sounds like a pretty worst case
>> for almost any type of I/O.  You end up doing even relatively small I/O
>> to all of the disks in the worst case.  I think you'd be much better
>> off with a simple linear concatenation of the RAID6 devices, even if you
>> can split them into multiple filesystems.
>> > The specific symptom was that 'sync' hung, a dpkg command hung
>> > (presumably trying to issue fsync), and experimenting with "killall
>> > -STOP" or "kill -STOP" of the workload jobs didn't let the system
>> > drain I/O enough to finish the sync.  I probably did not wait long
>> > enough, however.
>> It really sounds like you're simply killing the MD setup with a
>> lot of log I/O that goes to all the devices.
> And this is one of the reasons why I originally suggested that
> storage at this scale really should be using hardware RAID with
> large amounts of BBWC to isolate the backend from such problematic
> IO patterns.

> Dave Chinner
> david@xxxxxxxxxxxxx

Good HW RAID cards are on order - they seem to be backordered at least a
few weeks now at CDW.  Got the batteries immediately.

That will give more options for test and deployment.

Not sure what I can do about the log - the man page says xfs_growfs
doesn't implement log moving.  I can rebuild the filesystems, but for
the one mentioned in this thread, this will take a long time.

I'm guessing we'll need to split out the workload - aside from the
differences in file size and use patterns, they also have
fundamentally different values (the high metadata dataset happens to
be high value relative to the low metadata/large file dataset).
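
For a rough sense of the geometry discussed above (RAID0 across 7
RAID6 arrays of 18 drives each), here is a back-of-the-envelope sketch.
The array and drive counts come from this thread; the 64 KiB chunk size,
the choice of a RAID0 chunk equal to one RAID6 full stripe, and the
function name geometry() are assumptions made only for illustration.

```python
# Back-of-the-envelope geometry for the array described in this thread.
# ASSUMPTION: the MD chunk size is not stated in the thread; 64 KiB is a
# hypothetical value. ASSUMPTION: the RAID0 chunk equals one RAID6 full stripe.

RAID6_ARRAYS = 7          # from the thread
DRIVES_PER_ARRAY = 18     # from the thread
RAID6_PARITY_DRIVES = 2   # RAID6 spends two drives' worth of space on parity
CHUNK_KIB = 64            # hypothetical chunk size


def geometry(arrays=RAID6_ARRAYS, drives=DRIVES_PER_ARRAY, chunk_kib=CHUNK_KIB):
    """Return drive counts and full-stripe sizes (in KiB) for the layout."""
    data_drives = drives - RAID6_PARITY_DRIVES
    raid6_stripe_kib = data_drives * chunk_kib
    return {
        "total_drives": arrays * drives,
        "data_drives": arrays * data_drives,
        # Smallest aligned write that avoids read-modify-write on one RAID6.
        "raid6_full_stripe_kib": raid6_stripe_kib,
        # Smallest aligned write that is a full stripe on every RAID6 at once.
        "raid0_full_stripe_kib": arrays * raid6_stripe_kib,
    }


if __name__ == "__main__":
    for name, value in geometry().items():
        print(f"{name}: {value}")
```

Under those assumptions, anything smaller than a full RAID6 stripe
(which log writes generally are) lands as a partial-stripe update with
a parity read-modify-write, which fits the advice about moving the log
to its own RAID 1 device.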

