
Re: Storage server, hung tasks and tracebacks

To: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Subject: Re: Storage server, hung tasks and tracebacks
From: Brian Candler <B.Candler@xxxxxxxxx>
Date: Sun, 20 May 2012 17:35:06 +0100
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20120515140237.GA3630@xxxxxxxx>
References: <20120502184450.GA2557@xxxxxxxx> <4FA27EF8.6040002@xxxxxxxxxxxxxxxxx> <20120503204157.GC4387@xxxxxxxx> <4FA3047D.8060908@xxxxxxxxxxxxxxxxx> <20120504163237.GA6128@xxxxxxxx> <4FA4C321.2070105@xxxxxxxxxxxxxxxxx> <20120515140237.GA3630@xxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)

Another update:

I have been trying various combinations to see under what circumstances
I can make things lock up.

The main discovery: using ext4 instead of xfs, I have not been able to get
the server to lock up, at least not in 36 hours of continuous testing. With
xfs and everything else identical, it typically locks up within 10 minutes.

This is not to say that xfs is at fault. It may be that xfs generates a
higher peak load of I/O ops or something similar, and that tickles the
problem.  In any case I see a mixture of unkillable processes: not only
bonnie++ and xfsaild, but also kswapd, kworker, irqbalance, and even postfix
processes (which should not be touching the 24-disk array at all; there is a
separate system disk directly connected to the motherboard's own SATA
controller).

The test runs four concurrent bonnie++ instances, each in its own screen
session.
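
For anyone wanting to reproduce a similar load, here is a minimal sketch of
starting several bonnie++ runs at once. It is not the setup used above (that
was four screen sessions started by hand); it launches the processes directly
from Python. The per-session directory names and the function names are
hypothetical, and only the bonnie++ -d (directory) and -u (user) options are
used.

```python
import subprocess
from pathlib import Path


def build_commands(base_dir, sessions=4, user="root"):
    """Return one bonnie++ command line per session, each with its own directory."""
    commands = []
    for i in range(sessions):
        workdir = Path(base_dir) / f"bonnie{i}"  # hypothetical directory naming
        commands.append(["bonnie++", "-d", str(workdir), "-u", user])
    return commands


def run_all(base_dir, sessions=4, user="root"):
    """Start all sessions at once and wait for them; needs bonnie++ installed."""
    procs = []
    for cmd in build_commands(base_dir, sessions, user):
        Path(cmd[2]).mkdir(parents=True, exist_ok=True)  # cmd[2] is the -d value
        procs.append(subprocess.Popen(cmd))
    # Exit codes, in launch order.
    return [p.wait() for p in procs]
```

Calling run_all("/mnt/array") would start four runs under /mnt/array, where
the mount point is again only an example.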

Some of the tests performed:

- 24 SATA disks, LSI HBAs, md RAID0, XFS: rapid lockup

- 24 SATA disks, LSI HBAs, md RAID0, ext4: no lockup seen so far

- 2 SATA disks, LSI HBAs, md RAID0, XFS: no lockup

- 1 system SATA disk, motherboard SATA, no RAID, ext4: no lockup

I also wrote a Ruby script to do lots of concurrent dd reads (at random
offsets) directly from the array. I wasn't able to replicate the problem
with that.
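
The script itself is not included here. As a rough illustration of that kind
of read-only test, the following sketch does concurrent reads at random
block-aligned offsets. It is written in Python rather than Ruby and uses
os.pread instead of spawning dd, so it is an approximation under those
assumptions; the block size, read count, worker count and function names are
all made up for the example.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor


def random_reads(path, block_size=1024 * 1024, reads=100, seed=None):
    """Read `reads` blocks at random block-aligned offsets; return bytes read."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or a block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        blocks = max(size // block_size, 1)
        total = 0
        for _ in range(reads):
            offset = rng.randrange(blocks) * block_size
            total += len(os.pread(fd, block_size, offset))
        return total
    finally:
        os.close(fd)


def concurrent_random_reads(path, workers=8, **kwargs):
    """Run `workers` independent random readers in threads; return total bytes read."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(random_reads, path, **kwargs) for _ in range(workers)]
        return sum(f.result() for f in futures)
```

Pointed at the md device (for example /dev/md127, as root), this generates
read traffic only and goes through the page cache, so it exercises a
different path from the bonnie++ write-heavy phases.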

This is with Seagate 7200rpm drives, and the total I/O bandwidth I can see
is substantial (see iostat below).  I can also replicate the problem in a
similar system with Hitachi "coolspin" (5940rpm?) drives, but it seems to
take somewhat longer, maybe an hour or two, so perhaps the peak I/O rate has
something to do with it?

(These systems have only 8GB RAM, so I also wondered if it was something
to do with deadlocking when allocating buffer space if not enough was
available.)

Regards,

Brian.


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.55    0.00   80.67   12.63    0.00    5.15

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.40         0.00         2.40          0         12
sdf             187.80     17817.60    109416.80      89088     547084
sde             182.60     17817.60    112051.20      89088     560256
sdd             183.00     17817.60    108800.00      89088     544000
sdc             167.00     17612.80    105840.80      88064     529204
sdg             162.80     17612.80    107735.20      88064     538676
sdh             180.00     18022.40    112230.40      90112     561152
sdp             168.20     17408.00    107929.60      87040     539648
sdj             179.60     17614.40    111346.40      88072     556732
sdq             174.20     17408.00    108544.00      87040     542720
sdk             201.60     17612.80    111206.40      88064     556032
sdb             189.20     17819.20    108800.00      89096     544000
sdl             195.60     17542.40    110387.20      87712     551936
sdo             196.00     17408.00    111206.40      87040     556032
sdm             200.00     17408.00    110796.80      87040     553984
sdn             189.00     17408.00    108544.00      87040     542720
sdi             168.60     18022.40    112025.60      90112     560128
sdr             192.60     17819.20    111858.40      89096     559292
sdu             193.80     17612.80    108953.60      88064     544768
sdv             202.60     17612.80    108851.20      88064     544256
sdw             178.20     17612.80    108953.60      88064     544768
sdy             191.60     17612.80    110796.80      88064     553984
sdx             196.00     17612.80    111616.00      88064     558080
sds             182.80     17612.80    109158.40      88064     545792
sdt             191.60     17203.20    111219.20      86016     556096
md127          7569.80    415064.00   2620999.20    2075320   13104996
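
To compare the per-disk figures with the md127 line, a small sketch like the
following can total the kB_read/s and kB_wrtn/s columns over the member
disks. It assumes the six-column iostat layout shown above, and it assumes
that sda is the system disk and should be left out; the function name is
hypothetical.

```python
def sum_device_rates(iostat_text, prefix="sd", exclude=("sda",)):
    """Sum the kB_read/s and kB_wrtn/s columns over matching device lines."""
    read_total = 0.0
    write_total = 0.0
    for line in iostat_text.splitlines():
        fields = line.split()
        # Device lines have six columns: name, tps, two rates, two totals.
        if len(fields) != 6 or not fields[0].startswith(prefix):
            continue
        if fields[0] in exclude:  # assumed system disk, not part of the array
            continue
        read_total += float(fields[2])
        write_total += float(fields[3])
    return read_total, write_total
```

Feeding it the device lines above gives totals that can be set against the
md127 row; small differences are to be expected since the rates are averages
over the sampling interval.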
