xfs hang or slowness while removing files

Avi Kivity avi at scylladb.com
Thu Dec 3 06:02:58 CST 2015



On 12/02/2015 11:09 PM, Dave Chinner wrote:
> On Wed, Dec 02, 2015 at 01:07:56PM +0200, Avi Kivity wrote:
>> Removing a directory with ~900 32MB files, we saw this:
>>
>> [ 5645.684464] INFO: task xfsaild/md0:12247 blocked for more than
>> 120 seconds.
>> [ 5645.686488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 5645.687713] xfsaild/md0     D ffff88103f9d3680     0 12247    2
>> 0x00000080
>> [ 5645.687729]  ffff8810136f7d40 0000000000000046 ffff882026d82220
>> ffff8810136f7fd8
>> [ 5645.687732]  ffff8810136f7fd8 ffff8810136f7fd8 ffff882026d82220
>> ffff882026d82220
>> [ 5645.687734]  ffff88103f9d44c0 0000000000000001 0000000000000000
>> ffff8820285aa928
>> [ 5645.687737] Call Trace:
>> [ 5645.687747]  [<ffffffff816098d9>] schedule+0x29/0x70
>> [ 5645.687768]  [<ffffffffa06cd880>] _xfs_log_force+0x230/0x290 [xfs]
>> [ 5645.687773]  [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
>> [ 5645.687796]  [<ffffffffa06cd906>] xfs_log_force+0x26/0x80 [xfs]
>> [ 5645.687808]  [<ffffffffa06d2380>] ?
>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>> [ 5645.687818]  [<ffffffffa06d24d1>] xfsaild+0x151/0x5e0 [xfs]
>> [ 5645.687828]  [<ffffffffa06d2380>] ?
>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>> [ 5645.687831]  [<ffffffff8109727f>] kthread+0xcf/0xe0
>> [ 5645.687834]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
>> [ 5645.687837]  [<ffffffff81614358>] ret_from_fork+0x58/0x90
>> [ 5645.687852]  [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
>>
>> 'rm' did not complete, but was killable.  Nothing else was running
>> on the system at the time.
> Which means the filesystem was not hung, nor was rm blocked in XFS.
> That implies the directory/inode reads that rm does were running
> really slowly. Something else going on here.
>
>> The filesystem was mounted with the discard option set, but since
>> that is discouraged, we'll retry without it.
> Ah, yes, that could cause exactly these symptoms.
>
> I'd guess you are using storage that has unqueued TRIM operations
> (i.e. SATA 3.0 hardware somewhere in your storage path, as queued
> TRIM only came along with SATA 3.1 and AFAIA there's not a lot of
> 3.1 hardware out there yet) which means while discards are
> being issued all other IO tanks and goes really slow.
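A quick way to check whether a device advertises discard support at all is to read the request queue's sysfs attributes (a sketch; the device name sda is an example, and the SYSFS override is only there to make the helper easy to exercise off-box):

```shell
SYSFS=${SYSFS:-/sys}

# Return success if the device's request queue advertises discard
# support: discard_max_bytes > 0 means the kernel will issue TRIM
# to this device.
supports_discard() {
    f="$SYSFS/block/$1/queue/discard_max_bytes"
    [ -r "$f" ] && [ "$(cat "$f")" != "0" ]
}

if supports_discard sda; then
    echo "sda: discard supported"
else
    echo "sda: no discard support (or no such device)"
fi

# "lsblk -D" summarises discard granularity and limits per device, and
# for SATA SSDs "hdparm -I /dev/sda | grep -i trim" shows the drive's
# TRIM capabilities; queued TRIM only exists on SATA 3.1+ hardware.
```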

It's bare-metal cloud hardware, so I don't immediately know (I don't
control the machine).  I could find out -

> We have seen individual TRIM requests on some SSDs take tens of
> milliseconds to complete, regardless of their size. Hence if you
> have one of these devices and you're running thousands of TRIM
> commands across ~30GB of data being freed, then you'd see things
> like rm being really slow on the read side and log forces waiting an
> awful long time for journal IO completion processing to take
> place...
>

- but it's probably better to just drop discard and see if it happens again.
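Before retrying, it's easy to confirm which mounted filesystems still carry the option by parsing /proc/mounts; a sketch (the mountpoint names are examples, and the MOUNTS override exists only to make the helper testable). The usual batched alternative to "-o discard" is running fstrim on the mountpoint periodically:

```shell
# Return success if the given mountpoint has "discard" among its
# mount options in /proc/mounts (fields: dev mountpoint fstype opts).
has_discard_opt() {
    awk -v m="$1" '
        $2 == m {
            n = split($4, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "discard") found = 1
        }
        END { exit !found }
    ' "${MOUNTS:-/proc/mounts}"
}

if has_discard_opt /; then
    echo "/ is mounted with discard; consider remounting without it"
    echo "and trimming free space in batches with: fstrim -v /"
fi
```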
