
Re: XFS hung on 2.6.33.3 kernel

To: Ilia Mirkin <imirkin@xxxxxxxxxxxx>
Subject: Re: XFS hung on 2.6.33.3 kernel
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Sun, 18 Jul 2010 11:20:33 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <AANLkTilX3l8TbUztLStj_u9OqOZnBrsNQxmeV4DuBmYJ@xxxxxxxxxxxxxx>
References: <AANLkTilX3l8TbUztLStj_u9OqOZnBrsNQxmeV4DuBmYJ@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Sat, Jul 17, 2010 at 12:01:11AM -0400, Ilia Mirkin wrote:
> Hi XFS developers,
> 
> I ended up getting this on a 2.6.33.3 kernel. I realize this is not
> the latest, but I've only gotten this once, so it will not be possible
> to easily determine whether this is gone in the latest release or not.
> 
> The details of the situation are that there is an XFS filesystem on a
> hardware RAID10 array (directly on the device, no partition table),
> and it exclusively houses a mysql (5.0) database. The database gets
> backed up nightly, which is basically a glorified tar'ing of most of
> the partition. The non-mysql processes you see there are related to a
> mysql backup program (which coincidentally triggers WARN's in xfs
> code, as per http://marc.info/?l=linux-xfs&m=127632333825584&w=2;
> however it sounded like those WARN's were not warning about an
> in-kernel problem; curiously, the last WARN's in the kernel log
> appear to have been from the prior day's backup run, not the one that
> triggered the hang). There's also a sync that I ran as a last-ditch
> effort at saving the system. On reboot, everything came back up OK.

<sigh>

That WARN_ON() is not coincidental at all.  I'll come back to this
later.

> While the system was stuck, vmstat did not report any block-in/out on
> the device. I have rrd's of those if necessary. Let me know if I might
> be able to provide any other useful info, or if you think that this
> sort of thing should be fixed in the latest linux kernel release. I
> saw that there have been a lot of XFS updates since 2.6.33, but none
> stood out as addressing this type of situation.
> 
> Here is the output of "echo w > /proc/sysrq-trigger":
> 
>  SysRq : Show Blocked State
>    task                        PC stack   pid father
>  xfsdatad/7    D ffff88123885d3e0  6792   772      2 0x00000000
>   ffff88123887fd90 0000000000000046 0000000000000000 ffff88123887fd5c
>   0000000000000000 000000000000de78 ffff88123887ffd8 0000000000013c00
>   ffff88123887ffd8 0000000000013c00 ffff88123885d3e0 ffff881239d23d00
>  Call Trace:
>   [<ffffffff81045f9c>] ? __mod_timer+0x13d/0x14f
>   [<ffffffff814eb81c>] schedule_timeout+0x98/0xbf
>   [<ffffffff81045b27>] ? process_timeout+0x0/0xb
>   [<ffffffff814eb85c>] schedule_timeout_uninterruptible+0x19/0x1b
>   [<ffffffff81206512>] xfs_end_io+0x7d/0x8c
>   [<ffffffff8104de65>] worker_thread+0x18d/0x222
>   [<ffffffff81206495>] ? xfs_end_io+0x0/0x8c
>   [<ffffffff810517ab>] ? autoremove_wake_function+0x0/0x38
>   [<ffffffff8104dcd8>] ? worker_thread+0x0/0x222
>   [<ffffffff81051336>] kthread+0x7d/0x85
>   [<ffffffff81003814>] kernel_thread_helper+0x4/0x10
>   [<ffffffff810512b9>] ? kthread+0x0/0x85
>   [<ffffffff81003810>] ? kernel_thread_helper+0x0/0x10

Waiting on IO completion - this is probably getting EAGAIN due
to not being able to take the inode ilock....

>  mysqld        D ffff880938821380  4696  7750      1 0x00000000
>   ffff880650e01cb8 0000000000000082 00715f4f00000001 0000000000000000
>   ffff88090a0bd3e0 000000000000de78 ffff880650e01fd8 0000000000013c00
>   ffff880650e01fd8 0000000000013c00 ffff88090a0bd3e0 ffff88090a0bf260
>  Call Trace:
>   [<ffffffff814ebc5a>] __mutex_lock_common+0x139/0x1a2
>   [<ffffffff814ebcd7>] __mutex_lock_slowpath+0x14/0x16
>   [<ffffffff814ebda5>] mutex_lock+0x1e/0x35
>   [<ffffffff8120e1bf>] xfs_read+0x100/0x1f9
>   [<ffffffff8120a574>] xfs_file_aio_read+0x58/0x5c
>   [<ffffffff810d7dc4>] do_sync_read+0xc0/0x106
>   [<ffffffff8102832b>] ? pick_next_task+0x22/0x3e
>   [<ffffffff810d8813>] vfs_read+0xa8/0x102
>   [<ffffffff810d88c9>] sys_pread64+0x5c/0x7d
>   [<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Waiting on i_mutex, no other locks held.

....

>  mysqld        D ffff880938821380  4648  7763      1 0x00000000
>   ffff8801b4febbb8 0000000000000082 ffff8801b4febb58 ffff880937ea2df8
>   000000000000de78 000000000000de78 ffff8801b4febfd8 0000000000013c00
>   ffff8801b4febfd8 0000000000013c00 ffff880937ea6ac0 ffff8812116787a0
>  Call Trace:
>   [<ffffffff81027beb>] ? activate_task+0x28/0x31
>   [<ffffffff814ec6ac>] rwsem_down_failed_common+0x96/0xc8
>   [<ffffffff814ec6fe>] rwsem_down_write_failed+0x20/0x2a
>   [<ffffffff812504b3>] call_rwsem_down_write_failed+0x13/0x20
>   [<ffffffff814ec005>] ? down_write+0x2d/0x31
>   [<ffffffff811ebd14>] xfs_ilock+0x23/0x94
>   [<ffffffff8120d92f>] xfs_write+0x2c5/0x784
>   [<ffffffff81034825>] ? wake_up_state+0xb/0xd
>   [<ffffffff814ecb8e>] ? common_interrupt+0xe/0x13
>   [<ffffffff8120a51a>] xfs_file_aio_write+0x5a/0x5c
>   [<ffffffff810d7cbe>] do_sync_write+0xc0/0x106
>   [<ffffffff810d8624>] vfs_write+0xab/0x105
>   [<ffffffff810d86da>] sys_pwrite64+0x5c/0x7d
>   [<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Waiting on one of the XFS inode locks in exclusive mode here -
probably holds the i_mutex too.  Because it's an exclusive lock
we're waiting on, that will, IIRC, hold out any new readers, too.
It's probably stuck because the lock is already held in shared mode.

.....

>  mysqld        D 0000000000000002  4648  7765      1 0x00000000
>   ffff880287b4ba78 0000000000000082 0000000000000000 0000000000000001
>   0000000000000096 000000000000de78 ffff880287b4bfd8 0000000000013c00
>   ffff880287b4bfd8 0000000000013c00 ffff880937ea2dc0 ffff88123885d3e0
>  Call Trace:
>   [<ffffffff8104e062>] ? insert_work+0x92/0x9e
>   [<ffffffff8105945a>] ? ktime_get_ts+0xb1/0xbe
>   [<ffffffff814eb4b9>] io_schedule+0x7b/0xc1
>   [<ffffffff8109fc02>] sync_page+0x41/0x45
>   [<ffffffff814eb8d9>] __wait_on_bit_lock+0x45/0x8c
>   [<ffffffff8109fbc1>] ? sync_page+0x0/0x45
>   [<ffffffff8109fb84>] __lock_page+0x63/0x6a
>   [<ffffffff810517e3>] ? wake_bit_function+0x0/0x2a
>   [<ffffffff810a82de>] lock_page+0x24/0x28
>   [<ffffffff810a8364>] invalidate_inode_pages2_range+0x82/0x2af
>   [<ffffffff8120646a>] ? xfs_vm_direct_IO+0x90/0xbb
>   [<ffffffff812066a8>] ? xfs_get_blocks_direct+0x0/0x17
>   [<ffffffff812062c8>] ? xfs_end_io_direct+0x0/0x4e
>   [<ffffffff810a085d>] generic_file_direct_write+0xf5/0x14b
>   [<ffffffff8120db02>] xfs_write+0x498/0x784
>   [<ffffffff81034825>] ? wake_up_state+0xb/0xd
>   [<ffffffff8120a51a>] xfs_file_aio_write+0x5a/0x5c
>   [<ffffffff810d7cbe>] do_sync_write+0xc0/0x106
>   [<ffffffff810d8624>] vfs_write+0xab/0x105
>   [<ffffffff810d86da>] sys_pwrite64+0x5c/0x7d
>   [<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

Direct IO write waiting on page cache invalidation, which is blocked
on a locked page. i.e. it's trying to flush pages cached during
buffered IO.  Holds the inode iolock in shared mode here.

Also, it does not hold the i_mutex but is invalidating the page cache.
Remember that WARN_ON() condition I said I'd come back to?
Guess what it warns about:

        if ((ioflags & IO_ISDIRECT)) {
                if (mapping->nrpages) {
>>>>>>>                 WARN_ON(need_i_mutex == 0);
                        error = xfs_flushinval_pages(xip,
                                        (pos & PAGE_CACHE_MASK),
                                        -1, FI_REMAPF_LOCKED);
                        if (error)
                                goto out_unlock_internal;
                }

Right - it's warning about invalidating the page cache without the
i_mutex held. And FWIW, here's the comment in generic_file_direct_write()
about this invalidation during direct IO:

        /*
         * Finally, try again to invalidate clean pages which might have been
         * cached by non-direct readahead, or faulted in by get_user_pages()
         * if the source of the write was an mmap'ed region of the file
         * we're writing.  Either one is a pretty crazy thing to do,
         * so we don't support it 100%.  If this invalidation
         * fails, tough, the write still worked...
         */
        if (mapping->nrpages) {
                invalidate_inode_pages2_range(mapping,
                                              pos >> PAGE_CACHE_SHIFT, end);
        }

So the generic page cache code does not handle mixed buffered/direct
IO to the same file concurrently with a 100% coherency guarantee.
IOWs, just because the WARN_ON() is not being triggered *does not
mean that there is no problem*.


>  tar4ibd       D 0000000000000000  4552  6721   6720 0x00000004
>   ffff88016a7c3740 0000000000000086 0000000000000000 ffff88095fc13c00
>   0000000000000001 000000000000de78 ffff88016a7c3fd8 0000000000013c00
>   ffff88016a7c3fd8 0000000000013c00 ffff8806b39344a0 ffff881239d253e0
>  Call Trace:
>   [<ffffffff81034837>] ? wake_up_process+0x10/0x12
>   [<ffffffff8124c328>] ? __rwsem_do_wake+0x80/0x152
>   [<ffffffff814ec6ac>] rwsem_down_failed_common+0x96/0xc8
>   [<ffffffff81244595>] ? cfq_close_cooperator+0xd9/0x184
>   [<ffffffff814ec72e>] rwsem_down_read_failed+0x26/0x30
>   [<ffffffff81250484>] call_rwsem_down_read_failed+0x14/0x30
>   [<ffffffff814ec028>] ? down_read+0x1f/0x23
>   [<ffffffff811ebd4e>] xfs_ilock+0x5d/0x94
>   [<ffffffff811ebdae>] xfs_ilock_map_shared+0x29/0x2f
>   [<ffffffff811f231b>] xfs_iomap+0x9f/0x39c
>   [<ffffffff81239215>] ? __blk_run_queue+0x4a/0x70
>   [<ffffffff81247a91>] ? cpumask_next_and+0x2b/0x3c
>   [<ffffffff812065bd>] __xfs_get_blocks+0x71/0x15c
>   [<ffffffff812066d1>] xfs_get_blocks+0x12/0x14
>   [<ffffffff810fec23>] do_mpage_readpage+0x173/0x467
>   [<ffffffff810b2234>] ? __inc_zone_page_state+0x25/0x27
>   [<ffffffff8109fecc>] ? add_to_page_cache_locked+0x82/0xc6
>   [<ffffffff810ff04a>] mpage_readpages+0xd6/0x11c
>   [<ffffffff812066bf>] ? xfs_get_blocks+0x0/0x14
>   [<ffffffff812066bf>] ? xfs_get_blocks+0x0/0x14
>   [<ffffffff81207724>] xfs_vm_readpages+0x1a/0x1c
>   [<ffffffff810a72c4>] __do_page_cache_readahead+0x10c/0x1a2
>   [<ffffffff810a7376>] ra_submit+0x1c/0x20
>   [<ffffffff810a7606>] ondemand_readahead+0x1b2/0x1c5
>   [<ffffffff810a76f3>] page_cache_sync_readahead+0x38/0x3a
>   [<ffffffff810a0e0c>] generic_file_aio_read+0x229/0x582
>   [<ffffffff8120e262>] xfs_read+0x1a3/0x1f9
>   [<ffffffff8120a574>] xfs_file_aio_read+0x58/0x5c
>   [<ffffffff810d7dc4>] do_sync_read+0xc0/0x106
>   [<ffffffff81032879>] ? finish_task_switch+0x43/0xac
>   [<ffffffff810d8813>] vfs_read+0xa8/0x102
>   [<ffffffff810d8931>] sys_read+0x47/0x70
>   [<ffffffff81002a6b>] system_call_fastpath+0x16/0x1b

And there's the backup program, doing buffered readahead, holding a
page locked and stuck trying to get the xfs inode ilock. i.e. this
trace shows your system has tripped over the exact case that the
above code says is "a pretty crazy thing to do". IOWs, this is not
an XFS specific problem.

I can't find a thread that holds the XFS inode lock that everything
is waiting on. I think it is the ILOCK, but none of the threads in
this trace should be holding it where they are blocked. IOWs, the
output does not give me enough information to get to the root cause.

So, back to the situation with the WARN_ON(). You're running
applications that are doing something that:

        a) is not supported;
        b) compromises data integrity guarantees;
        c) is not reliably reported; and
        d) might be causing hangs

Right now I'm not particularly inclined to dig into this further;
it's obvious the applications are doing something that is not
supported (by XFS or the generic page cache code), so this is the
first thing you really need to care about getting fixed if you value
your backups...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
