On Tue, Aug 03, 2010 at 06:30:36PM -0400, Ilia Mirkin wrote:
> On Sun, Jul 18, 2010 at 7:50 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Sat, Jul 17, 2010 at 09:35:33PM -0400, Ilia Mirkin wrote:
> >> On Sat, Jul 17, 2010 at 9:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >> > On Sat, Jul 17, 2010 at 12:01:11AM -0400, Ilia Mirkin wrote:
> >> > I can't find a thread that holds the XFS inode lock that everything
> >> > is waiting on. I think it is the ILOCK, but none of the threads in
> >> > this trace should be holding it where they are blocked. IOWs, the
> >> > output does not give me enough information to get to the root cause.
> >> In case this happens again, was there something more useful I could
> >> have collected? Should I have grabbed all task states?
> > All the task states, including the running tasks, would probably
> > be a good idea. Though I suspect the only way to get to the bottom
> > of it will be to work out a reproducible test case....
> I felt a little bothered by this issue, so I dug in a little further.
> I basically created sample mysql and tar workloads, and since it
> seemed that memory pressure helped trigger the hang, I created an
> allocator as well.
[snip test case]
I think I see the problem, and it's not something that can be easily
fixed. The test-tar process is blocked in xfs_ilock_map_shared(),
trying to get the ip->i_ilock in exclusive mode.
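For reference, xfs_ilock_map_shared() only takes the ILOCK in
exclusive mode when the extent list still has to be read in from
disk. Roughly, from memory (the exact condition may differ in your
tree):

uint
xfs_ilock_map_shared(
	xfs_inode_t	*ip)
{
	uint	lock_mode;

	/*
	 * If the extent list is not yet read in, we need the
	 * lock in exclusive mode to bring it in.
	 */
	if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE) &&
	    ((ip->i_df.if_flags & XFS_IFEXTENTS) == 0))
		lock_mode = XFS_ILOCK_EXCL;
	else
		lock_mode = XFS_ILOCK_SHARED;

	xfs_ilock(ip, lock_mode);
	return lock_mode;
}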
You have a couple of IOs waiting to complete in the xfsdatads,
waiting to get the ip->i_ilock in exclusive mode in a non-blocking
fashion.
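The non-blocking attempt is deliberate - IO completion must not sleep
on the ILOCK or it would stall every other completion queued behind
it. Schematically (paraphrased from memory, not the exact code in
your tree):

/* in xfs_setfilesize(): don't block IO completion on the ILOCK */
if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
	return EAGAIN;

/* in xfs_end_io(): requeue the ioend and back off briefly */
if (error == EAGAIN) {
	atomic_inc(&ioend->io_remaining);
	xfs_finish_ioend(ioend, 0);
	delay(1);	/* the "sleeping for a clock tick" below */
}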
There are test-mysqld processes stuck doing page cache invalidation,
waiting for IO completion to occur (which can't occur because of the
above stuck IOs). These hold the ip->i_iolock in shared mode.
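Schematically, that invalidation path does something like this while
holding the iolock (paraphrased with the generic page cache helpers,
not the exact XFS call chain):

xfs_ilock(ip, XFS_IOLOCK_SHARED);

/*
 * Flush dirty pages over the IO range and wait for writeback -
 * this is where these processes are sleeping, waiting on the
 * completions that are stuck in the xfsdatads above.
 */
filemap_write_and_wait_range(mapping, offset, end);

/* then toss the now-clean cached pages over the range */
invalidate_inode_pages2_range(mapping, start_index, end_index);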
There are a couple of test-mysqld processes stuck on the i_mutex:
2883, 2884, 2886, 2888, 2889, 2891
There's a test-mysqld process stuck waiting for either
the ip->i_iolock or ip->i_ilock in exclusive mode:
And then there is the test-tar process, which is blocked on the
ip->i_ilock trying to get it in _shared_ mode, holding the ip->i_iolock
in shared mode.
And a sync process is stuck waiting for IO completion (no locks held).
So in summary, we have:
1) xfsdatad sleeping for a clock tick because it can't get the
   ip->i_ilock(EXCL) without blocking
2) 2882, 2887 waiting for IO completion, holding ip->i_iolock(SHARED)
   (blocked behind #1)
3) 2885 waiting on the ip->i_iolock(EXCL), holding the i_mutex and
   blocked behind #2 and #5
4) 2883, 2884, 2886, 2888, 2889, 2890, 2891, 2892 waiting on the
   i_mutex, blocked behind #3
5) 2897 doing a buffered read, holding ip->i_iolock(SHARED),
   waiting on ip->i_ilock(SHARED)
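To make the dependency chain easier to see, here it is as a wait-for
graph (an arrow means "waits on"):

  xfsdatad (ioends)  --ILOCK(EXCL) trylock-->  ??? (no visible holder)
  2882, 2887         --IO completion-------->  xfsdatad
  2885               --IOLOCK(EXCL)--------->  2882, 2887, 2897
  2883, 2884, ...    --i_mutex-------------->  2885
  2897 (test-tar)    --ILOCK(SHARED)-------->  ??? (no visible holder)

Every chain bottoms out on an ILOCK that has no visible holder.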
So, this leaves me with the question: what is holding the
ip->i_ilock? Everything is blocked waiting for it and there are no
reported holders, either through stack trace analysis or from the
lockdep lock holder report.
I said this previously:
>> > I can't find a thread that holds the XFS inode lock that everything
>> > is waiting on. I think it is the ILOCK, but none of the threads in
>> > this trace should be holding it where they are blocked. IOWs, the
>> > output does not give me enough information to get to the root cause.
And my conclusion is the same from this analysis - I cannot see why
everything is stuck waiting on an ILOCK that nobody should hold and
that lockdep says nobody actually holds.
I'm starting to suspect memory corruption or something similar that
is corrupting the inode lock state, but I don't hold much hope of
tracking down something like that if it takes hours to reproduce.
Ilia, while I try to think of the next step to take, can you try a
couple of things to see if the hang can be triggered faster? Maybe
try these different initial conditions:
- use a smaller test file
- allocate the file with worst-case fragmentation by
  writing it backwards in synchronous 4k blocks before
  running the test (see the sketch after this list)
- preallocate the test file with fallocate()
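For the backwards-write case, something like this untested sketch
should produce the pattern (the file name and size are placeholders -
adjust them to match your test setup):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK	4096
#define SIZE	(256 * 1024 * 1024)	/* placeholder file size */

int
main(void)
{
	char	buf[CHUNK];
	off_t	off;
	int	fd;

	memset(buf, 0xa5, sizeof(buf));
	fd = open("testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Write the last block first and walk backwards, so every
	 * synchronous write forces a new allocation that can't be
	 * contiguous with the previous one.
	 */
	for (off = SIZE - CHUNK; off >= 0; off -= CHUNK) {
		if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
			perror("pwrite");
			return 1;
		}
	}
	close(fd);
	return 0;
}

For the preallocation variant, a single fallocate(fd, 0, 0, SIZE)
call (with _GNU_SOURCE defined) in place of the loop should give the
contiguous layout to compare against.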
Also, it would be interesting to find out whether a newer kernel,
say 2.6.35, also hangs in a similar manner.