
[xfs-masters] [Bug 10421] Filesystem failing without any warning or etc

To: xfs-masters@xxxxxxxxxxx
Subject: [xfs-masters] [Bug 10421] Filesystem failing without any warning or etc
From: bugme-daemon@xxxxxxxxxxxxxxxxxxx
Date: Wed, 23 Apr 2008 18:37:00 -0700 (PDT)
In-reply-to: <bug-10421-470@http.bugzilla.kernel.org/>
Reply-to: xfs-masters@xxxxxxxxxxx
Sender: xfs-masters-bounce@xxxxxxxxxxx
http://bugzilla.kernel.org/show_bug.cgi?id=10421





------- Comment #30 from dgc@xxxxxxx  2008-04-23 18:36 -------
An incore inode saying its inode number is -1 (NULLFSINO)? That's a
new one. That would also explain the:

"xfs_dilocate: agno (127) >= mp->m_sb.sb_agcount (16)"

error message as well.

The reason we are trying to look up the inode is that the inode number matches
the base unlinked inode list number (NULLAGINO). This actually means
"this unlinked list is empty", so we are right to be shutting down here, as
there's definitely in-memory corruption of some kind.

FWIW, this is definitely an in-memory problem as the inode number does not
exist on disk at all...

What I don't understand yet is why, if the inode number is -1, we didn't
abort earlier in xfs_iunlink_remove when we tried to read way beyond the
end of device to pull in the AGI buffer....

Hmmmm. I suspect that this is not a full 64 bit word corruption. I think
the lower 32bits of the inode number have been overwritten with 0xffffffff.
The agno printed in the above error message should have been much larger
than 127 (0x7f) if it was a full 64 bit word corruption.

If we look at the agino - 0x1ffffff - that's 25 bits of 1s, and the agno is
7 bits of 1s that are packed tightly against the agino. Hence we have 32 bits
of 1s in a 64 bit word.
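
To make the bit arithmetic concrete, here is a rough sketch of the split. It
assumes the agino occupies the low 25 bits of the inode number (inferred from
the 25 + 7 = 32 reasoning above) and that the agno is the remaining bits cast
down to 32 bits. split_ino and AGINO_BITS are hypothetical names for
illustration, not the kernel's own code.

```python
# Sketch only: AGINO_BITS = 25 is an assumption inferred from the comment
# above (25 bits of agino + 7 bits of agno = 32 bits of 1s).
AGINO_BITS = 25
AG_COUNT = 16                      # sb_agcount from the error message

NULLFSINO = (1 << 64) - 1          # -1 as a 64 bit inode number
NULLAGINO = (1 << 32) - 1          # -1 as a 32 bit AG-relative inode number


def split_ino(ino):
    """Hypothetical helper: split an inode number into (agno, agino).

    The agno is truncated to 32 bits, on the assumption that the AG
    number is held in a 32 bit type.
    """
    agno = (ino >> AGINO_BITS) & 0xFFFFFFFF
    agino = ino & ((1 << AGINO_BITS) - 1)
    return agno, agino


# Only the lower 32 bits overwritten with 0xffffffff (upper 32 bits zero):
low32_agno, low32_agino = split_ino(0x00000000FFFFFFFF)
print(low32_agno, hex(low32_agino))   # 127 0x1ffffff

# Full 64 bit word overwritten with -1:
full_agno, full_agino = split_ino(NULLFSINO)
print(full_agno, hex(full_agino))     # 4294967295 0x1ffffff

# Either way the agno fails the "agno >= sb_agcount" check:
print(low32_agno >= AG_COUNT, full_agno >= AG_COUNT)   # True True
```

Under these assumptions, only the 32 bit case yields the agno of 127 seen in
the error message. A full 64 bit overwrite would give a far larger agno.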

On top of that, the corruption had to happen between calculating the agno
from ip->i_ino and calculating the agino from ip->i_ino. In between this we
read the AGI buffer, so there's a big window for corruption to occur there
(possible disk I/O).

I think we are definitely looking at some kind of memory corruption occurring
here and we are no closer to isolating the cause at all. It's likely to be
a use-after-free or some such issue. Can you please turn on every single
memory debugging option (slab poisoning, etc) in the kernel to try to
catch whatever is corrupting memory?

Ideally, I'd like you to run with KDB or another debugger so you can dump the
memory around the inode that has been corrupted so we can see whether or not
it's just this 32 bit word that has been corrupted.


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

