
Daily crash in xfs_cmn_err

To: xfs@xxxxxxxxxxx
Subject: Daily crash in xfs_cmn_err
From: Juerg Haefliger <juergh@xxxxxxxxx>
Date: Mon, 29 Oct 2012 11:55:15 +0100
Hi,

I have a node that crashed every day at 6:25am in xfs_cmn_err (NULL
pointer dereference). I'm running 2.6.38 (yes, I know it's old) and
I'm not reporting a bug, as I'm sure the problem has been fixed in
newer kernels :-) The crashes occurred daily when logrotate ran;
/var/log is a separate 100GB XFS volume. I finally took the node down
and ran xfs_check on the /var/log volume, and it did indeed report
some errors/warnings (see end of email). I then ran xfs_repair (output
also at the end of the email), and the node has been stable ever since.

So my questions are:

1) I was under the impression that some sort of check/repair is
performed while an XFS volume is being mounted. How does that differ
from running xfs_check and/or xfs_repair?

2) Any ideas how the filesystem might have gotten into this state? I
don't have the history of that node, but it's possible that it crashed
previously due to an unrelated problem. Could that have left the
filesystem in this state?

3) What exactly does the output of xfs_check mean? How serious is it?
Are those warnings or errors? Will some of them get cleaned up when
the filesystem is mounted?

4) We have a whole bunch of production nodes running the same kernel.
I'm more than a little concerned that we might have a ticking time
bomb, with some filesystems in a state that might eventually trigger a
crash. Is there any way to perform a live check on a mounted
filesystem so that I can get an idea of how big a problem we have
(if any)? I don't claim to know exactly what I'm doing, but I picked a
node, froze the filesystem, and then ran a modified xfs_check (which
bypasses the is_mounted check and ignores non-committed metadata), and
it did report some issues. At this point I believe those are false
positives. Do you have any suggestions short of rebooting the nodes
and running xfs_check on the unmounted filesystem?
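For reference, the live check I tried amounts to roughly the following
sketch. xfs_freeze is the stock xfsprogs tool; the patched xfs_check
binary and its name are my own hack, not something that ships with
xfsprogs, and the paths are just this node's:

```shell
#!/bin/sh
# Sketch of a live check on a mounted XFS filesystem (requires root).
# Assumptions: the mount point and device below are this node's layout,
# and ./xfs_check.patched is a locally modified xfs_check that skips
# the is_mounted test and ignores non-committed metadata.
MNT=/var/log
DEV=/dev/mapper/vg0-varlog

# Quiesce the filesystem: blocks new writes and flushes dirty data/metadata
# so the on-disk image is (mostly) consistent while we read it.
xfs_freeze -f "$MNT"

# Read-only scan of the frozen device. Note: results may still include
# false positives, since the in-core log has not been replayed.
./xfs_check.patched "$DEV"

# Thaw the filesystem so normal writes can resume.
xfs_freeze -u "$MNT"
```

Even with the freeze, I'm not confident the report is trustworthy on a
mounted volume, which is why I'm asking for suggestions.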

Thanks
....Juerg


(initramfs) xfs_check  /dev/mapper/vg0-varlog
block 0/3538 expected type unknown got free2
block 0/13862 expected type unknown got free2
block 0/13863 expected type unknown got free2

<SNIP>

block 0/16983 expected type unknown got free2
block 0/16984 expected type unknown got free2
block 0/21700 expected type unknown got data
block 0/21701 expected type unknown got data

<SNIP>

block 0/21826 expected type unknown got data
block 0/21827 expected type unknown got data
block 0/21700 claimed by inode 178, previous inum 148
block 0/21701 claimed by inode 178, previous inum 148

<SNIP>

block 0/21826 claimed by inode 178, previous inum 148
block 0/21827 claimed by inode 178, previous inum 148
block 0/1250 expected type unknown got data
block 0/1251 expected type unknown got data

<SNIP>

block 0/1264 expected type unknown got data
block 0/1265 expected type unknown got data
block 0/1250 claimed by inode 1706, previous inum 148
block 0/1251 claimed by inode 1706, previous inum 148

<SNIP>

block 0/1264 claimed by inode 1706, previous inum 148
block 0/1265 claimed by inode 1706, previous inum 148
block 0/16729 expected type unknown got data
block 0/16730 expected type unknown got data

<SNIP>

block 0/16889 expected type unknown got data
block 0/16890 expected type unknown got data
block 0/16729 claimed by inode 1710, previous inum 148
block 0/16730 claimed by inode 1710, previous inum 148

<SNIP>

block 0/16889 claimed by inode 1710, previous inum 148
block 0/16890 claimed by inode 1710, previous inum 148
block 0/3523 expected type unknown got data
block 0/3524 expected type unknown got data

<SNIP>

block 0/3536 expected type unknown got data
block 0/3537 expected type unknown got data
block 0/3523 claimed by inode 1994, previous inum 148
block 0/3524 claimed by inode 1994, previous inum 148

<SNIP>

block 0/3536 claimed by inode 1994, previous inum 148
block 0/3537 claimed by inode 1994, previous inum 148
block 0/510 type unknown not expected
block 0/511 type unknown not expected
block 0/512 type unknown not expected

<SNIP>

block 0/2341 type unknown not expected
block 0/2342 type unknown not expected



(initramfs) xfs_repair /dev/mapper/vg0-varlog
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
data fork in ino 148 claims free block 3538
data fork in ino 148 claims free block 13862
data fork in ino 148 claims free block 13863
data fork in ino 148 claims free block 16891
data fork in ino 148 claims free block 16892
data fork in regular inode 178 claims used block 21700
bad data fork in inode 178
cleared inode 178
data fork in regular inode 1706 claims used block 1250
bad data fork in inode 1706
cleared inode 1706
data fork in regular inode 1710 claims used block 16729
bad data fork in inode 1710
cleared inode 1710
data fork in regular inode 1994 claims used block 3523
bad data fork in inode 1994
cleared inode 1994
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
entry "syslog" at block 0 offset 2384 in directory inode 128
references free inode 178
        - agno = 3
        clearing inode number in entry at offset 2384...
        - agno = 2
data fork in ino 148 claims dup extent, off - 0, start - 1250, cnt 16
bad data fork in inode 148
cleared inode 148
entry "nova-api.log" at block 0 offset 552 in directory inode 165
references free inode 1994
        clearing inode number in entry at offset 552...
entry "nova-network.log.6" at block 0 offset 2608 in directory inode
165 references free inode 1706
        clearing inode number in entry at offset 2608...
entry "nova-api.log.6" at block 0 offset 4032 in directory inode 165
references free inode 1710
        clearing inode number in entry at offset 4032...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
entry "syslog.1" in directory inode 128 points to free inode 148
bad hash table for directory inode 128 (no data entry): rebuilding
rebuilding directory inode 128
bad hash table for directory inode 165 (no data entry): rebuilding
rebuilding directory inode 165
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
