88TB filesystem going off-line without warning

To: xfs@xxxxxxxxxxx
Subject: 88TB filesystem going off-line without warning
From: L Ox <lox8096@xxxxxxxxx>
Date: Tue, 2 Apr 2013 11:44:15 -0700
Hi,

We have a new Linux/XFS deployment (about a month old), and the XFS filesystem randomly goes off-line without warning. We are running Scientific Linux release 5.9 with the latest updates.

# uname -a
Linux node24 2.6.18-348.3.1.el5 #1 SMP Mon Mar 11 15:43:13 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
Scientific Linux release 5.9 (Boron)

Here are the errors we see in /var/log/messages after the initial off-line event:

-- snip --

Apr  2 07:50:28 node24 kernel: xfs_iunlink_remove: xfs_inotobp()  returned an error 22 on dm-6.  Returning error.
Apr  2 07:50:28 node24 kernel: xfs_inactive:  xfs_ifree() returned an error = 22 on dm-6
Apr  2 07:50:28 node24 kernel: xfs_force_shutdown(dm-6,0x1) called from line 1419 of file fs/xfs/xfs_vnodeops.c.  Return address = 0xffffffff8855b86b
Apr  2 07:50:28 node24 kernel: Filesystem dm-6: I/O Error Detected.  Shutting down filesystem: dm-6
Apr  2 07:50:28 node24 kernel: Please umount the filesystem, and rectify the problem(s)
Apr  2 07:50:52 node24 kernel: Filesystem dm-6: xfs_log_force: error 5 returned.
Apr  2 07:51:52 node24 last message repeated 2 times

-- snip --
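
For reference, the numeric codes in the messages above can be looked up as errno values. This is only a minimal sketch, assuming the kernel is reporting standard Linux errno numbers here; the helper name describe() is made up for illustration and is not part of any XFS tool.

```python
import errno
import os


def describe(code):
    """Return (symbolic name, description) for a numeric errno value."""
    # errno.errorcode maps numbers to names such as "EINVAL".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable description.
    return name, os.strerror(code)


# The two codes that appear in the log excerpt: 22 and 5.
for code in (22, 5):
    name, text = describe(code)
    print(f"error {code}: {name} ({text})")
```

On Linux this reports error 22 as EINVAL and error 5 as EIO. It only translates the numbers and says nothing about what triggered them.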

Here are the messages logged after I ran umount/xfs_repair/mount on the filesystem:

-- snip --

Apr  2 10:23:04 node24 kernel: xfs_force_shutdown(dm-6,0x1) called from line 420 of file fs/xfs/xfs_rw.c.  Return address = 0xffffffff8855c0fe
Apr  2 10:23:07 node24 kernel: Filesystem dm-6: xfs_log_force: error 5 returned.
Apr  2 10:23:07 node24 last message repeated 4 times
Apr  2 10:24:08 node24 kernel: Filesystem dm-6: Disabling barriers, trial barrier write failed
Apr  2 10:24:08 node24 kernel: XFS mounting filesystem dm-6
Apr  2 10:24:08 node24 kernel: Starting XFS recovery on filesystem: dm-6 (logdev: internal)
Apr  2 10:24:10 node24 kernel: Ending XFS recovery on filesystem: dm-6 (logdev: internal)
Apr  2 10:24:17 node24 multipathd: dm-6: umount map (uevent)
Apr  2 10:58:54 node24 kernel: Filesystem dm-6: Disabling barriers, trial barrier write failed
Apr  2 10:58:54 node24 kernel: XFS mounting filesystem dm-6

-- snip --

We are taking 6 devices from a SAN and using LVM to stripe them into what is effectively a RAID0 block device, which XFS sits on. We do not see any multipathd errors.

I created the filesystem using this command:

# mkfs.xfs -f -d su=256k,sw=6,sectsize=4096,unwritten=0 -i attr=2 -l sectsize=4096,lazy-count=1 -r extsize=4096 /dev/mapper/vol_d24-root

Here are the mount options:

# cat /etc/fstab | grep xfs
/dev/mapper/vol_d24-root            /archive/d24       xfs defaults,inode64        0 9

# mount | grep xfs
/dev/mapper/vol_d24-root on /archive/d24 type xfs (rw,inode64)

Here is the output of xfs_info:

# xfs_info /dev/mapper/vol_d24-root
meta-data="" isize=256    agcount=88, agsize=268435392 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=23441774592, imaxpct=25
         =                       sunit=64     swidth=384 blks, unwritten=0
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
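
As a cross-check, the geometry reported by xfs_info appears consistent with the mkfs.xfs options. The sketch below just redoes the arithmetic, assuming su=256k means 256 KiB, that sunit/swidth are reported in 4096-byte filesystem blocks, and that all AGs except the last are full size. The variable names are illustrative only.

```python
BLOCK_SIZE = 4096             # bsize from xfs_info, in bytes
STRIPE_UNIT = 256 * 1024      # su=256k from the mkfs.xfs command, in bytes
STRIPE_WIDTH_UNITS = 6        # sw=6, one stripe unit per SAN device
DATA_BLOCKS = 23441774592     # blocks= from the data section
AG_COUNT = 88                 # agcount
AG_SIZE = 268435392           # agsize, in blocks

# Stripe unit and stripe width expressed in filesystem blocks.
sunit_blocks = STRIPE_UNIT // BLOCK_SIZE
swidth_blocks = sunit_blocks * STRIPE_WIDTH_UNITS

# Total data size in TiB (2**40 bytes).
size_tib = DATA_BLOCKS * BLOCK_SIZE / 2**40

# Blocks left over for the final (partial) allocation group.
last_ag_blocks = DATA_BLOCKS - (AG_COUNT - 1) * AG_SIZE

print("sunit  =", sunit_blocks, "blks")
print("swidth =", swidth_blocks, "blks")
print("size   = %.2f TiB" % size_tib)
print("last AG =", last_ag_blocks, "blks")
```

This gives sunit=64 and swidth=384, matching the xfs_info output, and a data size a little over 87 TiB.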

After the initial off-line event I did the following:
- umount
- xfs_repair (it told me to mount/umount and then re-run xfs_repair)
- mount
- umount
- xfs_repair

Here is the output of xfs_repair:

-- snip --

# xfs_repair /dev/mapper/vol_d24-root
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
2acde2416940: Badness in key lookup (length)
bp=(bno 14657493984, len 16384 bytes) key=(bno 14657493984, len 8192 bytes)
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
2acde2416940: Badness in key lookup (length)
bp=(bno 26065183200, len 16384 bytes) key=(bno 26065183200, len 8192 bytes)
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
2acde2e17940: Badness in key lookup (length)
bp=(bno 43039175488, len 16384 bytes) key=(bno 43039175488, len 8192 bytes)
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - agno = 32
        - agno = 33
        - agno = 34
        - agno = 35
        - agno = 36
        - agno = 37
        - agno = 38
        - agno = 39
        - agno = 40
        - agno = 41
        - agno = 42
        - agno = 43
        - agno = 44
        - agno = 45
        - agno = 46
        - agno = 47
2acde0613940: Badness in key lookup (length)
bp=(bno 101051527232, len 16384 bytes) key=(bno 101051527232, len 8192 bytes)
2acde0613940: Badness in key lookup (length)
bp=(bno 101081120768, len 16384 bytes) key=(bno 101081120768, len 8192 bytes)
2acde0613940: Badness in key lookup (length)
bp=(bno 102336613216, len 16384 bytes) key=(bno 102336613216, len 8192 bytes)
        - agno = 48
        - agno = 49
2acde2416940: Badness in key lookup (length)
bp=(bno 107185599392, len 16384 bytes) key=(bno 107185599392, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107606543312, len 16384 bytes) key=(bno 107606543312, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107674994560, len 16384 bytes) key=(bno 107674994560, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078656, len 16384 bytes) key=(bno 107675078656, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078688, len 16384 bytes) key=(bno 107675078688, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675078720, len 16384 bytes) key=(bno 107675078720, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107675175008, len 16384 bytes) key=(bno 107675175008, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107704942624, len 16384 bytes) key=(bno 107704942624, len 8192 bytes)
2acde1014940: Badness in key lookup (length)
bp=(bno 107763211904, len 16384 bytes) key=(bno 107763211904, len 8192 bytes)
        - agno = 50
2acde1014940: Badness in key lookup (length)
bp=(bno 109436122656, len 16384 bytes) key=(bno 109436122656, len 8192 bytes)
2acde2e17940: Badness in key lookup (length)
bp=(bno 110466056352, len 16384 bytes) key=(bno 110466056352, len 8192 bytes)
2acde2e17940: Badness in key lookup (length)
bp=(bno 110603835392, len 16384 bytes) key=(bno 110603835392, len 8192 bytes)
        - agno = 51
        - agno = 52
        - agno = 53
        - agno = 54
        - agno = 55
        - agno = 56
        - agno = 57
        - agno = 58
        - agno = 59
        - agno = 60
        - agno = 61
2acde2416940: Badness in key lookup (length)
bp=(bno 132435472416, len 16384 bytes) key=(bno 132435472416, len 8192 bytes)
        - agno = 62
2acde2416940: Badness in key lookup (length)
bp=(bno 135330780000, len 16384 bytes) key=(bno 135330780000, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 135508074496, len 16384 bytes) key=(bno 135508074496, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 135675982432, len 16384 bytes) key=(bno 135675982432, len 8192 bytes)
        - agno = 63
        - agno = 64
        - agno = 65
        - agno = 66
        - agno = 67
        - agno = 68
        - agno = 69
        - agno = 70
        - agno = 71
        - agno = 72
        - agno = 73
        - agno = 74
        - agno = 75
        - agno = 76
        - agno = 77
        - agno = 78
        - agno = 79
        - agno = 80
        - agno = 81
        - agno = 82
        - agno = 83
        - agno = 84
        - agno = 85
        - agno = 86
        - agno = 87
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 2
        - agno = 3
        - agno = 8
        - agno = 9
        - agno = 4
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 19
        - agno = 20
        - agno = 18
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - agno = 32
        - agno = 33
        - agno = 34
        - agno = 35
        - agno = 36
        - agno = 37
        - agno = 38
        - agno = 39
        - agno = 40
        - agno = 41
        - agno = 42
        - agno = 43
        - agno = 44
        - agno = 45
        - agno = 46
        - agno = 47
        - agno = 48
        - agno = 49
        - agno = 50
        - agno = 51
        - agno = 52
        - agno = 53
        - agno = 54
        - agno = 55
        - agno = 56
        - agno = 57
        - agno = 58
        - agno = 59
        - agno = 60
        - agno = 61
        - agno = 62
        - agno = 63
        - agno = 64
        - agno = 65
        - agno = 66
        - agno = 67
        - agno = 68
        - agno = 69
        - agno = 70
        - agno = 71
        - agno = 72
        - agno = 73
        - agno = 74
        - agno = 75
        - agno = 76
        - agno = 77
        - agno = 78
        - agno = 79
        - agno = 80
        - agno = 81
        - agno = 82
        - agno = 83
        - agno = 84
        - agno = 85
        - agno = 86
        - agno = 87
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 202102936036, moving to lost+found
disconnected inode 215350040250, moving to lost+found
disconnected inode 215350208634, moving to lost+found
disconnected inode 271016406074, moving to lost+found
Phase 7 - verify and correct link counts...
done

-- snip --
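
To summarize the "Badness in key lookup (length)" warnings, the bp=/key= lines can be pulled out with a short script. This is a rough sketch that assumes the line format is exactly as shown in the output above; the function name and the two sample lines (copied from that output) are only for illustration, and it does not interpret what the block numbers mean.

```python
import re

# Matches lines such as:
# bp=(bno 14657493984, len 16384 bytes) key=(bno 14657493984, len 8192 bytes)
PATTERN = re.compile(
    r"bp=\(bno (\d+), len (\d+) bytes\) key=\(bno (\d+), len (\d+) bytes\)"
)


def parse_mismatches(text):
    """Return a list of (bno, bp_len, key_len) tuples found in text."""
    found = []
    for match in PATTERN.finditer(text):
        bp_bno, bp_len, key_bno, key_len = map(int, match.groups())
        found.append((bp_bno, bp_len, key_len))
    return found


SAMPLE = """\
2acde2416940: Badness in key lookup (length)
bp=(bno 14657493984, len 16384 bytes) key=(bno 14657493984, len 8192 bytes)
2acde2416940: Badness in key lookup (length)
bp=(bno 26065183200, len 16384 bytes) key=(bno 26065183200, len 8192 bytes)
"""

mismatches = parse_mismatches(SAMPLE)
print(len(mismatches), "length mismatches")
for bno, bp_len, key_len in mismatches:
    print(f"bno {bno}: bp len {bp_len}, key len {key_len}")
```

Feeding the full xfs_repair output to parse_mismatches() instead of SAMPLE would list every affected block number, which may make it easier to see whether the warnings cluster in particular AGs.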

Any ideas?

Thanks