Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)
Yann Dupont
Yann.Dupont at univ-nantes.fr
Fri Oct 26 05:03:56 CDT 2012
On 25/10/2012 23:10, Dave Chinner wrote:
>
>> This time, after 3.6.3 boot, one of my xfs volume refuse to mount :
>>
>> mount: /dev/mapper/LocalDisk-debug--git: can't read superblock
>>
>> [276596.189363] XFS (dm-1): Mounting Filesystem
>> [276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
>> [276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
>> [276596.711329] XFS (dm-1): log mount/recovery failed: error 5
>> [276596.711516] XFS (dm-1): log mount failed
> That's an indication that zeros are being read from the journal
> rather than valid transaction data. It may well be caused by an XFS
> bug, but from experience it is equally likely to be a lower layer
> storage problem. More information is needed.
Hello Dave, did you see my next mail? The fact is that with 3.4.15 the
journal is OK, and the data is, in fact, intact.
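For reference, a non-destructive way to look at such a volume without
replaying the log would be something like this (the mount point is just
an example):

# read-only mount that skips log recovery entirely; nothing on disk
# is modified, but the data may be slightly stale
mount -o ro,norecovery /dev/mapper/LocalDisk-debug--git /mnt/check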
> Firstly:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
OK, sorry I missed it: here is the information. Not sure all of it is
relevant, but here we go.
Each time I will distinguish between the first reported crashes (the Ceph
nodes) and the last one, as the setups are quite different.
--------
kernel version (uname -a): 3.6.1 then 3.6.2, vanilla, hand-compiled, no
proprietary modules. Not running it at the moment, so I can't give you
the exact uname -a output.
------------
xfs_repair version 3.1.7 on the third machine,
xfs_repair version 3.1.4 on the first two machines (part of the Ceph cluster)
-----------
cpu: the same for all 3 machines: Dell PowerEdge M610,
2x Intel(R) Xeon(R) CPU E5649 @ 2.53GHz, hyper-threading
activated (12 physical cores, 24 virtual cores)
-------------
meminfo:
for example, on the 3rd machine:
MemTotal: 41198292 kB
MemFree: 28623116 kB
Buffers: 1056 kB
Cached: 10392452 kB
SwapCached: 0 kB
Active: 180528 kB
Inactive: 10227416 kB
Active(anon): 17476 kB
Inactive(anon): 180 kB
Active(file): 163052 kB
Inactive(file): 10227236 kB
Unevictable: 3744 kB
Mlocked: 3744 kB
SwapTotal: 506040 kB
SwapFree: 506040 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 18228 kB
Mapped: 12688 kB
Shmem: 300 kB
Slab: 1408204 kB
SReclaimable: 1281008 kB
SUnreclaim: 127196 kB
KernelStack: 1976 kB
PageTables: 2736 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 21105184 kB
Committed_AS: 136080 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 398608 kB
VmallocChunk: 34337979376 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7652 kB
DirectMap2M: 2076672 kB
DirectMap1G: 39845888 kB
----
/proc/mounts:
root@label5:~# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=20592788k,nr_inodes=5148197,mode=755 0 0
devpts /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=4119832k,mode=755 0 0
/dev/mapper/LocalDisk-root / xfs rw,relatime,attr2,noquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
/dev/sda1 /boot ext2 rw,relatime,errors=continue 0 0
** /dev/mapper/LocalDisk-debug--git /mnt/debug-git xfs
rw,relatime,attr2,noquota 0 0 ** <- this is the one that was failing on 3.6.x
configfs /sys/kernel/config configfs rw,relatime 0 0
ocfs2_dlmfs /dlm ocfs2_dlmfs rw,relatime 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
This volume is on a local RAID1 disk.
On one of the first 2 nodes:
root@hanyu:~# cat /proc/mounts
rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=20592652k,nr_inodes=5148163,mode=755 0 0
none /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/37dd603c-168c-49de-830d-ef1b5c6982f8 / xfs
rw,relatime,attr2,noquota 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/sdk1 /boot ext2 rw,relatime,errors=continue 0 0
none /var/local/cgroup cgroup
rw,relatime,net_cls,freezer,devices,memory,cpuacct,cpu,debug,cpuset 0 0
** /dev/mapper/xceph--hanyu-data /XCEPH-PROD/data xfs
rw,noatime,attr2,filestreams,nobarrier,inode64,logbsize=256k,noquota 0 0 **
<- this is the volume that failed
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
Please note that on this server, nobarrier is used because the volume is
on a battery-backed Fibre Channel RAID array.
--------------
/proc/partitions:
quite complicated on the Ceph node:
root@hanyu:~# cat /proc/partitions
major minor #blocks name
11 0 1048575 sr0
8 32 6656000000 sdc
8 48 5063483392 sdd
8 64 6656000000 sde
8 80 5063483392 sdf
8 96 6656000000 sdg
8 112 5063483392 sdh
8 128 6656000000 sdi
8 144 5063483392 sdj
8 160 292421632 sdk
8 161 273073 sdk1
8 162 530145 sdk2
8 163 2369587 sdk3
8 164 289242292 sdk4
254 0 6656000000 dm-0
254 1 5063483392 dm-1
254 2 5242880 dm-2
254 3 11676106752 dm-3
Please note that we use multipath here, with 4 paths per LUN:
root@hanyu:~# multipath -ll
mpath2 (3600d02310006674500000001414d677d) dm-1 IFT,S16F-R1840-4
size=4.7T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:96 sdf 8:80 active ready running
| `- 6:0:1:96 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
|- 0:0:0:96 sdd 8:48 active ready running
`- 6:0:0:96 sdh 8:112 active ready running
mpath1 (3600d02310006674500000000414d677d) dm-0 IFT,S16F-R1840-4
size=6.2T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:32 sde 8:64 active ready running
| `- 6:0:1:32 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
|- 0:0:0:32 sdc 8:32 active ready running
`- 6:0:0:32 sdg 8:96 active ready running
On the 3rd machine, the setup is much simpler:
root@label5:~# cat /proc/partitions
major minor #blocks name
8 0 292421632 sda
8 1 257008 sda1
8 2 506047 sda2
8 3 1261102 sda3
8 4 140705302 sda4
254 0 2609152 dm-0
254 1 104857600 dm-1
254 2 31457280 dm-2
--------------
raid layout:
On the first 2 machines (part of the Ceph cluster), the data is on RAID5
on a Fibre Channel RAID array, accessed through an Emulex Fibre Channel
HBA (LightPulse, lpfc).
On the 3rd, data is on RAID1 behind a Dell PERC (LSI Logic / Symbios
Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08), driver mptsas).
--------------
LVM config:
root@hanyu:~# vgs
VG #PV #LV #SN Attr VSize VFree
LocalDisk 1 1 0 wz--n- 275,84g 270,84g
xceph-hanyu 2 1 0 wz--n- 10,91t 41,36g
root@hanyu:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
log LocalDisk -wi-a- 5,00g
data xceph-hanyu -wi-ao 10,87t
and on the 3rd machine:
root@label5:~# vgs
VG #PV #LV #SN Attr VSize VFree
LocalDisk 1 3 0 wz--n- 134,18g 1,70g
root@label5:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
1 LocalDisk -wi-a- 30,00g
debug-git LocalDisk -wi-ao 100,00g
root LocalDisk -wi-ao 2,49g
root@label5:~#
-------------------
type of disks:
on the RAID array, I'd say it is not very important (SEAGATE ST32000444SS
nearline SAS 2TB);
on the 3rd machine: TOSHIBA MBF2300RC DA06
---------------------
write cache status:
on the RAID array, write cache is activated globally for the array
BUT is explicitly disabled on the drives.
On the 3rd machine, it is disabled as far as I know.
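If it helps, the drive-level setting can be double-checked with something
like the following (device names are just examples; sdparm for the SAS
drives, hdparm for ATA ones):

# WCE: 0 means the drive write cache is off
sdparm --get=WCE /dev/sda
# alternative for (S)ATA drives
hdparm -W /dev/sda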
-------------------
Size of BBWC: 2 or 4 GB on the RAID arrays; none on the 3rd machine.
------------------
xfs_info:
root@hanyu:~# xfs_info /dev/xceph-hanyu/data
meta-data=/dev/mapper/xceph--hanyu-data isize=256 agcount=11,
agsize=268435455 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=2919026688, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
(no sunit or swidth on this one)
root@label5:~# xfs_info /dev/LocalDisk/debug-git
meta-data=/dev/mapper/LocalDisk-debug--git isize=256 agcount=4,
agsize=6553600 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=26214400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=12800, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
-----
dmesg: you already have that information.
For iostat, etc., I need to try to reproduce the load.
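When I do, I can capture I/O and memory statistics with something along
these lines (interval and filenames are arbitrary):

# extended per-device stats in MB/s, sampled every 5 seconds
iostat -x -d -m 5 > iostat.log &
# basic memory / paging activity over the same period
vmstat 5 > vmstat.log &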
> Secondly, is the system still in this state? If so, dump the log to
No. The first 2 nodes have been repaired with xfs_repair. On one the
repair completed, and the result was a terrible mess.
On the second, xfs_repair segfaulted. I will try with a newer xfs_repair
on a 3.4 kernel.
The 3rd one is now OK, after booting on a 3.4 kernel.
> a file using xfs_logprint, zip it up and send it to me so I can have
> a look at where the log is intact (i.e. likely xfs bug) or contains
> zero (likely storage bug).
>
> If the system is not still in this state, then I'm afraid there's
> nothing that can be done to understand the problem.
I'll try to reproduce a similar problem.
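If it happens again, I'll capture the log before touching anything,
roughly like this (using the device path of the 3rd machine as an
example):

# copy the raw log out of the filesystem without modifying it,
# then compress it for sending
xfs_logprint -C /tmp/xfs_log.img /dev/mapper/LocalDisk-debug--git
gzip /tmp/xfs_log.img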
> You've had two machines crash with problems in the mm subsystem, and
> one filesystem problem that might be hardware related. Bit early to
> be blaming XFS for all your problems, I think....
I'm not trying to blame XFS. I've been very confident in it for a long
time. BUT I see a very different behaviour in those 3 cases. Nothing
conclusive yet. I think the problem is related to kernel 3.6, maybe in
the dm layer.
I don't think it's hardware related: different disks, different
controllers, different machines.
The common points are:
- XFS
- Kernel 3.6.x
- Device Mapper + LVM
>> xfs_repair -n seems to show volume is quite broken :
> Sure, if the log hasn't been replayed then it will be - the
> filesystem will only be consistent after log recovery has been run.
>
Yes, but I have had to use xfs_repair -L in the past (power outages,
hardware failures) and never had such disastrous repairs.
At least on the first 2 failures, I can understand it: there is a lot of
data, the journal is BIG, and the number of I/O transactions in flight is
quite high.
On the 3rd failure I'm very sceptical: low I/O load, small volume.
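(To be clear, by "use xfs_repair -L" I mean the usual last-resort
sequence, roughly like this; -L throws away whatever is left in the
journal:)

# dry run first: report problems without touching the volume
xfs_repair -n /dev/mapper/xceph--hanyu-data
# last resort: zero the log, then repair
xfs_repair -L /dev/mapper/xceph--hanyu-data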
> You should report the mm problems to linux-mm at kvack.org to make sure
> the right people see them and they don't get lost in the noise of
> lkml....
Yes, point taken.
I'll now try to reproduce this kind of behaviour on a very small
volume (10 GB for example) so I can confirm or rule out the given scenario.
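Something like this should be enough to set up the small test volume
(LV name and size are just an example; any VG with enough free space,
e.g. xceph-hanyu, would do):

# create a small LV, make an XFS filesystem on it and mount it
lvcreate -L 10G -n xfstest xceph-hanyu
mkfs.xfs /dev/xceph-hanyu/xfstest
mkdir -p /mnt/xfstest
mount /dev/xceph-hanyu/xfstest /mnt/xfstest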
Thanks for your time,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont at univ-nantes.fr