Gents,
I've been seeing some very strange lockups on XFS recently in our deployment
and I'm looking for some help directing me to either a workaround or solution.
When I say 'lockup': the process that is trying to do I/O to the XFS volume
appears to be hung. I have not had a chance to play with a system in this
state yet, as the person who identified the issue just rebooted the machine
instead of attempting to diagnose further. So I'm not sure of the actual
process state when this occurs.
A little background on the environment first:
- We deploy a product running kernel 3.16.7 (debian 8's kernel), xfsprogs
4.3.0, and large storage arrays (40Tb x 14 raid volumes).
- We have large data (PB) and high throughput (2+ GB/sec) using async IO to
several large volumes.
- I think it is important to note that these servers (Dell R720s & R730s) are
actually net-booting and tend to be ungracefully power cycled at any time which
we try to mitigate at all costs. That said, the systems seem to run fine for
days before hitting this issue.
- The below error occurred on two different machines about 1 day apart. I only
copy/pasted one of them here. The system was recovered via reboot so I don't
have all the debugging you might hope for.
- The volumes are configured in a raid 6 using NLSAS drives with 128k chunk
size.
The error:
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.788188] XFS: Assertion failed:
!list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.799247] ------------[ cut here
]------------
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.804862] kernel BUG at
fs/xfs/xfs_message.c:107!
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.810773] invalid opcode: 0000
[#1] PREEMPT SMP
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.816603] Modules linked in:
mst_pciconf(O) mst_pci(O) tun ib_uverbs(O) fuse xfs libcrc32c mptctl mptbase
ipmi_poweroff mlx4_ib(O) mlx4_en(O) dm_round_robin ib_sa(O) ib_mad(O)
ib_core(O) ib_addr(O) mlx4_core(O) mlx_compat(O) mpt2sas(O) ahci megaraid_sas
libahci scsi_transport_sas bridge stp llc bonding igb(O) ptp pps_core
usb_storage dm_multipath dm_mod edd ipmi_devintf ipmi_si ipmi_msghandler
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.858837] CPU: 6 PID: 59124 Comm:
pio_writer_12 Tainted: G O 3.16.7.TEK.V7.16.1.0.1b11 #1
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.870142] Hardware name: NETSCOUT
SSU2000/04N3DF, BIOS 1.3.6 06/03/2015
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.878156] task: ffff882dd397e000
ti: ffff882db3e50000 task.ti: ffff882db3e50000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.887471] RIP:
0010:[<ffffffffa0497f56>] [<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.897461] RSP:
0018:ffff882db3e53b38 EFLAGS: 00010292
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.903806] RAX: 0000000000000057
RBX: ffff883e1c7c9e00 RCX: 0000000000000000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.912227] RDX: ffff882db3e53978
RSI: ffff88407eece1c8 RDI: 0000000000000246
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.920659] RBP: ffff882db3e53b38
R08: 0000000000000000 R09: 0000000000000cc7
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.929032] R10: 0000000000000000
R11: 0000000000000005 R12: ffff883ef69bd800
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.937439] R13: ffff883f768bea40
R14: ffff883f768bea00 R15: ffff882db3e53ba0
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.945858] FS:
00007fffdec51700(0000) GS:ffff88407eec0000(0000) knlGS:0000000000000000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.955775] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.962609] CR2: 000000001d593000
CR3: 0000003f7cac3000 CR4: 00000000001407e0
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.971003] Stack:
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.973679] ffff882db3e53b88
ffffffffa04ea310 00000000b3e53b88 ffff883f768bea00
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.982930] ffff883d3556b920
ffff883e1c7c9e00 0000000000000041 ffff8840608be000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.992164] 0000000000000000
0000000000000000 ffff882db3e53bd8 ffffffffa049ef7b
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.001348] Call Trace:
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.004576] [<ffffffffa04ea310>]
xfs_log_commit_cil+0xcd/0x145 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.012232] [<ffffffffa049ef7b>]
xfs_trans_commit+0xc3/0x1fd [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.019669] [<ffffffffa0496bba>]
xfs_vn_update_time+0x16f/0x181 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.027427] [<ffffffff8114711c>]
update_time+0x25/0xac
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.033763] [<ffffffffa047e19a>] ?
ftrace_raw_event_xfs_lock_class+0x64/0x73 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.043270] [<ffffffff81147230>]
file_update_time+0x8d/0xb0
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.050011] [<ffffffffa048c441>] ?
xfs_rw_ilock+0x2c/0x33 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.057178] [<ffffffffa048c940>]
xfs_file_aio_write_checks+0xae/0xcf [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.065414] [<ffffffffa048ca9d>]
xfs_file_dio_aio_write+0x116/0x270 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.073593] [<ffffffffa048dae7>] ?
xfs_file_buffered_aio_write+0x176/0x176 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.082804] [<ffffffffa048db61>]
xfs_file_write_iter+0x7a/0x105 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.090579] [<ffffffff8116ac40>]
aio_run_iocb+0x1dc/0x2dd
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.097151] [<ffffffff8116af02>]
io_submit_one+0x1c1/0x201
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.103812] [<ffffffff8116b02c>]
do_io_submit+0xea/0x13f
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.110326] [<ffffffff8116b091>]
SyS_io_submit+0x10/0x12
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.116775] [<ffffffff815de6e9>]
system_call_fastpath+0x16/0x1b
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.123899] Code: 45 c8 e8 4a fc ff
ff c9 c3 55 48 89 e5 0f 1f 44 00 00 31 c0 48 89 f1 41 89 d0 48 c7 c6 89 84 50
a0 48 89 fa 31 ff e8 89 ff ff ff <0f> 0b eb fe 90 90 55 48 89 e5 53 0f 1f 44 00
00 48 8b 8f c0 02
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.147242] RIP
[<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.154142] RSP <ffff882db3e53b38>
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.161239] ---[ end trace
dbd3bab9608b5697 ]---
2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.891368] XFS: Assertion failed:
!list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666 <---
appropriate line number
2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.902402] ------------[ cut here
]------------
HOST INFO:
Proc info:
processor : 47
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz
Meminfo:
root@g512-ssu2000-2: ~ # cat /proc/meminfo
MemTotal: 264516692 kB
MemFree: 181040604 kB
MemAvailable: 200066392 kB
Buffers: 32 kB
Cached: 3722024 kB
SwapCached: 0 kB
Active: 31687480 kB
Inactive: 3307212 kB
Active(anon): 31300740 kB
Inactive(anon): 2534776 kB
Active(file): 386740 kB
Inactive(file): 772436 kB
Unevictable: 5188 kB
Mlocked: 5188 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 31303028 kB
Mapped: 2647080 kB
Shmem: 2535460 kB
Slab: 19452480 kB
SReclaimable: 18120028 kB
SUnreclaim: 1332452 kB
KernelStack: 20704 kB
PageTables: 83504 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 125966888 kB
Committed_AS: 42757972 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1263588 kB
VmallocChunk: 34358261760 kB
AnonHugePages: 28125184 kB
HugePages_Total: 6144
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 12080 kB
DirectMap2M: 1986560 kB
DirectMap1G: 268435456 kB
root@g512-ssu2000-2: ~ # cat /proc/mounts
rootfs / rootfs rw,size=132244472k,nr_inodes=33061118 0 0
mbc:/iris/opt/TEKplat/V7.16.1.0.1b11/x86_64_root / nfs
ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=200
none /mnt tmpfs rw,relatime 0 0
dev /dev tmpfs rw,relatime,size=10240k 0 0
none /tmp tmpfs rw,relatime 0 0
proc /proc proc rw,relatime 0 0
sys /sys sysfs rw,relatime 0 0
none /local-blade tmpfs rw,relatime 0 0
mbc:/iris /iris nfs
ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=2049,timeo=7,retrans=3,sec=sys,local_0
none /mnt/initramfs tmpfs rw,relatime 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tek.iris.mbc:/iris/scratch /local-blade/scratch nfs
rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nolock,proto=tcp,timeo=5,retrans=2,sec=sys,mountaddr=192.168.0.1,mountve0
/dev/mapper/tek_ssu_sdb-SCRATCH /mnt/scratch xfs
rw,relatime,attr2,inode64,noquota 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
nodev /local-blade/hugepages hugetlbfs rw,relatime,mode=1770,gid=1000,size=12G
0 0
/dev/mapper/3600c0ff0001deaddd676a45501000000 /mnt/storage07 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deaddc378a45501000000 /mnt/storage11 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001dcf727774a45501000000 /mnt/storage02 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deadde175a45501000000 /mnt/storage05 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001dcf723b7aa45501000000 /mnt/storage14 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001dcf724878a45501000000 /mnt/storage10 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deaddfe73a45501000000 /mnt/storage01 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001dcf724279a45501000000 /mnt/storage12 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deaddbd79a45501000000 /mnt/storage13 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001dcf726875a45501000000 /mnt/storage04 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deaddcb77a45501000000 /mnt/storage09 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deadd9a73a45501000000 /mnt/index01 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota
0 0
/dev/mapper/3600c0ff0001dcf72cd73a45501000000 /mnt/index02 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota
0 0
/dev/mapper/3600c0ff0001dcf725c76a45501000000 /mnt/storage06 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001dcf724f77a45501000000 /mnt/storage08 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
/dev/mapper/3600c0ff0001deaddee74a45501000000 /mnt/storage03 xfs
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
0 0
I am not sure how to effectively reproduce the issue (and to be frank, am far
from being knowledgable about the innerds of XFS) and would like to understand
the essence of the problem here. While that rarely leads to root cause,
knowing what the failure is trying to tell me would be nice.
Jason Rich | Senior Platform Engineer
NETSCOUT
O: 469.330.3678
|