xfs
[Top] [All Lists]

XFS kernel bug 3.16.7 assertion failed xfs_log_cil.c

To: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Subject: XFS kernel bug 3.16.7 assertion failed xfs_log_cil.c
From: "Rich, Jason" <Jason.Rich@xxxxxxxxxxxx>
Date: Mon, 16 May 2016 15:04:28 +0000
Accept-language: en-US
Delivered-to: xfs@xxxxxxxxxxx
Thread-index: AdGvg91bxzUZ8dq7RQGmUeOAljH1Vg==
Thread-topic: XFS kernel bug 3.16.7 assertion failed xfs_log_cil.c
Gents,

I've been seeing some very strange lockups on XFS recently in our deployment 
and I'm looking for some help directing me to either a workaround or solution.  
When I say 'lockup':  the process that is trying to do I/O to the XFS volume 
appears to be hung.  I have not had a chance to play with a system in this 
state yet, as the person who identified the issue just rebooted the machine 
instead of attempting to diagnose further.  So I'm not sure of the actual 
process state when this occurs.

A little background on the environment first:
- We deploy a product running kernel 3.16.7 (debian 8's kernel), xfsprogs 
4.3.0, and large storage arrays (40Tb x 14 raid volumes). 
- We have large data (PB) and high throughput (2+ GB/sec) using async IO to 
several large volumes. 
- I think it is important to note that these servers (Dell R720s & R730s) are 
actually net-booting and tend to be ungracefully power cycled at any time which 
we try to mitigate at all costs.  That said, the systems seem to run fine for 
days before hitting this issue.
- The below error occurred on two different machines about 1 day apart.  I only 
copy/pasted one of them here.  The system was recovered via reboot so I don't 
have all the debugging you might hope for.
- The volumes are configured in a raid 6 using NLSAS drives with 128k chunk 
size.

The error:
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.788188] XFS: Assertion failed: 
!list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.799247] ------------[ cut here 
]------------
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.804862] kernel BUG at 
fs/xfs/xfs_message.c:107!
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.810773] invalid opcode: 0000 
[#1] PREEMPT SMP
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.816603] Modules linked in: 
mst_pciconf(O) mst_pci(O) tun ib_uverbs(O) fuse xfs libcrc32c mptctl mptbase 
ipmi_poweroff mlx4_ib(O) mlx4_en(O) dm_round_robin ib_sa(O) ib_mad(O) 
ib_core(O) ib_addr(O) mlx4_core(O) mlx_compat(O) mpt2sas(O) ahci megaraid_sas 
libahci scsi_transport_sas bridge stp llc bonding igb(O) ptp pps_core 
usb_storage dm_multipath dm_mod edd ipmi_devintf ipmi_si ipmi_msghandler
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.858837] CPU: 6 PID: 59124 Comm: 
pio_writer_12 Tainted: G           O  3.16.7.TEK.V7.16.1.0.1b11 #1
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.870142] Hardware name: NETSCOUT 
SSU2000/04N3DF, BIOS 1.3.6 06/03/2015
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.878156] task: ffff882dd397e000 
ti: ffff882db3e50000 task.ti: ffff882db3e50000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.887471] RIP: 
0010:[<ffffffffa0497f56>]  [<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.897461] RSP: 
0018:ffff882db3e53b38  EFLAGS: 00010292
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.903806] RAX: 0000000000000057 
RBX: ffff883e1c7c9e00 RCX: 0000000000000000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.912227] RDX: ffff882db3e53978 
RSI: ffff88407eece1c8 RDI: 0000000000000246
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.920659] RBP: ffff882db3e53b38 
R08: 0000000000000000 R09: 0000000000000cc7
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.929032] R10: 0000000000000000 
R11: 0000000000000005 R12: ffff883ef69bd800
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.937439] R13: ffff883f768bea40 
R14: ffff883f768bea00 R15: ffff882db3e53ba0
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.945858] FS:  
00007fffdec51700(0000) GS:ffff88407eec0000(0000) knlGS:0000000000000000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.955775] CS:  0010 DS: 0000 ES: 
0000 CR0: 0000000080050033
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.962609] CR2: 000000001d593000 
CR3: 0000003f7cac3000 CR4: 00000000001407e0
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.971003] Stack:
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.973679]  ffff882db3e53b88 
ffffffffa04ea310 00000000b3e53b88 ffff883f768bea00
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.982930]  ffff883d3556b920 
ffff883e1c7c9e00 0000000000000041 ffff8840608be000
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.992164]  0000000000000000 
0000000000000000 ffff882db3e53bd8 ffffffffa049ef7b
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.001348] Call Trace:
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.004576]  [<ffffffffa04ea310>] 
xfs_log_commit_cil+0xcd/0x145 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.012232]  [<ffffffffa049ef7b>] 
xfs_trans_commit+0xc3/0x1fd [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.019669]  [<ffffffffa0496bba>] 
xfs_vn_update_time+0x16f/0x181 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.027427]  [<ffffffff8114711c>] 
update_time+0x25/0xac
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.033763]  [<ffffffffa047e19a>] ? 
ftrace_raw_event_xfs_lock_class+0x64/0x73 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.043270]  [<ffffffff81147230>] 
file_update_time+0x8d/0xb0
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.050011]  [<ffffffffa048c441>] ? 
xfs_rw_ilock+0x2c/0x33 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.057178]  [<ffffffffa048c940>] 
xfs_file_aio_write_checks+0xae/0xcf [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.065414]  [<ffffffffa048ca9d>] 
xfs_file_dio_aio_write+0x116/0x270 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.073593]  [<ffffffffa048dae7>] ? 
xfs_file_buffered_aio_write+0x176/0x176 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.082804]  [<ffffffffa048db61>] 
xfs_file_write_iter+0x7a/0x105 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.090579]  [<ffffffff8116ac40>] 
aio_run_iocb+0x1dc/0x2dd
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.097151]  [<ffffffff8116af02>] 
io_submit_one+0x1c1/0x201
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.103812]  [<ffffffff8116b02c>] 
do_io_submit+0xea/0x13f
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.110326]  [<ffffffff8116b091>] 
SyS_io_submit+0x10/0x12
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.116775]  [<ffffffff815de6e9>] 
system_call_fastpath+0x16/0x1b
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.123899] Code: 45 c8 e8 4a fc ff 
ff c9 c3 55 48 89 e5 0f 1f 44 00 00 31 c0 48 89 f1 41 89 d0 48 c7 c6 89 84 50 
a0 48 89 fa 31 ff e8 89 ff ff ff <0f> 0b eb fe 90 90 55 48 89 e5 53 0f 1f 44 00 
00 48 8b 8f c0 02
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.147242] RIP  
[<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.154142]  RSP <ffff882db3e53b38>
2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.161239] ---[ end trace 
dbd3bab9608b5697 ]---

2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.891368] XFS: Assertion failed: 
!list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666   <--- 
appropriate line number
2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.902402] ------------[ cut here 
]------------

HOST INFO:
Proc info:
processor       : 47
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz

Meminfo:
root@g512-ssu2000-2: ~ # cat /proc/meminfo
MemTotal:       264516692 kB
MemFree:        181040604 kB
MemAvailable:   200066392 kB
Buffers:              32 kB
Cached:          3722024 kB
SwapCached:            0 kB
Active:         31687480 kB
Inactive:        3307212 kB
Active(anon):   31300740 kB
Inactive(anon):  2534776 kB
Active(file):     386740 kB
Inactive(file):   772436 kB
Unevictable:        5188 kB
Mlocked:            5188 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:      31303028 kB
Mapped:          2647080 kB
Shmem:           2535460 kB
Slab:           19452480 kB
SReclaimable:   18120028 kB
SUnreclaim:      1332452 kB
KernelStack:       20704 kB
PageTables:        83504 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    125966888 kB
Committed_AS:   42757972 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1263588 kB
VmallocChunk:   34358261760 kB
AnonHugePages:  28125184 kB
HugePages_Total:    6144
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       12080 kB
DirectMap2M:     1986560 kB
DirectMap1G:    268435456 kB

root@g512-ssu2000-2: ~ # cat /proc/mounts
rootfs / rootfs rw,size=132244472k,nr_inodes=33061118 0 0
mbc:/iris/opt/TEKplat/V7.16.1.0.1b11/x86_64_root / nfs 
ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=200
none /mnt tmpfs rw,relatime 0 0
dev /dev tmpfs rw,relatime,size=10240k 0 0
none /tmp tmpfs rw,relatime 0 0
proc /proc proc rw,relatime 0 0
sys /sys sysfs rw,relatime 0 0
none /local-blade tmpfs rw,relatime 0 0
mbc:/iris /iris nfs 
ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=2049,timeo=7,retrans=3,sec=sys,local_0
none /mnt/initramfs tmpfs rw,relatime 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tek.iris.mbc:/iris/scratch /local-blade/scratch nfs 
rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nolock,proto=tcp,timeo=5,retrans=2,sec=sys,mountaddr=192.168.0.1,mountve0
/dev/mapper/tek_ssu_sdb-SCRATCH /mnt/scratch xfs 
rw,relatime,attr2,inode64,noquota 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
nodev /local-blade/hugepages hugetlbfs rw,relatime,mode=1770,gid=1000,size=12G 
0 0
/dev/mapper/3600c0ff0001deaddd676a45501000000 /mnt/storage07 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deaddc378a45501000000 /mnt/storage11 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001dcf727774a45501000000 /mnt/storage02 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deadde175a45501000000 /mnt/storage05 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001dcf723b7aa45501000000 /mnt/storage14 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001dcf724878a45501000000 /mnt/storage10 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deaddfe73a45501000000 /mnt/storage01 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001dcf724279a45501000000 /mnt/storage12 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deaddbd79a45501000000 /mnt/storage13 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001dcf726875a45501000000 /mnt/storage04 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deaddcb77a45501000000 /mnt/storage09 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deadd9a73a45501000000 /mnt/index01 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota
 0 0
/dev/mapper/3600c0ff0001dcf72cd73a45501000000 /mnt/index02 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota
 0 0
/dev/mapper/3600c0ff0001dcf725c76a45501000000 /mnt/storage06 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001dcf724f77a45501000000 /mnt/storage08 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0
/dev/mapper/3600c0ff0001deaddee74a45501000000 /mnt/storage03 xfs 
rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
 0 0


I am not sure how to effectively reproduce the issue (and to be frank, am far 
from being knowledgable about the innerds of XFS) and would like to understand 
the essence of the problem here.  While that rarely leads to root cause, 
knowing what the failure is trying to tell me would be nice.

Jason Rich  |  Senior Platform Engineer
NETSCOUT
O: 469.330.3678 

<Prev in Thread] Current Thread [Next in Thread>