On Mon, May 16, 2016 at 03:04:28PM +0000, Rich, Jason wrote:
> Gents,
>
> I've been seeing some very strange lockups on XFS recently in our deployment
> and I'm looking for some help directing me to either a workaround or
> solution.
> When I say 'lockup': the process that is trying to do I/O to the XFS volume
> appears to be hung. I have not had a chance to play with a system in this
> state yet, as the person who identified the issue just rebooted the machine
> instead of attempting to diagnose further. So I'm not sure of the actual
> process state when this occurs.
>
Are you running with XFS_DEBUG enabled in the kernel config? If so, this
mode converts any assertion failure to an explicit BUG(), which is
effectively a kernel crash. You probably don't want to run a production
system with XFS_DEBUG enabled unless you're actually trying to debug
something.
> A little background on the environment first:
> - We deploy a product running kernel 3.16.7 (debian 8's kernel), xfsprogs
> 4.3.0, and large storage arrays (40Tb x 14 raid volumes).
> - We have large data (PB) and high throughput (2+ GB/sec) using async IO to
> several large volumes.
> - I think it is important to note that these servers (Dell R720s & R730s) are
> actually net-booting and tend to be ungracefully power cycled at any time
> which we try to mitigate at all costs. That said, the systems seem to run
> fine for days before hitting this issue.
> - The below error occurred on two different machines about 1 day apart. I
> only copy/pasted one of them here. The system was recovered via reboot so I
> don't have all the debugging you might hope for.
> - The volumes are configured in a raid 6 using NLSAS drives with 128k chunk
> size.
>
> The error:
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.788188] XFS: Assertion failed:
> !list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.799247] ------------[ cut here
> ]------------
You're on an older kernel and I don't recall how far back the original
issue goes, but this looks familiar to 4703da7b7877 ("xfs: close xc_cil
list_empty() races with cil commit sequence") from v4.3. Can you try
with that patch applied (it's basically a one-liner)?
Brian
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.804862] kernel BUG at
> fs/xfs/xfs_message.c:107!
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.810773] invalid opcode: 0000
> [#1] PREEMPT SMP
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.816603] Modules linked in:
> mst_pciconf(O) mst_pci(O) tun ib_uverbs(O) fuse xfs libcrc32c mptctl mptbase
> ipmi_poweroff mlx4_ib(O) mlx4_en(O) dm_round_robin ib_sa(O) ib_mad(O)
> ib_core(O) ib_addr(O) mlx4_core(O) mlx_compat(O) mpt2sas(O) ahci megaraid_sas
> libahci scsi_transport_sas bridge stp llc bonding igb(O) ptp pps_core
> usb_storage dm_multipath dm_mod edd ipmi_devintf ipmi_si ipmi_msghandler
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.858837] CPU: 6 PID: 59124
> Comm: pio_writer_12 Tainted: G O 3.16.7.TEK.V7.16.1.0.1b11 #1
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.870142] Hardware name:
> NETSCOUT SSU2000/04N3DF, BIOS 1.3.6 06/03/2015
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.878156] task: ffff882dd397e000
> ti: ffff882db3e50000 task.ti: ffff882db3e50000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.887471] RIP:
> 0010:[<ffffffffa0497f56>] [<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.897461] RSP:
> 0018:ffff882db3e53b38 EFLAGS: 00010292
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.903806] RAX: 0000000000000057
> RBX: ffff883e1c7c9e00 RCX: 0000000000000000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.912227] RDX: ffff882db3e53978
> RSI: ffff88407eece1c8 RDI: 0000000000000246
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.920659] RBP: ffff882db3e53b38
> R08: 0000000000000000 R09: 0000000000000cc7
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.929032] R10: 0000000000000000
> R11: 0000000000000005 R12: ffff883ef69bd800
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.937439] R13: ffff883f768bea40
> R14: ffff883f768bea00 R15: ffff882db3e53ba0
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.945858] FS:
> 00007fffdec51700(0000) GS:ffff88407eec0000(0000) knlGS:0000000000000000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.955775] CS: 0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.962609] CR2: 000000001d593000
> CR3: 0000003f7cac3000 CR4: 00000000001407e0
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.971003] Stack:
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.973679] ffff882db3e53b88
> ffffffffa04ea310 00000000b3e53b88 ffff883f768bea00
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.982930] ffff883d3556b920
> ffff883e1c7c9e00 0000000000000041 ffff8840608be000
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4123.992164] 0000000000000000
> 0000000000000000 ffff882db3e53bd8 ffffffffa049ef7b
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.001348] Call Trace:
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.004576] [<ffffffffa04ea310>]
> xfs_log_commit_cil+0xcd/0x145 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.012232] [<ffffffffa049ef7b>]
> xfs_trans_commit+0xc3/0x1fd [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.019669] [<ffffffffa0496bba>]
> xfs_vn_update_time+0x16f/0x181 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.027427] [<ffffffff8114711c>]
> update_time+0x25/0xac
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.033763] [<ffffffffa047e19a>]
> ? ftrace_raw_event_xfs_lock_class+0x64/0x73 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.043270] [<ffffffff81147230>]
> file_update_time+0x8d/0xb0
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.050011] [<ffffffffa048c441>]
> ? xfs_rw_ilock+0x2c/0x33 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.057178] [<ffffffffa048c940>]
> xfs_file_aio_write_checks+0xae/0xcf [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.065414] [<ffffffffa048ca9d>]
> xfs_file_dio_aio_write+0x116/0x270 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.073593] [<ffffffffa048dae7>]
> ? xfs_file_buffered_aio_write+0x176/0x176 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.082804] [<ffffffffa048db61>]
> xfs_file_write_iter+0x7a/0x105 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.090579] [<ffffffff8116ac40>]
> aio_run_iocb+0x1dc/0x2dd
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.097151] [<ffffffff8116af02>]
> io_submit_one+0x1c1/0x201
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.103812] [<ffffffff8116b02c>]
> do_io_submit+0xea/0x13f
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.110326] [<ffffffff8116b091>]
> SyS_io_submit+0x10/0x12
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.116775] [<ffffffff815de6e9>]
> system_call_fastpath+0x16/0x1b
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.123899] Code: 45 c8 e8 4a fc
> ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 31 c0 48 89 f1 41 89 d0 48 c7 c6 89 84
> 50 a0 48 89 fa 31 ff e8 89 ff ff ff <0f> 0b eb fe 90 90 55 48 89 e5 53 0f 1f
> 44 00 00 48 8b 8f c0 02
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.147242] RIP
> [<ffffffffa0497f56>] assfail+0x22/0x28 [xfs]
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.154142] RSP <ffff882db3e53b38>
> 2016 May 13 18:51:30 g512-ssu-1 kernel: [ 4124.161239] ---[ end trace
> dbd3bab9608b5697 ]---
>
> 2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.891368] XFS: Assertion failed:
> !list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 666 <---
> appropriate line number
> 2016 May 14 17:31:17 g512-ssu-2 kernel: [85687.902402] ------------[ cut here
> ]------------
>
> HOST INFO:
> Proc info:
> processor : 47
> vendor_id : GenuineIntel
> cpu family : 6
> model : 63
> model name : Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz
>
> Meminfo:
> root@g512-ssu2000-2: ~ # cat /proc/meminfo
> MemTotal: 264516692 kB
> MemFree: 181040604 kB
> MemAvailable: 200066392 kB
> Buffers: 32 kB
> Cached: 3722024 kB
> SwapCached: 0 kB
> Active: 31687480 kB
> Inactive: 3307212 kB
> Active(anon): 31300740 kB
> Inactive(anon): 2534776 kB
> Active(file): 386740 kB
> Inactive(file): 772436 kB
> Unevictable: 5188 kB
> Mlocked: 5188 kB
> SwapTotal: 0 kB
> SwapFree: 0 kB
> Dirty: 0 kB
> Writeback: 0 kB
> AnonPages: 31303028 kB
> Mapped: 2647080 kB
> Shmem: 2535460 kB
> Slab: 19452480 kB
> SReclaimable: 18120028 kB
> SUnreclaim: 1332452 kB
> KernelStack: 20704 kB
> PageTables: 83504 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 125966888 kB
> Committed_AS: 42757972 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 1263588 kB
> VmallocChunk: 34358261760 kB
> AnonHugePages: 28125184 kB
> HugePages_Total: 6144
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 12080 kB
> DirectMap2M: 1986560 kB
> DirectMap1G: 268435456 kB
>
> root@g512-ssu2000-2: ~ # cat /proc/mounts
> rootfs / rootfs rw,size=132244472k,nr_inodes=33061118 0 0
> mbc:/iris/opt/TEKplat/V7.16.1.0.1b11/x86_64_root / nfs
> ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=200
> none /mnt tmpfs rw,relatime 0 0
> dev /dev tmpfs rw,relatime,size=10240k 0 0
> none /tmp tmpfs rw,relatime 0 0
> proc /proc proc rw,relatime 0 0
> sys /sys sysfs rw,relatime 0 0
> none /local-blade tmpfs rw,relatime 0 0
> mbc:/iris /iris nfs
> ro,noatime,vers=3,rsize=4096,wsize=4096,namlen=255,acregmin=300,acregmax=300,acdirmin=300,acdirmax=300,hard,nolock,proto=udp,port=2049,timeo=7,retrans=3,sec=sys,local_0
> none /mnt/initramfs tmpfs rw,relatime 0 0
> tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
> tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
> debugfs /sys/kernel/debug debugfs rw,relatime 0 0
> devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
> tek.iris.mbc:/iris/scratch /local-blade/scratch nfs
> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nolock,proto=tcp,timeo=5,retrans=2,sec=sys,mountaddr=192.168.0.1,mountve0
> /dev/mapper/tek_ssu_sdb-SCRATCH /mnt/scratch xfs
> rw,relatime,attr2,inode64,noquota 0 0
> none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
> nodev /local-blade/hugepages hugetlbfs
> rw,relatime,mode=1770,gid=1000,size=12G 0 0
> /dev/mapper/3600c0ff0001deaddd676a45501000000 /mnt/storage07 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deaddc378a45501000000 /mnt/storage11 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf727774a45501000000 /mnt/storage02 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deadde175a45501000000 /mnt/storage05 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf723b7aa45501000000 /mnt/storage14 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf724878a45501000000 /mnt/storage10 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deaddfe73a45501000000 /mnt/storage01 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf724279a45501000000 /mnt/storage12 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deaddbd79a45501000000 /mnt/storage13 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf726875a45501000000 /mnt/storage04 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deaddcb77a45501000000 /mnt/storage09 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deadd9a73a45501000000 /mnt/index01 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf72cd73a45501000000 /mnt/index02 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=1280,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf725c76a45501000000 /mnt/storage06 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001dcf724f77a45501000000 /mnt/storage08 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
> /dev/mapper/3600c0ff0001deaddee74a45501000000 /mnt/storage03 xfs
> rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=128k,sunit=256,swidth=3072,noquota
> 0 0
>
>
> I am not sure how to effectively reproduce the issue (and to be frank, am far
> from being knowledgable about the innerds of XFS) and would like to
> understand the essence of the problem here. While that rarely leads to root
> cause, knowing what the failure is trying to tell me would be nice.
>
> Jason Rich | Senior Platform Engineer
> NETSCOUT
> O: 469.330.3678
>
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
|