
To: "performancecopilot/pcp" <reply+00bd08b65447cd9ca8b0550f4b312c6e546648bf11a7f67d92cf0000000113cb42db92a169ce0a350886@xxxxxxxxxxxxxxxx>
Subject: Re: [pcp] [performancecopilot/pcp] pmcd causes complete system lockup on CentOS 7 on VMware (#107)
From: Mark Goodwin <mgoodwin@xxxxxxxxxx>
Date: Wed, 17 Aug 2016 08:00:45 +1000
Cc: "performancecopilot/pcp" <pcp@xxxxxxxxxxxxxxxxxx>, Comment <comment@xxxxxxxxxxxxxxxxxx>, pcpemail <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <performancecopilot/pcp/issues/107/240239580@xxxxxxxxxx>
References: <performancecopilot/pcp/issues/107@xxxxxxxxxx> <performancecopilot/pcp/issues/107/240239580@xxxxxxxxxx>
The vmcore shows it's the perfevent PMDA. This is a VMware guest, and there are known kernel issues with x86_perf_event_update() calling native_read_pmc(), since VMware (apparently) doesn't implement all the h/w events. Your actual crash was triggered by 'salt-minion', which is also tripping up in native_read_pmc().

I guess h/w perf events are probably not much use in a virtual machine, so either manually comment out 'perfevent' in pmcd.conf, or run /var/lib/pcp/pmdas/perfevent/Remove. You may also have to turn off 'salt-minion'.
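For example, as root (the Remove script should take care of rewriting pmcd.conf and notifying pmcd; the salt-minion commands are the optional extra step mentioned above):

    cd /var/lib/pcp/pmdas/perfevent
    ./Remove
    systemctl stop salt-minion
    systemctl disable salt-minion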

On the PCP side, the perfevent PMDA should probably detect that it's running in a guest and refuse to start unless forced. I'm not sure of a programmatic way to determine that, but there will be one. This particular issue has been reported before; see BZ 1178606 ('general protection fault in native_read_pmc while running perf on VMware guest'), which was filed against RHEL6.
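A minimal sketch of such a check, assuming the PMDA's startup (or Install) script could simply ask systemd-detect-virt (available on CentOS 7); the PERFEVENT_FORCE override is purely hypothetical:

    # skip the PMDA when a hypervisor is detected, unless explicitly forced
    if [ -z "$PERFEVENT_FORCE" ] && systemd-detect-virt --vm >/dev/null 2>&1; then
        echo "perfevent: hypervisor detected, h/w perf events may be unreliable" >&2
        exit 1
    fi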

[  134.800273] general protection fault: 0000 [#1] SMP
[  134.800304] Modules linked in: ext4 mbcache jbd2 rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel ptp pps_core mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_core(OE) mlx_compat(OE) coretemp ppdev sg vmw_balloon pcspkr shpchp parport_pc i2c_piix4 parport vmw_vmci nfsd knem(OE) auth_rpcgss ip_tables nfsv3 nfs_acl nfs lockd grace fscache sd_mod crc_t10dif crct10dif_generic sr_mod cdrom ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel vmwgfx aesni_intel lrw gf128mul glue_helper ablk_helper cryptd serio_raw drm_kms_helper ttm vmxnet3 ahci vmw_pvscsi libahci drm ata_piix libata i2c_core floppy sunrpc dm_mirror dm_region_hash
[  134.800645]  dm_log dm_mod
[  134.800656] CPU: 0 PID: 2934 Comm: salt-minion Tainted: G           OE  ------------   3.10.0-327.13.1.el7.x86_64 #1
[  134.800691] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[  134.800735] task: ffff880073299700 ti: ffff88007977c000 task.ti: ffff88007977c000
[  134.800762] RIP: 0010:[<ffffffff81058d66>]  [<ffffffff81058d66>] native_read_pmc+0x6/0x20
[  134.800796] RSP: 0000:ffff88007ce03ef0  EFLAGS: 00010083
[  134.800814] RAX: ffffffff81957ee0 RBX: 0000000000000000 RCX: 0000000040000002
[  134.800838] RDX: 0000000051c31ddb RSI: ffff88007ce17fa8 RDI: 0000000040000002
[  134.800863] RBP: ffff88007ce03ef0 R08: 000000000000001b R09: 00007fff2dc71714
[  134.800887] R10: 0000000000000001 R11: 00007ff043fd9c40 R12: ffffffff80000001
[  134.800910] R13: ffff880077763400 R14: ffff880077763578 R15: 0000000000000010
[  134.800934] FS:  00007ff045117740(0000) GS:ffff88007ce00000(0000) knlGS:0000000000000000
[  134.800960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  134.800979] CR2: 0000000001342220 CR3: 0000000077c78000 CR4: 00000000001407f0
[  134.801030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  134.801082] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  134.801106] Stack:
[  134.801115]  ffff88007ce03f28 ffffffff81029e03 0000000000000000 ffff880077763400
[  134.801146]  ffff88007ce17fb4 00007ff044ee9540 00007ff044ee9710 ffff88007ce03f38
[  134.801175]  ffffffff8102a079 ffff88007ce03f60 ffffffff811591fe ffff88007323fd90
[  134.801205] Call Trace:
[  134.801215]  <IRQ>
[  134.801223]
[  134.801234]  [<ffffffff81029e03>] x86_perf_event_update+0x43/0x90
[  134.801252]  [<ffffffff8102a079>] x86_pmu_read+0x9/0x10
[  134.801272]  [<ffffffff811591fe>] __perf_event_read+0xfe/0x110
[  134.801294]  [<ffffffff810e6b3d>] flush_smp_call_function_queue+0x5d/0x130
[  134.801318]  [<ffffffff810e7213>] generic_smp_call_function_single_interrupt+0x13/0x30
[  134.801345]  [<ffffffff81046c77>] smp_call_function_single_interrupt+0x27/0x40
[  134.801371]  [<ffffffff81646f9d>] call_function_single_interrupt+0x6d/0x80
[  134.801393]  <EOI>
[  134.801401] Code:
[  134.801411] c0 48 c1 e2 20 89 0e 48 09 c2 48 89 d0 5d c3 66 0f 1f 44 00 00 55 89 f0 89 f9 48 89 e5 0f 30 31 c0 5d c3 66 90 55 89 f9 48 89 e5 <0f> 33 89 c0 48 c1 e2 20 48 09 c2 48 89 d0 5d c3 66 2e 0f 1f 84
[  134.801600] RIP  [<ffffffff81058d66>] native_read_pmc+0x6/0x20
[  134.801633]  RSP <ffff88007ce03ef0>

Interestingly, both the perfevent PMDA and salt-minion were running when the crash occurred, and both were reading a perf event:

crash> ps | grep '^>'
>  2934      1   0  ffff880073299700  RU   3.3  712948  69680  salt-minion
>  4958   4950   1  ffff880073355c00  RU   0.2   76468   3412  pmdaperfevent

crash> bt 4958
PID: 4958   TASK: ffff880073355c00  CPU: 1   COMMAND: "pmdaperfevent"
 #0 [ffff88007cf05e70] crash_nmi_callback at ffffffff810458f2
 #1 [ffff88007cf05e80] nmi_handle at ffffffff8163e8d9
 #2 [ffff88007cf05ec8] do_nmi at ffffffff8163e9f0
 #3 [ffff88007cf05ef0] nmi_restore at ffffffff8163dd13
    [exception RIP: generic_exec_single+314]
    RIP: ffffffff810e687a  RSP: ffff88007323fd90  RFLAGS: 00000202
    RAX: 00000000000008fb  RBX: ffff88007323fd90  RCX: 0000000000000000
    RDX: 00000000000008fb  RSI: 00000000000000fb  RDI: 0000000000000286
    RBP: ffff88007323fdd8   R8: 0000000000000001   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000000
    R13: 0000000000000001  R14: ffff880077763400  R15: ffff88007323fea0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff88007323fd90] generic_exec_single at ffffffff810e687a
 #5 [ffff88007323fde0] smp_call_function_single at ffffffff810e697f
 #6 [ffff88007323fe10] perf_event_read_value at ffffffff811584e2
 #7 [ffff88007323fe40] perf_event_read_value at ffffffff81158533
 #8 [ffff88007323fe80] perf_read at ffffffff81158cf0
 #9 [ffff88007323ff08] vfs_read at ffffffff811de4ec
#10 [ffff88007323ff38] sys_write at ffffffff811df03f
#11 [ffff88007323ff80] sysret_check at ffffffff81645ec9
    RIP: 00007f72e884222d  RSP: 00007fff33242fd8  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: ffffffff81645ec9  RCX: 0000000000000001
    RDX: 0000000000000018  RSI: 0000000000789250  RDI: 0000000000000006
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000051c2fbf5
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000789b48
    R13: 0000000000000000  R14: 0000000000789568  R15: 0000000000789250
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

On Wed, Aug 17, 2016 at 7:08 AM, Ken McDonell <notifications@xxxxxxxxxx> wrote:

The screencast suggests the hang occurs about 20 seconds after pmcd starts, which is interesting: it suggests this is NOT an initialization error, but possibly something pmFetch-related or some self-timer-driven event in a PMDA.

Was pmlogger enabled on this system?
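For example, on a CentOS 7 / systemd host you can check with:

    systemctl is-enabled pmlogger
    systemctl status pmlogger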

Another possible approach is trying to find the PMDA that is responsible (it is unlikely to be pmcd itself). You have 11 PMDAs in /etc/pcp/pmcd/pmcd.conf ... I'd start by commenting out about half of them (insert a # at the start of the line), especially the ones with low-level hardware or deep kernel contact, e.g. perfevent, jbd2, nvidia, slurm, xfs, linux, proc. Then try again.
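As a sketch, that batch disable could look like this (illustrative PMDA names taken from the list above; run as root and adjust to match your pmcd.conf), followed by a pmcd restart:

    sed -i -E 's/^(perfevent|jbd2|nvidia|slurm)[[:space:]]/#&/' /etc/pcp/pmcd/pmcd.conf
    systemctl restart pmcd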

If this survives, you may be able to binary-chop your way to identifying which PMDA is the culprit.
