| To: | performancecopilot/pcp <pcp@xxxxxxxxxxxxxxxxxx> |
|---|---|
| Subject: | [performancecopilot/pcp] pmcd causes complete system lockup on CentOS 7 on VMware (#107) |
| From: | Jeff White <notifications@xxxxxxxxxx> |
| Date: | Mon, 15 Aug 2016 12:54:22 -0700 |
| Delivered-to: | pcp@xxxxxxxxxxx |
| List-archive: | https://github.com/performancecopilot/pcp |
| List-id: | performancecopilot/pcp <pcp.performancecopilot.github.com> |
Let me start off by saying I know nothing about PCP.

I installed PCP on about 80 compute nodes of an HPC cluster. Most of these are working fine, but I noticed 4 virtual machines completely lock up and die shortly after pmcd starts, whether it is started during boot or manually after boot. The physical systems, which are configured exactly the same, do not lock up.

By "lock up" I mean the machine is completely dead: not 100% CPU busy, not out of memory, etc., but completely unresponsive. When this happens the system no longer even responds to ping, so the kernel itself (or its networking) is dead. However, I do not see a kernel panic and can't get a crash dump, so I can't see what is happening. The fact that this userland daemon is somehow killing the kernel but not triggering a kernel panic is very odd and worrying.

Here's how I am using PCP. I am using XDMod, which has a plugin, SUPReMM, which requires PCP. I installed and configured PCP via this SaltStack config:

I then enabled and started pmcd, pmlogger, and pmie. At this point the VMs will hang in under a minute. Since there are four machines affected and three daemons in the mix, I set the following for which daemon to start on boot (left side being the names of the VMs) to narrow down the issue:

After a number of reboots only dn2 will hang. I was also able to hang a machine by starting pmcd via systemctl after the system booted. Oddly, I have not been able to hit this issue every time I start the daemon.

Any clue what is going on or how we can proceed with this issue?