pmlogger_check stuck if host is down

To: pcp@xxxxxxxxxxx
Subject: pmlogger_check stuck if host is down
From: Rares Vernica <rvernica@xxxxxxxxx>
Date: Thu, 28 Apr 2016 15:46:14 -0700
Hello,

I have pmlogger collecting logs from multiple hosts. My control file looks something like this:

LOCALHOSTNAME y  n PCP_LOG_DIR/pmlogger/LOCALHOSTNAME -r -T24h10m -c config.server

b-01 n  n PCP_LOG_DIR/pmlogger/b-01 -r -T24h10m -c config.remote
b-02 n  n PCP_LOG_DIR/pmlogger/b-02 -r -T24h10m -c config.remote
b-03 n  n PCP_LOG_DIR/pmlogger/b-03 -r -T24h10m -c config.remote
b-12 n  n PCP_LOG_DIR/pmlogger/b-12 -r -T24h10m -c config.remote
b-13 n  n PCP_LOG_DIR/pmlogger/b-13 -r -T24h10m -c config.remote
b-14 n  n PCP_LOG_DIR/pmlogger/b-14 -r -T24h10m -c config.remote

It works fine if all the remote hosts are up. I am using the default cron.d/pcp-pmlogger file, which runs pmlogger_check twice an hour:

25,55  *  *  *  *  pcp  /usr/libexec/pcp/bin/pmlogger_check -C

If one of the remote hosts is down, pmlogger_check gets stuck on that host and takes about 30 minutes to move on. I ran pmlogger_check with -VV and the output looks like this:

Check pmlogger -h b-02 ... in /var/log/pcp/pmlogger/b-02 ...
... try /var/lib/pcp/tmp/pmlogger/13865 host=b-13 arch=/var/log/pcp/pmlogger/b-13/20160428.12.50: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/16486 host=b-13 arch=/var/log/pcp/pmlogger/b-13/20160428.09.17: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/18026 host=it arch=/var/log/pcp/pmlogger/it/20160428.12.37: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/21828 host=b-01 arch=/var/log/pcp/pmlogger/b-01/20160428.12.37: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/2324 host=b-01 arch=/var/log/pcp/pmlogger/b-01/20160428.12.49: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/26261 host=it arch=/var/log/pcp/pmlogger/it/20160428.12.37: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/28642 host=b-14 arch=/var/log/pcp/pmlogger/b-14/20160428.12.50: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/30914 host=it arch=/var/log/pcp/pmlogger/it/20160428.12.49: match=0 different directory, skip
... try /var/lib/pcp/tmp/pmlogger/6162 host=b-03 arch=/var/log/pcp/pmlogger/b-03/20160428.12.49: match=0 different directory, skip
[stuck for 30min here]
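
A quick way to confirm which control-file hosts are unreachable, independently of pmlogger_check, can be sketched as follows. This is only a sketch under stated assumptions: it needs bash (for its /dev/tcp redirection) and the coreutils timeout command, and it assumes pmcd listens on its default port 44321. The function names control_hosts and probe_port are made up for illustration and are not part of PCP.

```shell
# Hypothetical helpers, not part of PCP.

# Print the host field of each pmlogger entry in a control file read from
# stdin, skipping blank lines, comments, $variable lines and LOCALHOSTNAME.
control_hosts() {
    awk 'NF >= 4 && $1 !~ /^[#$]/ && $1 != "LOCALHOSTNAME" { print $1 }'
}

# Print "up" if a TCP connection to host:port succeeds within secs seconds,
# otherwise print "down". Port defaults to 44321 (assumed pmcd default).
probe_port() {
    host="$1"; port="${2:-44321}"; secs="${3:-5}"
    # bash opens a TCP connection via /dev/tcp; timeout(1) bounds the wait.
    if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo up
    else
        echo down
    fi
}
```

With these defined, something like the following would list each remote host with its status (the control file path is an assumption and may differ on other installs):

control_hosts < /etc/pcp/pmlogger/control | while read -r h; do echo "$h $(probe_port "$h")"; done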

The list of processes running looks like this:

> ps ax | grep pml
 1758 pts/0    S+     0:00 sudo su -s /bin/bash -c /usr/libexec/pcp/bin/pmlogger_check -C -VV pcp
 1759 pts/0    S+     0:00 su -s /bin/bash -c /usr/libexec/pcp/bin/pmlogger_check -C -VV pcp
 1760 ?        Ss     0:00 /bin/sh /usr/libexec/pcp/bin/pmlogger_check -C -VV
 1796 ?        S      0:00 /bin/sh /usr/libexec/pcp/bin/pmlogger_check -C -VV
 2211 ?        S      0:00 /bin/sh /usr/libexec/pcp/bin/pmlogconf -r -c -q -h b-02 /tmp/pcp.19CoEE3M7/pmlogger
 3743 ?        S      0:00 /bin/sh /usr/libexec/pcp/bin/pmlogconf-setup -h b-02 /var/lib/pcp/config/pmlogconf/apache/uptime
 3757 ?        S      0:00 /bin/sh /usr/libexec/pcp/bin/pmlogconf-setup -h b-02 /var/lib/pcp/config/pmlogconf/apache/uptime

Here is the version info:

> yum info pcp
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * extras: mirror.compevo.com
 * updates: mirrors.cat.pdx.edu
Installed Packages
Name        : pcp
Arch        : x86_64
Version     : 3.10.6
Release     : 2.el7
Size        : 2.9 M
Repo        : installed
From repo   : base
Summary     : System-level performance monitoring and performance management
URL         : http://www.pcp.io
License     : GPLv2+ and LGPLv2.1+ and CC-BY
Description : Performance Co-Pilot (PCP) provides a framework and services to support
            : system-level performance monitoring and performance management.
            :
            : The PCP open source release provides a unifying abstraction for all of
            : the interesting performance data in a system, and allows client
            : applications to easily retrieve and process any subset of that data.


It seems that pmlogconf is causing the delay: the process list shows pmlogconf and pmlogconf-setup being run against the down host (-h b-02). I am not sure what is happening, but it does not look right. Any thoughts?

Thanks!
Rares


