
Re: [pcp] pmlogger_check stuck if host is down

To: Rares Vernica <rvernica@xxxxxxxxx>
Subject: Re: [pcp] pmlogger_check stuck if host is down
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Thu, 28 Apr 2016 23:03:56 -0400 (EDT)
Cc: pcp@xxxxxxxxxxx
Hi Rares,

----- Original Message -----
> [...]
> If one of the remote hosts is down, pmlogger_check gets stuck on that host
> and takes about 30 min to move on. I ran pmlogger_check with -VV and the
> output looks like:
> 
> [...]
> > ps ax | grep pml

(Out of curiosity, are any pmprobe processes running?  That grep would have
excluded them, but I wonder if that's where the blockage is.)

> 1758 pts/0 S+ 0:00 sudo su -s /bin/bash -c /usr/libexec/pcp/bin/pmlogger_check -C -VV pcp
> 1759 pts/0 S+ 0:00 su -s /bin/bash -c /usr/libexec/pcp/bin/pmlogger_check -C -VV pcp
> 1760 ? Ss 0:00 /bin/sh /usr/libexec/pcp/bin/pmlogger_check -C -VV
> 1796 ? S 0:00 /bin/sh /usr/libexec/pcp/bin/pmlogger_check -C -VV
> 2211 ? S 0:00 /bin/sh /usr/libexec/pcp/bin/pmlogconf -r -c -q -h b-02 /tmp/pcp.19CoEE3M7/pmlogger
> 3743 ? S 0:00 /bin/sh /usr/libexec/pcp/bin/pmlogconf-setup -h b-02 /var/lib/pcp/config/pmlogconf/apache/uptime
> 3757 ? S 0:00 /bin/sh /usr/libexec/pcp/bin/pmlogconf-setup -h b-02 /var/lib/pcp/config/pmlogconf/apache/uptime
> [...]
> It seems that pmlogconf is causing the delay. I am not sure what is happening
> but it does not look right. Any thoughts?

Yep, it's definitely not right.  Looks like the "probe" clause in an apache
logconf template has got stuck - it might just be the first template though.
Probably in its use of pmprobe(1) ... could you check whether there are any
pmprobe processes running and, if so, which syscall they are blocked in?
(strace will show this.)
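
One way to do that check, as a rough sketch only: the helper below walks
/proc on Linux, finds processes whose command name matches exactly (so it
does not depend on a grep pattern like "pml"), and prints the kernel wait
channel for each.  The name blocked_in is made up for this example and is
not part of PCP, and the wait channel is only a hint (it may read as 0 or
be unreadable on some kernels); "strace -p PID" as root remains the way to
see the actual syscall.

```shell
#!/bin/sh
# blocked_in NAME: print "PID NAME WCHAN" for every process whose command
# name is exactly NAME.  Hypothetical helper, Linux-only (reads /proc).
blocked_in() {
    name="$1"
    for dir in /proc/[0-9]*; do
        # Skip entries that vanished or that we cannot read.
        [ -r "$dir/comm" ] || continue
        [ "$(cat "$dir/comm" 2>/dev/null)" = "$name" ] || continue
        # wchan names the kernel function the process is sleeping in,
        # when the kernel exposes it; fall back to "unknown" otherwise.
        wchan=$(cat "$dir/wchan" 2>/dev/null) || wchan=""
        printf '%s %s %s\n' "${dir#/proc/}" "$name" "${wchan:-unknown}"
    done
}

# Example (not run here):
#   blocked_in pmprobe
#   strace -p PID        # for a PID reported above, run as root
```

No output from "blocked_in pmprobe" would suggest the hang is elsewhere,
for example in the pmlogconf-setup loop itself.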

They should time out on the attempt to connect to pmcd, but it appears that's
not happening, or that we are getting stuck in a loop in pmlogconf-setup,
trying to probe repeatedly.
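
To test the timeout theory by hand, something like the following sketch
could help.  It assumes GNU coreutils timeout(1), which exits with status
124 when the deadline expires, and the PMCD_CONNECT_TIMEOUT environment
variable described in PCPIntro(1).  The function name probe_with_deadline
and the 5 and 10 second values are made up for illustration; b-02 is the
host from the report above.

```shell
#!/bin/sh
# probe_with_deadline SECONDS COMMAND [ARGS...]
# Run COMMAND under a hard wall-clock limit and report when it is hit.
# Hypothetical helper; relies on GNU coreutils timeout(1).
probe_with_deadline() {
    deadline="$1"
    shift
    status=0
    timeout "$deadline" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is the status GNU timeout uses for an expired deadline.
        echo "timed out after ${deadline}s: $*" >&2
    fi
    return "$status"
}

# Example (not run here): ask the PCP client library to give up on the
# pmcd connection after 5s, with a 10s hard stop as a backstop.
#   PMCD_CONNECT_TIMEOUT=5 probe_with_deadline 10 pmprobe -h b-02 pmcd.version
```

If the hard stop fires well before pmprobe gives up on its own, the connect
timeout is not being honoured; if pmprobe returns promptly every time, the
repeated-probe loop in pmlogconf-setup becomes the more likely culprit.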

Thanks.

--
Nathan
