
Re: [pcp] pmcd gets stuck with pmda kill

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>, pcp@xxxxxxxxxxx
Subject: Re: [pcp] pmcd gets stuck with pmda kill
From: Martins Innus <minnus@xxxxxxxxxxx>
Date: Wed, 28 Jan 2015 14:43:57 -0500
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <54C80E1F.1010909@xxxxxxxxxxxxxxxx>
References: <54C7FF66.5090503@xxxxxxxxxxx> <54C80E1F.1010909@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
Ken,

On 1/27/15 5:15 PM, Ken McDonell wrote:

> How is the proc PMDA installed (process or dso)? This suggests pmcd <--> client timeout, not pmcd <--> pmda timeout (which cannot happen for dso pmdas!).
>
> There are two different timeouts in play here: -t or pmcd.control.timeout for pmcd and the $PMCD_*_TIMEOUT family. Which are you using and what values are it/they set to?

Sorry for the lack of background; I had spoken to Nathan and Frank about this previously. All timeouts are set to their defaults, and the proc PMDA runs as a daemon (that is, as a process rather than a DSO).

The main issue I'm trying to solve is this: when a system gets heavily loaded (which seems to correlate with high I/O) and pmlogger is fetching metrics from the proc PMDA at regular intervals, we get the following in pmcd.log:


[Thu Jan 15 10:29:25] pmcd(15873) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=11
[Thu Jan 15 10:29:25] pmcd(15873) Info: CleanupAgent ...
Cleanup "proc" agent (dom 3): protocol failure for fd=11
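
For anyone who wants to spot this failure automatically, here is a minimal Python sketch that scans pmcd.log text for the pduread timeout and the matching agent cleanup. The regular expressions are modeled only on the excerpt above, so other PCP versions may word these messages differently, and the function name `find_agent_failures` is my own, not part of PCP.

```python
import re

# Patterns modeled on the log excerpt quoted above (an assumption:
# other pmcd versions may format these messages differently).
TIMEOUT_RE = re.compile(
    r"pmcd\((?P<pid>\d+)\) Warning: pduread: timeout "
    r"\(after (?P<secs>[\d.]+) sec\).*fd=(?P<fd>\d+)"
)
CLEANUP_RE = re.compile(
    r'Cleanup "(?P<agent>[^"]+)" agent \(dom (?P<domain>\d+)\): '
    r"(?P<reason>.*?) for fd=(?P<fd>\d+)"
)


def find_agent_failures(lines):
    """Return one dict per agent cleanup, paired with the most recent
    pduread timeout seen on the same file descriptor (if any)."""
    timeouts = {}  # fd -> timeout in seconds from the latest warning
    failures = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[int(m.group("fd"))] = float(m.group("secs"))
            continue
        m = CLEANUP_RE.search(line)
        if m:
            fd = int(m.group("fd"))
            failures.append({
                "agent": m.group("agent"),
                "domain": int(m.group("domain")),
                "reason": m.group("reason"),
                "fd": fd,
                "timeout_secs": timeouts.pop(fd, None),
            })
    return failures


SAMPLE = """\
[Thu Jan 15 10:29:25] pmcd(15873) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=11
[Thu Jan 15 10:29:25] pmcd(15873) Info: CleanupAgent ...
Cleanup "proc" agent (dom 3): protocol failure for fd=11
"""

if __name__ == "__main__":
    for failure in find_agent_failures(SAMPLE.splitlines()):
        print(failure)
```

Pairing the two messages by fd is a design choice of the sketch: it lets a cleanup caused by a read timeout be told apart from one logged for some other reason.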


I'd rather not increase the timeout, since we would then be recording incorrect timestamps for the collected data, so I was going to use pmie to restart pmcd when the PMDA dies. Since there is no way to abandon a request, I think we are better off getting no data and then trying again after a restart.

It turns out there is no problem with pmie, just with my understanding of its interaction with pmcd. I did, however, find an issue, which I describe in my other email in this thread.

But this is the basic timeout problem we are trying to solve.

Thanks

Martins
