
pmda timeout workarounds

To: pcp@xxxxxxxxxxx
Subject: pmda timeout workarounds
From: Martins Innus <minnus@xxxxxxxxxxx>
Date: Thu, 07 May 2015 11:02:36 -0400
Delivered-to: pcp@xxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
Hi,
I'd like to come back to this question that's been floating around for a while. When pmdas take too long to respond to a client request, pmcd gives up on them and closes the pipe. Roughly once or twice a day, usually coinciding with high I/O, we have pmdas that run into this situation:


[Thu Apr 23 12:01:40] pmcd(1470) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=16
[Thu Apr 23 12:01:40] pmcd(1470) Info: CleanupAgent ...
Cleanup "slurm" agent (dom 23): protocol failure for fd=16, exit(1)
[Fri May 1 13:30:50] pmcd(1470) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=30
[Fri May  1 13:30:52] pmcd(1470) Info: CleanupAgent ...
Cleanup "gpfs" agent (dom 135): protocol failure for fd=30, exit(1)
[Fri May 1 13:31:20] pmcd(1470) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=12
[Fri May  1 13:31:22] pmcd(1470) Info: CleanupAgent ...
Cleanup "proc" agent (dom 3): protocol failure for fd=12


We'd rather not increase the timeout, since in some cases late data is worse than no data.

There was some discussion here suggesting that this should be handled on the pmda side:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=1036

Later on, there was some discussion about providing new functionality:

http://www.pcp.io/pipermail/pcp/2014-March/004586.html

The slow startup has already been handled, but now I'm thinking about how to deal with slow responses when the pmda is already running.

Barring any other suggestions, my plan would be to handle these issues on a case-by-case basis, depending on the semantics of the pmda data.

For the slurm pmda, the data is tolerant of some reporting drift, so I would have a separate thread update a shared data structure periodically, and any fetch would just report the most recent information. A sketch of what I mean is below.
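Roughly what I have in mind for slurm (slurm_collect() and the stats struct are just placeholders for whatever the real, possibly slow, collection code does):

/* Periodic-refresh sketch for a daemon pmda.  The fetch path only ever
 * reads the cached snapshot, so it never blocks on the batch system. */
#include <pthread.h>
#include <unistd.h>

typedef struct {
    unsigned long long jobs_running;
    unsigned long long jobs_pending;
} slurm_stats_t;

static slurm_stats_t   cache;          /* latest snapshot, served on fetch */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical slow collector - may block on slurm for seconds. */
extern int slurm_collect(slurm_stats_t *out);

static void *refresh_loop(void *arg)
{
    slurm_stats_t fresh;

    for (;;) {
        if (slurm_collect(&fresh) == 0) {
            pthread_mutex_lock(&cache_lock);
            cache = fresh;             /* publish the new snapshot */
            pthread_mutex_unlock(&cache_lock);
        }
        sleep(30);                     /* refresh interval, tunable */
    }
    return NULL;
}

/* Called from the pmda's fetch callback: copies whatever snapshot the
 * refresher last published, however old it happens to be. */
void slurm_snapshot(slurm_stats_t *out)
{
    pthread_mutex_lock(&cache_lock);
    *out = cache;
    pthread_mutex_unlock(&cache_lock);
}

void slurm_start_refresher(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, refresh_loop, NULL);
}

The refresher thread would be started before the pmda enters its main event loop, so a fetch request can always be answered immediately from the cache.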

For gpfs, we really care about getting the counters in a timely fashion. So can I just return PM_ERR_AGAIN from the pmda fetch if the underlying request is taking too long, and expect the client to do "the right thing"? For a pmlogger instance, the right thing would presumably be to record no value for that timestep. A rough sketch of what I have in mind follows.
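Something like this is what I'm imagining on the fetch side, assuming the usual fetch callback shape where a negative PCP error code is passed back to the client. The gpfs cache, its background updater, and the 5-second freshness window are all placeholders:

/* Sketch: refuse to report gpfs counters that are older than we can
 * tolerate, rather than report stale data. */
#include <pcp/pmapi.h>
#include <pcp/pmda.h>
#include <time.h>

#define GPFS_MAX_AGE 5      /* seconds: older than this counts as "too late" */

static struct {
    time_t             last_update;   /* set by a background collector */
    unsigned long long read_bytes;    /* example counter */
} gpfs_cache;

static int
gpfs_fetch_callback(pmdaMetric *mdesc, unsigned int inst, pmAtomValue *atom)
{
    time_t now = time(NULL);

    (void)mdesc;
    (void)inst;

    /* Late counters are worse than no counters, so fail the value. */
    if (now - gpfs_cache.last_update > GPFS_MAX_AGE)
        return PM_ERR_AGAIN;

    atom->ull = gpfs_cache.read_bytes;
    return 1;               /* one value placed in *atom */
}

Whether pmlogger then does the right thing with PM_ERR_AGAIN for that timestep is exactly the part I'm asking about.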

The proc pmda would probably use a mix of these two approaches, depending on the metric.

Is that the right way forward, or are there other suggestions? This has been occurring more and more for us as we develop pmdas that interact with systems that can introduce delays we cannot control.

Thanks

Martins
