
Re: [pcp] pmda timeout workarounds

To: Martins Innus <minnus@xxxxxxxxxxx>
Subject: Re: [pcp] pmda timeout workarounds
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Fri, 8 May 2015 03:02:48 -0400 (EDT)
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <554B7E8C.5040603@xxxxxxxxxxx>
References: <554B7E8C.5040603@xxxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: gXxm6tcgPOTDgUK4KF+LGz6l1T0Oxw==
Thread-topic: pmda timeout workarounds

Hi Martins,

----- Original Message -----
> [...]
> 
> Barring any other suggestions, my plan would be to handle these issues
> on a case-by-case basis depending on the semantics of the PMDA data.
> 
> For the slurm pmda, the data is tolerant to some reporting drift so I
> would have a separate thread that updates a shared data structure
> periodically, and any fetch just reports the most recent information.

Sounds fine, and it's similar to the sorts of approaches other PMDAs are taking.
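
For the archives, the shape of that pattern as a rough sketch in plain C
with pthreads - the struct, names and interval below are made-up
placeholders, not your actual slurm PMDA code:

    #include <pthread.h>
    #include <time.h>
    #include <unistd.h>

    struct cached_stats {
        long    jobs_running;          /* hypothetical slurm-ish values */
        long    jobs_pending;
        time_t  last_updated;
    };

    static struct cached_stats cache;
    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

    /* stand-in for the slow call into the external system */
    extern int collect_stats(struct cached_stats *out);

    /* started once (e.g. via pthread_create) from PMDA initialisation */
    static void *refresh_loop(void *arg)
    {
        struct cached_stats fresh;

        (void)arg;
        for (;;) {
            if (collect_stats(&fresh) == 0) {
                fresh.last_updated = time(NULL);
                pthread_mutex_lock(&cache_lock);
                cache = fresh;              /* publish the new snapshot */
                pthread_mutex_unlock(&cache_lock);
            }
            sleep(30);                      /* refresh interval, tune to taste */
        }
        return NULL;
    }

    /* called from the fetch path - never blocks on the data source */
    void read_cached_stats(struct cached_stats *out)
    {
        pthread_mutex_lock(&cache_lock);
        *out = cache;
        pthread_mutex_unlock(&cache_lock);
    }

The fetch side only holds the lock long enough to copy the snapshot, so
a slow or wedged data source can never stall the fetch path.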

> For gpfs, we really care about getting the counters in a timely
> fashion.  So can I just return PM_ERR_AGAIN to a client request for a
> pmda fetch if the request is taking too long, and the client will do
> "the right thing"?

Yes.
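
Roughly, in the fetch callback, something like the following (a sketch
only - the metric, variables and staleness check are hypothetical;
PM_ERR_AGAIN and the callback return convention are the real PCP pieces):

    #include <time.h>
    #include <stdint.h>
    #include <pcp/pmapi.h>
    #include <pcp/pmda.h>

    #define MAX_AGE 5                     /* seconds of staleness tolerated */

    extern time_t   gpfs_last_update;     /* maintained by the collection code */
    extern uint64_t gpfs_read_bytes;      /* hypothetical counter */

    static int
    gpfs_fetch_callback(pmdaMetric *mdesc, unsigned int inst, pmAtomValue *atom)
    {
        (void)mdesc; (void)inst;

        /* refuse to hand out stale data; pmcd passes the error code
         * through to the client inside the pmResult for this metric */
        if (time(NULL) - gpfs_last_update > MAX_AGE)
            return PM_ERR_AGAIN;

        atom->ull = gpfs_read_bytes;
        return 1;                         /* one value supplied */
    }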

> Where the right thing for a pmlogger instance is to
> probably record no value for that timestep.

It'll record the exact error code for that metric, within the pmResult,
IIRC.
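
i.e. the error is carried as a negative numval in the pmValueSet for
that metric, so a consumer walking the result would see something like
this (sketch only):

    #include <stdio.h>
    #include <pcp/pmapi.h>

    void report_fetch_errors(pmResult *rp)
    {
        int i;

        for (i = 0; i < rp->numpmid; i++) {
            pmValueSet *vsp = rp->vset[i];

            if (vsp->numval < 0)          /* error code, not a value count */
                fprintf(stderr, "%s: %s\n",
                        pmIDStr(vsp->pmid), pmErrStr(vsp->numval));
        }
    }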

> Proc probably would be a mix of these 2 solutions depending on the metric.
> 
> Is that the right way forward, or any other suggestions? 

I think your approach is good.  Do these PMDAs run as root, as the pcp
user, or as something else?  If as the pcp user, have you had any success
with those earlier experiments using pmie to auto-restart timed-out PMDAs?
(Not a general solution, I know - I just want to know whether there's any
new information there - thanks.)

> This has
> started to occur more and more for us as we develop pmdas that interact
> with systems that may introduce delays that we cannot control.

*nod* - I remember having this issue when extracting NFS mount point
usage stats at one production site; once the problem was identified and
resolved using the background-thread approach, stability was achieved
(even on systems under significant io+net+mem+cpu load 24x7).  So based
on that past experience, I think your approach will prove sound.

cheers.

--
Nathan
