Re: Suggested way of monitoring processes?

To: Alan Bailey <abailey@xxxxxxxxxxxxx>, Nathan Scott <nathans@xxxxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Suggested way of monitoring processes?
From: kenmcd@xxxxxxxxxxxxxxxxx
Date: Thu, 9 Nov 2000 17:10:25 +1100 (EST)
Cc: pcp@xxxxxxxxxxx
In-reply-to: <10011081002.ZM116762@wobbly.melbourne.sgi.com>
Reply-to: kenmcd@xxxxxxxxxxxxxxxxx
Sender: owner-pcp@xxxxxxxxxxx
On Tue, 7 Nov 2000, Alan Bailey wrote:
>
> I've been messing around with pmie now.  As a first try, I'm writing a
> little rule to monitor an sshd process.  Here it is:
> 
> delta = 3 seconds;
> sshd =
> some_inst match_inst "sshd" (
>   proc.psinfo.pid > 0
> ) -> shell 60 seconds "echo 'it exists' | mail -s 'it exists' abailey"
>
> ...
>
> So, there are ?'s appearing in places where I think they shouldn't.
> First, do ?'s occur when the instance that I'm looking for does not exist?
> Why is there always one during each transition, and why do they appear in
> the middle of streams of 'true's?
> 
> Does anyone have any insight in this problem, and possibly how I could get
> around it?

On Wed, 8 Nov 2000, Nathan Scott wrote:
>
> hi Alan,
> ...
> in this case, what i think is happening (from some experiments
> using "sleep" in place of "sshd") is that whenever the set of
> instances coming back from match_inst changes, pmie throws its
> hands up in disgust, resets itself for the next metric fetch
> and gives up on the current one.
> 
> this (i believe, Ken knows this code better than i do ;) is why
> we get one '?' after each state change (sshd stop/start) and then
> good data.  i don't really agree this is the correct behavior for
> this situation, but i'll defer to Ken - perhaps there's something
> i've missed.

pmie is a brave little camper, but sometimes the semantics of the rule
evaluation are so complex that it must abandon partial results and cached
state and start again when the set of instances for a particular metric
is found to change (this is the technical version of "throws its hands up
in disgust").  I've tried more aggressive schemes but they unfortunately
produce incorrect results for more complex predicates and/or metrics
with different semantics.

In Alan's case
        true - means I am sure the predicate is true
        false - means I am sure the predicate is false
        ? - means I am not sure

Fortunately, in all the circumstances I've analyzed, "not sure" is
transient, and the vast majority of rule evaluations unambiguously
return either true or false.
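
To make the three outcomes concrete, here is a toy Python model of the behaviour described in this thread. It is not pmie's code: the function name `evaluate` and the rule "any change in the instance set yields one '?'" are simplifying assumptions based only on the description above, and real rule evaluation is considerably more involved.

```python
def evaluate(fetches):
    """Return 'true', 'false' or '?' for each fetch.

    Each fetch is the set of pids matching the process name at one
    evaluation. Assumption (toy model only): whenever the set differs
    from the previous fetch, the evaluation is abandoned and reported
    as '?'; otherwise the predicate is true if any pid is present.
    """
    results = []
    previous = None
    for pids in fetches:
        if previous is not None and pids != previous:
            results.append("?")        # instance set changed: not sure
        elif pids:
            results.append("true")     # stable set, process present
        else:
            results.append("false")    # stable set, process absent
        previous = pids
    return results


# sshd running, then killed, then restarted with a new pid (made-up pids)
history = [{101}, {101}, set(), set(), {202}, {202}]
print(evaluate(history))
```

Under these assumptions each stop or start produces exactly one '?' before the result settles.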

> so, i don't think theres any situation where pmie is lying to you,
> its just a little indecisive at times :-)... it may be possible to
> improve this.  for the purpose of tracking long-running processes
> this shouldn't be too much of a problem (with relatively small
> metric fetch deltas), but its certainly annoying though.

Remember the design goal for PCP 7+ years ago was multiple distributed
hosts each with 100+ CPUs and 1+ Terabyte of disk ... to manage this
sort of environment we've consistently opted for scalability over
micro accuracy, because these large and complex systems cannot be
turned around quickly ... in this environment, instance domains change
infrequently, and so the protocols and architecture are biased towards
this state of affairs.

If the number and/or pids of processes matching the name sshd varies
dramatically in your production environment, then there are other
solutions that can be applied (this is an ideal fit for the shping PMDA)
... let me know if this is the case. 

Note that if _all_ the sshd processes die (the case that is really of
interest I presume) the _worst_ sequence you will see is:

        true
        ?
        false

so the detection is delayed for at most two pmie rule evaluation
intervals (and on average 1.5 times the evaluation interval, or 4.5 sec
in your example).
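
The arithmetic behind those figures can be checked with a short Python sketch. It assumes evaluations fire at exact multiples of delta, that the processes die at a uniformly random moment within an interval, and that the first evaluation after the death reports '?' and the next one reports false. The function names are made up for this illustration.

```python
import random


def detection_delay(death_time, delta):
    """Seconds from the death of the processes until 'false' is reported.

    Assumes evaluations at t = 0, delta, 2*delta, ...; the first
    evaluation strictly after the death yields '?', the next 'false'.
    """
    first_after = int(death_time // delta) + 1   # evaluation that yields '?'
    false_time = (first_after + 1) * delta       # evaluation that yields 'false'
    return false_time - death_time


def average_delay(delta, trials=100_000, seed=0):
    """Monte Carlo estimate of the mean delay for uniform death times."""
    rng = random.Random(seed)
    total = sum(detection_delay(rng.uniform(0, delta), delta)
                for _ in range(trials))
    return total / trials


delta = 3.0                            # "delta = 3 seconds" from the rule
print(detection_delay(0.0, delta))     # worst case: two full intervals
print(average_delay(delta))            # close to 1.5 * delta
```

With delta = 3 the delay ranges from just over 3 sec to 6 sec, and the estimate settles near 4.5 sec.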

