
Re: Prepare to be assimilated^Wanalysed; resistance is futile

To: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Subject: Re: Prepare to be assimilated^Wanalysed; resistance is futile
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Wed, 17 Jul 2013 00:29:02 -0400 (EDT)
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <y0moba71pao.fsf@xxxxxxxx>
References: <1715044262.9523595.1372389213645.JavaMail.root@xxxxxxxxxx> <y0m4ncfiq4h.fsf@xxxxxxxx> <51D08DEE.6030209@xxxxxxxxxxxxxxxx> <406338386.10303545.1372630273147.JavaMail.root@xxxxxxxxxx> <1251717658.10534278.1372672990990.JavaMail.root@xxxxxxxxxx> <20130702160444.GD19454@xxxxxxxxxx> <399367999.12169937.1372810670160.JavaMail.root@xxxxxxxxxx> <y0moba71pao.fsf@xxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: zyTrGSiAc0vjAIeWmiCSv4rjn0Dnkg==
Thread-topic: Prepare to be assimilated^Wanalysed; resistance is futile

----- Original Message -----
> 
> nathans wrote:
> 
> > [...]
> > this server process would not need to run pmlogconf/pmieconf, I think.
> 
> Considering kenj's problems, I believe that running pm*conf from the
> cron FOO_check.sh is not a good idea after all (and have a patch in
> pcpfans.git fche/dev to take that part back out).  So that suggests

I think we should take caution from Ken's experience, for sure.  It has
definitely shown how quickly problems can propagate outward and affect
many hosts - changes in this area need to be well tested.  I'm not happy
with throwing in the towel on generating good configuration files by
default, though.

Your proposed commit has a bit of a problem - it has removed the only
way of generating config.default files, yet it continues to refer to
them for the local logger and pmie entries.

> > It'd just update the control file(s) and the crontab-driven existing
> > pm{ie,logger}_check functionality takes it from there.  With that
> > control.d addition, it'd just be creating a one-line file for each
> > new host found, in the /etc/pcp/{pmie,pmlogger}/control.d directory.
> 
> There are at least two problems with this scheme.
> 
> First, the _check* scripts run too infrequently.  For a machine that

That is configurable though - IIRC, in the Aconex production environment
they were being run every five minutes or so, and this proved just fine
for them.  Different people have different requirements - lots of people
are happy with sysstat sar's sampling once every ten minutes.

The crontab file lives below /etc and local customisations can be made
to suit each environment.  The *check scripts are not overly expensive,
based on production experience with them over many years (using tens of
monitored hosts per pmlogger/pmie controller host, which I guess is about
the size Avahi might be used to configure).  Beyond that, I think people
would be using puppet/chef/mcollective/... anyway.
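
As a rough illustration of the kind of local customisation meant here,
the sketch below builds a system crontab line (the /etc/cron.d style,
which has a user field) that runs a check script every N minutes.  The
helper name, the "pcp" user and the script path are assumptions made up
for the example; the real path and user differ between installations.

```python
def cron_entry(interval_minutes, user, command):
    """Build one system-crontab line running `command` every N minutes.

    Hypothetical helper for illustration only, not part of PCP.
    """
    # The "*/N" step syntax only gives even spacing when N divides 60.
    if not 1 <= interval_minutes <= 59 or 60 % interval_minutes != 0:
        raise ValueError("interval must be a divisor of 60 below 60")
    minute = "*" if interval_minutes == 1 else f"*/{interval_minutes}"
    # Fields: minute hour day-of-month month day-of-week user command
    return f"{minute} * * * * {user} {command}"


if __name__ == "__main__":
    # Assumed install path and user; adjust for the local system.
    print(cron_entry(5, "pcp", "/usr/libexec/pcp/bin/pmlogger_check"))
```

With a five minute interval, the worst-case delay before a newly listed
host is picked up is roughly five minutes rather than thirty.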

> comes up, we'd like to start logging it within (say) seconds, rather
> than up to 30 minutes.  (This could be worked around by hand-invoking
> the _check* routine upon the arrival of new hosts, though then we have
> a lot more cpu consumption, and a lot more busy-work checking on other
> pmloggers.)

Not convinced it's going to cost a whole lot - new hosts do not arrive
that often, this is a once-in-a-while thing - so your poke-it-directly
solution above would indeed work in practice.

> Second, there is nothing that handles the disappearance of remote
> nodes, or equivalently, a sysadmin commenting out lines in
> pm{logger|ie}/config.default.  The _check* scripts may notice them but
> don't consider it their problem to kill them.

*nod* - this problem I have seen in real production environments, and it
is sorta-handled in a non-intuitive way - as soon as the remote host goes
away, pmlogger loses the connection and exits (the check script keeps on
trying to restart it from the control file, which never succeeds, but
that is more of an annoyance than a big problem).

Another corner case to worry about is a pmlogger entry that was in the
control file, but was later removed (by a sysadmin) - that process is no
longer tracked and will not be stopped or log-rotated.  In the Aconex
environment, this potential issue was combated via a dead-hand timer
approach using the -T option to pmlogger.
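
To make the dead-hand idea concrete, here is a minimal sketch that
writes a one-line control entry for a remote host, with a -T end time
in the pmlogger arguments.  The helper names, the directory layout and
the 24h10m lifetime are assumptions for illustration; the line follows
the usual control-file layout of host, primary flag, socks flag,
directory, then pmlogger arguments.

```python
from pathlib import Path


def control_line(host, log_dir, config="config.default", lifetime="24h10m"):
    """Return one pmlogger control-file entry for a remote host.

    Hypothetical helper; the lifetime value is an assumed example.
    """
    # -T makes pmlogger exit by itself after `lifetime`, so a logger whose
    # entry was removed from the control file does not run forever.
    args = f"-r -T{lifetime} -c {config}"
    # Fields: host, primary logger? (n), socks? (n), directory, arguments
    return f"{host} n n {log_dir}/{host} {args}"


def write_control_entry(control_dir, host, **kwargs):
    """Write the entry as a one-line file named after the host."""
    path = Path(control_dir) / host
    path.write_text(control_line(host, **kwargs) + "\n")
    return path


if __name__ == "__main__":
    print(control_line("web01", "/var/log/pcp/pmlogger"))
```

A logger still listed in the control file would be restarted by the
next check run after its timer expires, while an orphaned one simply
goes away.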


Moving along with all this, my current thinking is to continue testing
the code in the dev branch, and to use that as the basis of the next
release.  Keep in mind we have safeguards in place there - if for some
reason we find the on-by-default pmlogconf/pmieconf invocations to be
disastrous, we can always fall back to putting in place hand-made,
non-pmlogconf and non-pmieconf config.default files (even simply using
a quick specfile update, for example).

cheers.

--
Nathan
