
Re: [pcp] Prepare to be assimilated^Wanalysed; resistance is futile

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: [pcp] Prepare to be assimilated^Wanalysed; resistance is futile
From: Jeff Hanson <jhanson@xxxxxxx>
Date: Wed, 17 Jul 2013 16:18:21 -0400
Cc: <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <51E6FABB.3020104@xxxxxxxxxxxxxxxx>
References: <1715044262.9523595.1372389213645.JavaMail.root@xxxxxxxxxx> <y0m4ncfiq4h.fsf@xxxxxxxx> <51D08DEE.6030209@xxxxxxxxxxxxxxxx> <406338386.10303545.1372630273147.JavaMail.root@xxxxxxxxxx> <1251717658.10534278.1372672990990.JavaMail.root@xxxxxxxxxx> <20130702160444.GD19454@xxxxxxxxxx> <399367999.12169937.1372810670160.JavaMail.root@xxxxxxxxxx> <y0moba71pao.fsf@xxxxxxxx> <444804824.2373005.1374035342123.JavaMail.root@xxxxxxxxxx> <20130717131537.GA14710@xxxxxxxxxx> <51E6FABB.3020104@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130621 Thunderbird/17.0.7
On 07/17/2013 04:12 PM, Ken McDonell wrote:
> On 17/07/13 23:15, Frank Ch. Eigler wrote:
>>  ..
>> OK, as long as we observe the requirement that we do not accidentally
>> regenerate / modify any files that a sysadmin has created (whether
>> that was by hand or by a prior interactive pm*conf run).
> 
> Just to reinforce the mail exchange Nathan and I had yesterday, I now believe 
> there to be no evidence of regeneration or modification of files in the case 
> I reported that triggered this whole discussion.  I did not follow a 
> sanctioned upgrade path which led to version-mismatched pieces being 
> installed ... the cause was idiot user error, not PCP error.
> 
>>> *nod* - this problem I have seen in real production environments, and it
>>> is sorta-handled in a non-intuitive way - as soon as the remote host goes
>>> away, pmlogger loses the connection and it exits [...]
>>
>> This sounds like an unfortunate policy, if for example there are
>> temporary network glitches or a quick reboot.  A 30-minute re-poll is
>> IMO too slow.
> 
> Let me outline the constraints here, and maybe we can brainstorm a better 
> approach.
> 
> 1. When a host is indeed rebooted, we have to start a new PCP archive ... the 
> hw config may be different (even without human intervention in some HA 
> worlds), instance domains may be different, etc. so all of the PCP metadata 
> needs to be written afresh and the "log once" metrics written to the new 
> archive.
> 
> 2. pmlogger cannot tell the difference between a network outage and remote 
> host reboot so if the connection to pmcd is closed, or a PDU get/put 
> times out, then pmlogger must finish the current PCP archive. But we 
> could/should consider setting $PMCD_REQUEST_TIMEOUT to be something larger 
> than the default 10 seconds as pmlogger in particular is tolerant of
> delayed PDUs coming back from pmcd, so that pmlogger is less exposed to 
> short-term network glitches.
> 

Just a note that Ron Kerry's how-to on setting up PCP (in a wiki at SGI)
suggests 120 seconds for the timeout value.
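
For illustration, here is a minimal sketch of how a wrapper might raise the
timeout for a pmlogger child process only. The helper name pmlogger_env and
the constants are hypothetical; the one thing taken from the thread is the
PMCD_REQUEST_TIMEOUT variable, its 10 second default, and the 120 second
suggestion.

```python
import os

# Values taken from the thread: 10 s is the default mentioned above,
# 120 s is the value suggested in the SGI wiki how-to.
DEFAULT_TIMEOUT_SECONDS = 10
SUGGESTED_TIMEOUT_SECONDS = 120


def pmlogger_env(timeout_seconds=SUGGESTED_TIMEOUT_SECONDS, base_env=None):
    """Return a copy of the environment with PMCD_REQUEST_TIMEOUT set.

    pmlogger_env is a hypothetical helper, not part of PCP. It only
    prepares the environment; starting pmlogger is left to the caller.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Copy so the caller's (or the process's) environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["PMCD_REQUEST_TIMEOUT"] = str(timeout_seconds)
    return env


if __name__ == "__main__":
    env = pmlogger_env()
    print("PMCD_REQUEST_TIMEOUT=" + env["PMCD_REQUEST_TIMEOUT"])
```

The returned dictionary could be passed as the env argument of
subprocess.Popen when launching pmlogger, so the longer timeout applies to
the logger without affecting other PCP clients on the same machine.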

> 3. pmlogger knows nothing of the date+timestamp[+sequence#] naming convention 
> that the scripts around pmlogger use to name the PCP archives
> 
> _______________________________________________
> pcp mailing list
> pcp@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/pcp


-- 
-----------------------------------------------------------------------
Jeff Hanson - jhanson@xxxxxxx - Field Technical Analyst

You can choose a ready guide in some celestial voice.
If you choose not to decide, you still have made a choice.
You can choose from phantom fears and kindness that can kill;
I will choose a path that's clear
I will choose freewill. - Lee/Lifeson/Peart
