
Re: [pcp] pmie and system clock changes

To: kenj@xxxxxxxxxxxxxxxx
Subject: Re: [pcp] pmie and system clock changes
From: nathans@xxxxxxxxxx
Date: Tue, 27 Jul 2010 10:04:26 +1000 (EST)
Cc: pcp@xxxxxxxxxxx, Martin Hicks <mort@xxxxxxx>
In-reply-to: <720719558.1243211280188742893.JavaMail.root@xxxxxxxxxxxxxxxxxx>
Sender: nscott@xxxxxxxxxx
----- "Ken McDonell" <kenj@xxxxxxxxxxxxxxxx> wrote:

> OK, I've done some digging on this.
> 
> Given the things Nathan has excluded, I think the remaining options
> are:
> 
> 1. heavy load (or worse memory thrashing) and pmie is not getting
> run ... unlikely for delays of 20-30 secs

I can confirm that's not the case.

> 2. the waitpid() call in sleepTight() is getting hung up on some
> strange kernel lock

Nor that (these machines are fairly idle, nothing indicating
any kernel resource contention).

> 3. one of the actions is taking a long time to complete

I can't see one of them taking multiple seconds, they're all pretty
simple actions.  But, I won't say it's impossible, could be that.

> 4. the _previous_ delay was prior to a reconnect retry, the retry
> times out and the _next_ scheduled task is the victim
> 
> I'm voting for 4. as the most likely ... Nathan, any sign of one or
> more of the monitoring hosts going down when all of this happened
> (I know the lack of real timestamps makes that difficult).

This is likely - pretty sure we've most commonly seen these right
after a pcp upgrade (which sweeps across many machines, upgrading
rpms and restarting pmcd's left right and centre).

> Anyway, the attached patch reduces the need to call waitpid() in
> sleepTight() and will help diagnose the real cause ... Nathan if you
> could apply this and observe what happens I'd be very interested.

OK, will do - thanks!

cheers.

-- 
Nathan
