----- "Ken McDonell" <kenj@xxxxxxxxxxxxxxxx> wrote:
> OK, I've done some digging on this.
>
> Given the things Nathan has excluded, I think the remaining options
> are:
>
> 1. heavy load (or worse memory thrashing) and pmie is not getting
> run ... unlikely for delays of 20-30 secs
I can confirm that's not the case.
> 2. the waitpid() call in sleepTight() is getting hung up on some
> strange kernel lock
Nor that (these machines are fairly idle, nothing indicating
any kernel resource contention).
> 3. one of the actions is taking a long time to complete
I can't see any of them taking multiple seconds; they're all pretty
simple actions. But I won't say it's impossible - could be that.
> 4. the _previous_ delay was prior to a reconnect retry, the retry
> times out and the _next_ scheduled task is the victim
>
> I'm voting for 4. as the most likely ... Nathan, any sign of one or
> more of the monitoring hosts going down when all of this happened
> (I know the lack of real timestamps make that difficult).
This is likely - pretty sure we've most commonly seen these right
after a PCP upgrade (which sweeps across many machines, upgrading
rpms and restarting pmcds left, right and centre).
> Anyway, the attached patch reduces the need to call waitpid() in
> sleepTight() and will help diagnose the real cause ... Nathan if you
> could apply this and observe what happens I'd be very interested.
OK, will do - thanks!
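
For anyone following along, here's a rough sketch of the general idea
of reaping children without blocking and timing how long the reap
takes. To be clear, this is not the attached patch and not pmie's code:
it's Python rather than C, and the function names and the one-second
warning threshold are made up for illustration. It just shows that
waitpid() with WNOHANG returns straight away when no child has exited,
so a long elapsed time around the call would point at the call itself
rather than at a slow action.

```python
import os
import time


def reap_finished_children():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs; a child killed by a
    signal is reported with the negated signal number.
    """
    reaped = []
    while True:
        try:
            # WNOHANG: return (0, 0) at once if no child has exited yet
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left at all
        if pid == 0:
            break  # children exist, but none of them has exited
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        reaped.append((pid, code))
    return reaped


def timed_reap(warn_after=1.0):
    """Run the reap and report if it took longer than warn_after seconds.

    warn_after is a hypothetical threshold, chosen only for this sketch.
    """
    start = time.monotonic()
    reaped = reap_finished_children()
    elapsed = time.monotonic() - start
    if elapsed > warn_after:
        print(f"reap took {elapsed:.3f}s for {len(reaped)} children")
    return reaped, elapsed
```

Calling timed_reap() from the scheduling loop, instead of a blocking
wait, would keep a stuck child from holding up the next scheduled task,
assuming the loop has somewhere sensible to call it.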
cheers.
--
Nathan