OK, I've done some digging on this.
Given the things Nathan has excluded, I think the remaining options are:
1. heavy load (or worse memory thrashing) and pmie is not getting
run ... unlikely for delays of 20-30 secs
2. the waitpid() call in sleepTight() is getting hung up on some strange
kernel lock
3. one of the actions is taking a long time to complete
4. the _previous_ delay was prior to a reconnect retry, the retry
timesout and the _next_ scheduled task is the victim
I'm voting for 4. as the most likely ... Nathan, any sign of one or more
of the monitoring hosts going down when all of this happened (I know the
lack of real timestamps make that difficult).
Any way, the attached patch reduces the need to call waitpid() in
sleepTight() and will help diagnose the real cause ... Nathan if you
could apply this and observe what happens I'd be very interested.
On Thu, 2010-07-22 at 12:05 +1000, nathans@xxxxxxxxxx wrote:
> Hi guys,
>
> ----- "Ken McDonell" <kenj@xxxxxxxxxxxxxxxx> wrote:
>
> > I suspect we can solve this problem with a large hammer that will
> > work on all platforms (Greg's suggestion won't apparently work on
> > Mac OS X).
> >
> > If the time ever goes backwards (not a small ntp adjustment, I mean
> > now - last_delay is < now at the last sample, so time went back more
> > than 2*last_delay), there is no point keeping any history and any
> > notion of a "rate" over the last interval is pretty wobbly.
> > ..
>
> We've now observed a few occurrences of the log entries below in our
> production environment... this is a localhost pmie (i.e. running on
> the host its monitoring, no pmcd restart or anything here ... it just
> spontaneously fails, occassionally, afaict):
>
> Log for pmie on [HOST] rotated Thu Jul 22 00:14:01 2010
>
> pmie: PID = 25944, default host = [HOST]
>
> Thu Jul 22 08:14:13 2010: High load: 5.8 seconds/seconds CPU time
> sleepTight: negative delay (-14.907716). sched=1279750458.049223,
> cur=1279750472.956939
> sleepTight: invalid args: -14 -907716035
> sleepTight: negative delay (-9.929606). sched=1279750463.049223,
> cur=1279750472.978829
> sleepTight: invalid args: -9 -929605960
> sleepTight: negative delay (-4.950649). sched=1279750468.049223,
> cur=1279750472.999872
> sleepTight: invalid args: -4 -950649023
> Thu Jul 22 09:50:33 2010: High load: 4.35 seconds/seconds CPU time
> sleepTight: negative delay (-17.284786). sched=1279756238.049223,
> cur=1279756255.334009
> sleepTight: invalid args: -17 -284785985
> sleepTight: negative delay (-12.302463). sched=1279756243.049223,
> cur=1279756255.351686
> sleepTight: invalid args: -12 -302463054
> sleepTight: negative delay (-7.319972). sched=1279756248.049223,
> cur=1279756255.369195
> sleepTight: invalid args: -7 -319972038
> sleepTight: negative delay (-2.337748). sched=1279756253.049223,
> cur=1279756255.386971
> sleepTight: invalid args: -2 -337748050
> Thu Jul 22 09:55:48 2010: Viewer conversion completed, cleaning up
> Thu Jul 22 09:56:48 2010: Key Lock wait time 1.48 6.9 1.17 3.75 0.0032164
> sleepTight: negative delay (-2.482367). sched=1279756613.049223,
> cur=1279756615.531590
> sleepTight: invalid args: -2 -482367038
> Thu Jul 22 10:32:48 2010: Key Lock wait time 4.88 2.28 0.97 8.9 2.72
> sleepTight: negative delay (-2.600906). sched=1279758773.049223,
> cur=1279758775.650129
> sleepTight: invalid args: -2 -600906133
> Thu Jul 22 10:51:18 2010: High load: 4.25 seconds/seconds CPU time
> sleepTight: negative delay (-21.399701). sched=1279759883.049223,
> cur=1279759904.448924
> sleepTight: invalid args: -21 -399701118
> sleepTight: negative delay (-16.418470). sched=1279759888.049223,
> cur=1279759904.467693
> sleepTight: invalid args: -16 -418470144
> sleepTight: negative delay (-11.437375). sched=1279759893.049223,
> cur=1279759904.486598
> sleepTight: invalid args: -11 -437375068
> sleepTight: negative delay (-6.456028). sched=1279759898.049223,
> cur=1279759904.505251
> sleepTight: invalid args: -6 -456027984
> sleepTight: negative delay (-1.475107). sched=1279759903.049223,
> cur=1279759904.524330
> sleepTight: invalid args: -1 -475106954
>
> Its a shame theres no timestamp with those sleepTight lines. This is on
> 3.3.3.
> This host is using ntp, theres no way this is some wierd large intermittent
> multi-second changes in the system clock or anything like that.
>
> cheers.
>
patch.pcp
Description: Text Data
|