Re: pmmgr and qa/666

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: pmmgr and qa/666
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Tue, 14 Apr 2015 11:32:16 -0400
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <009801d07567$cb543880$61fca980$@internode.on.net>
References: <009801d07567$cb543880$61fca980$@internode.on.net>
User-agent: Mutt/1.4.2.2i

Hi, Ken -

> I'm trying to diagnose a systemic failure in qa/666 that seems to have crept
> in over the past month or two.
> At step 8. (recheck the directories past retain/merge) we're seeing 5
> archives remaining, not 2 as expected. [...]

The key seems to be a couple of archives created by pmloggers whose
processes were killed (with SIGTERM and then, failing that, SIGKILL).
In this case, the pmloggers left behind 3 archives that pmlogcheck
deemed bad (see the "corrupt" lines in 666.full, and the history of
the relevant archive files), and which were thus not eligible for
merging/etc. processing.

So, the questions are: why those files were rejected by pmlogcheck;
whether pmlogger was interrupted with SIGKILL (because it failed to
respond to SIGTERM soon enough - 250ms in the code atm); whether a
SIGKILLed pmlogger should be expected to possibly leave its archives
corrupt; and whether the 250ms SIGKILL timeout is too short.

(That last one's easy to bump up, at pmmgr.cxx:658, and adding a
diagnostic at line :666 could help too.)
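
For reference, the SIGTERM-then-SIGKILL sequence discussed above can be
sketched roughly as follows.  This is only a minimal illustration, not
the actual pmmgr.cxx code: stop_child, StopResult and the 10ms polling
interval are made-up names/values, and it assumes the pmlogger is a
direct child of the caller so that waitpid() can reap it.  The grace
period is a parameter, so the 250ms value is easy to vary, and a
diagnostic is printed whenever the SIGKILL fallback is actually used.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <iostream>
#include <thread>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type for this sketch; not a pmmgr name.
enum class StopResult { ExitedOnTerm, NeededKill, Error };

// Send SIGTERM to a child process, wait up to `grace` for it to exit,
// then fall back to SIGKILL.  Assumes `pid` is a direct child.
StopResult stop_child(pid_t pid, std::chrono::milliseconds grace)
{
    assert(pid > 0);  // kill() with pid <= 0 would signal a process group

    if (kill(pid, SIGTERM) != 0)
        return StopResult::Error;

    const auto deadline = std::chrono::steady_clock::now() + grace;
    int status = 0;
    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);  // non-blocking reap
        if (r == pid)
            return StopResult::ExitedOnTerm;       // gone within the grace period
        if (r < 0 && errno != EINTR)
            return StopResult::Error;
        if (std::chrono::steady_clock::now() >= deadline)
            break;
        // Assumed polling interval, chosen arbitrarily for the sketch.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }

    // Grace period expired: report it, then escalate.
    std::cerr << "pid " << pid << " ignored SIGTERM for "
              << grace.count() << "ms, sending SIGKILL" << std::endl;
    if (kill(pid, SIGKILL) != 0)
        return StopResult::Error;
    while (waitpid(pid, &status, 0) < 0) {         // blocking reap
        if (errno != EINTR)
            return StopResult::Error;
    }
    return StopResult::NeededKill;
}
```

Calling stop_child(pid, std::chrono::milliseconds(250)) would mirror
the current timeout; a NeededKill result marks the cases where it would
be worth checking the archives for damage.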

- FChE
