Hi, Ken -
> I'm trying to diagnose a systemic failure in qa/666 that seems to have crept
> in over the past month or two.
> At step 8. (recheck the directories past retain/merge) we're seeing 5
> archives remaining, not 2 as expected. [...]
The key seems to be a couple of archives created by pmloggers whose
processes were killed (with SIGTERM and, failing that, SIGKILL). In
this case the pmloggers left behind 3 archives that pmlogcheck deemed
bad (see the "corrupt" lines in 666.full, and the history of the
relevant archive files), and which were thus not eligible for
merging and related processing.
So, the questions are: why pmlogcheck rejected those files; whether
pmlogger was interrupted with SIGKILL (because it failed to respond to
SIGTERM soon enough - currently 250ms in the code); whether a pmlogger
hit by SIGKILL should be expected to possibly leave its archives
corrupt; and whether the 250ms SIGKILL timeout is too short.
That last one's easy to bump up (pmmgr.cxx:658), and adding a
diagnostic at line :666 could help too.
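
For illustration only, here is a rough sketch of the SIGTERM-then-SIGKILL
pattern with a configurable grace period and a diagnostic on escalation.
It is not the actual pmmgr.cxx code; the name stop_child, the 10ms poll
interval and the message text are all made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from pmmgr.cxx): ask a child to exit with
// SIGTERM, escalate to SIGKILL once `grace` has elapsed.
// Returns true if the child exited without needing SIGKILL.
bool stop_child(pid_t pid, std::chrono::milliseconds grace)
{
    assert(pid > 0);
    using clock = std::chrono::steady_clock;
    int status = 0;

    kill(pid, SIGTERM);
    const auto deadline = clock::now() + grace;
    while (clock::now() < deadline) {
        // WNOHANG polls without blocking; a result equal to pid means
        // the child has exited and been reaped.
        if (waitpid(pid, &status, WNOHANG) == pid)
            return true;
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // One last check, in case the child exited during the final sleep.
    if (waitpid(pid, &status, WNOHANG) == pid)
        return true;

    // Diagnostic: make the escalation visible in the log.
    std::fprintf(stderr, "pid %d still running %lld ms after SIGTERM, sending SIGKILL\n",
                 static_cast<int>(pid), static_cast<long long>(grace.count()));
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);   // block until the child is reaped
    return false;
}
```

With something like this, the grace period is a single argument at the
call site, and the stderr line would tell us directly whether a given
pmlogger was ended by SIGTERM or had to be SIGKILLed.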
- FChE