[pcp] Log rotation issue

Nathan Scott nscott at aconex.com
Mon Mar 30 17:21:39 CDT 2009


----- "Ken McDonell" <kenj at internode.on.net> wrote:

> The message more than likely means the label record has not yet been
> written because the first dummy pmlogger record (pmcd.pmlogger.host,
> pmcd.pmlogger.port and pmcd.pmlogger.archive) has not yet been
> written.
> There is a very small window between the fopen() and the fwrite() in
> pmlogger. The existence of the archive files is checked in the guard
> and retry loop of pmnewlog (which is where mkaf is likely being called
> from in this scenario) so you'd have to be hitting that pmlogger
> window.

Just looking at pmnewlog... we're talking about the chunk of code at
around line 600 - with the up-to-ten second loop - I guess that part
succeeds, we just get these spurious warnings from early mkaf failures.
But we have done a successful pmlc connect just prior to the loop, so
it would seem to help if we could force out the log labels before we
accept any pmlc connections.
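
For illustration only, a reader-side guard of the kind discussed above
might look like the sketch below. It is written in C rather than as the
shell that pmnewlog uses, and it is not the pmnewlog code: the function
name wait_for_min_size, the byte threshold and the one-second retry
interval are all assumptions made for the example.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll until `path` exists and holds at least
 * `min_bytes` bytes, trying at most `max_tries` times with a one
 * second pause between attempts.
 * Returns 0 once the file is large enough, -1 if it never is.
 */
int wait_for_min_size(const char *path, off_t min_bytes, int max_tries)
{
    struct stat sb;
    int i;

    for (i = 0; i < max_tries; i++) {
        /* Existence alone is not enough: the file may still be empty. */
        if (stat(path, &sb) == 0 && sb.st_size >= min_bytes)
            return 0;
        if (i + 1 < max_tries)
            sleep(1);   /* give the writer time to get the label out */
    }
    return -1;
}
```

A check like this only narrows the window from the reader's side; it
does not remove the gap between the writer's fopen() and fwrite().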

Should pmlogger be doing a fflush and/or fsync after the fopen/fwrite
to remove this race?
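
A minimal sketch of the kind of change being asked about, assuming a
stdio-based writer. This is not the actual pmlogger or libpcp code: the
demo_label struct and write_label_flushed() are hypothetical names used
only for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for an archive label; not the libpcp layout. */
struct demo_label {
    int  magic;
    char hostname[64];
};

/*
 * Write the label and push it out before returning, so that a process
 * opening the file afterwards does not find it empty.
 * Returns 0 on success, -1 on any failure.
 */
int write_label_flushed(FILE *f, const struct demo_label *lp)
{
    if (fwrite(lp, sizeof(*lp), 1, f) != 1)
        return -1;
    if (fflush(f) != 0)         /* stdio buffer -> kernel */
        return -1;
    if (fsync(fileno(f)) != 0)  /* kernel -> stable storage */
        return -1;
    return 0;
}
```

Note that fflush() alone is what makes the data visible to other
processes on the same host; fsync() additionally covers durability
across a crash, at the cost of a synchronous disk write.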

> I am guessing this is a logger farm, with lots of pmloggers being
> turned over at the same time.

Yep... between ten and twenty hosts per farm.  5 or 6 farms atm.

> Anyway, the Latest archive folio being messed up is not going to have
> any long-term bad effects ... just as long as the archives look ok in
> the morning.
> 
> Fixing this in pmnewlog is not too difficult, but testing the fix is

In pmnewlog or pmlogger/libpcp?  I'd have thought the latter?

> going to be tricky, as you (Nathan) are the only one with an apparent
> non-deterministic test environment ... 8^)>

Happy to test stuff and/or add the flushing calls into pmlogger/libpcp
if you think they will help.  I seem to hit it relatively frequently on
some hosts, so we should be able to build up a fair level of confidence
with any code change relatively quickly here.

thanks!

-- 
Nathan


