
Re: [pcp] Log rotation issue

To: Nathan Scott <nscott@xxxxxxxxxx>
Subject: Re: [pcp] Log rotation issue
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Tue, 31 Mar 2009 18:29:07 +1100
Cc: pcp@xxxxxxxxxxx
In-reply-to: <721112441.1850331238451699804.JavaMail.root@xxxxxxxxxxxxxxxxxx>
References: <721112441.1850331238451699804.JavaMail.root@xxxxxxxxxxxxxxxxxx>
Reply-to: kenj@xxxxxxxxxxxxxxxx

On Tue, 2009-03-31 at 09:21 +1100, Nathan Scott wrote:
> ----- "Ken McDonell" <kenj@xxxxxxxxxxxxxxxx> wrote:
> 
> > The message more than likely means the label record has not yet been
> > written because the first dummy pmlogger record (pmcd.pmlogger.host,
> > pmcd.pmlogger.port and pmcd.pmlogger.archive) has not yet been
> > written.
> > There is a very small window between the fopen() and the fwrite() in
> > pmlogger. The existence of the archive files is checked in the guard
> > and retry loop of pmnewlog (which is where mkaf is likely being called
> > from in this scenario) so you'd have to be hitting that pmlogger window.
> 
> Just looking at pmnewlog... we're talking about the chunk of code at
> around line 600 - with the up-to-ten-second loop - I guess that part
> succeeds, we just get these spurious warnings from early mkaf failures.
> But we have done a successful pmlc connect just prior to the loop, so
> it would seem to help if we could force out the log labels before we
> accept any pmlc connections.

Yes.

> Should pmlogger be doing a fflush and/or fsync after the fopen/fwrite
> to remove this race?

I don't think pmlogger gets a chance ... until the first fetch has been
done and the first pmResult is ready to go to the archive, it cannot
write the label record (which is the missing piece of magic here).

I've been able to reproduce the symptom with a hard loop test case
launching pmlogger and creating a folio in two shell scripts (not really
suitable for pcpqa I'm afraid).  This confirms my theory, so I believe
the attached patch for pmnewlog should fix the problem.

This patch passes qa, as in ...
        $ check -g pmnewlog -g logutil -x remote
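
For readers without the attachment, the general shape of a guard that closes
this kind of window can be sketched as below.  This is not the attached patch,
only a minimal illustration under stated assumptions: wait_for_nonempty is a
hypothetical helper name, and "file exists and is non-empty" is used as a
stand-in for "the label record has been written", which may be weaker than the
check the real fix needs.

```shell
#!/bin/sh
# Hypothetical helper (not from pmnewlog): poll until every named file
# exists and is non-empty, trying at most $1 times, one second apart.
# Returns 0 once all files are ready, 1 if the tries run out.
wait_for_nonempty() {
    max_tries=$1
    shift
    tries=0
    while [ "$tries" -lt "$max_tries" ]
    do
        all_ready=true
        for f in "$@"
        do
            # -s is true only if the file exists and has size > 0;
            # a file that has been opened but not yet written fails this.
            [ -s "$f" ] || all_ready=false
        done
        if $all_ready
        then
            return 0
        fi
        sleep 1
        tries=$((tries + 1))
    done
    return 1
}
```

A caller would run something like wait_for_nonempty 10 on the new archive's
files and only build the folio once it returns 0, rather than as soon as the
files merely exist.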

> > I am guessing this is a logger farm, with lots of pmloggers being
> > turned over at the same time.
> 
> Yep... between ten and twenty hosts per farm.  5 or 6 farms atm.
> 
> > Anyway, the Latest archive folio being messed up is not going to have
> > any long-term bad effects ... just as long as the archives look ok in
> > the morning.
> > 
> > Fixing this in pmnewlog is not too difficult, but testing the fix is
> 
> In pmnewlog or pmlogger/libpcp?  I'd have thought the latter?

No, it is a pmnewlog problem IMO (see above).

> > going to be tricky, as you (Nathan) are the only one with an apparent
> > non-deterministic test environment ... 8^)>
> 
> Happy to test stuff and/or add the flushing calls into pmlogger/libpcp
> if you think they will help.  I seem to hit it relatively frequently on
> some hosts, so we should be able to build up a fair level of confidence
> with any code change relatively quickly here.
> 
> thanks!
> 

Attachment: patch
Description: Text Data
