----- "Ken McDonell" <kenj@xxxxxxxxxxxxxxxx> wrote:
> The message more than likely means the label record has not yet been
> written because the first dummy pmlogger record (pmcd.pmlogger.host,
> pmcd.pmlogger.port and pmcd.pmlogger.archive) has not yet been
> written.
> There is a very small window between the fopen() and the fwrite() in
> pmlogger. The existence of the archive files is checked in the guard
> and retry loop of pmnewlog (which is where mkaf is likely being called
> from in this scenario) so you'd have to be hitting that pmlogger window.
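
To make the window concrete from the reader's side: a file can exist (the fopen() has happened) while still being too short to hold a label (the fwrite() has not landed yet). Below is a minimal sketch of a guard that waits for the label rather than just for the file. The name wait_for_label and the LABEL_BYTES value are made up for illustration; they are not taken from pmnewlog, mkaf or libpcp, and the real label size would have to come from the archive format.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical minimum size of a complete label record, in bytes. */
#define LABEL_BYTES 128

/*
 * Poll `path` up to `tries` times, one second apart.
 * Returns 1 once the file exists and holds at least LABEL_BYTES,
 * 0 if that has not happened by the last poll.
 */
int wait_for_label(const char *path, int tries)
{
    struct stat sb;

    for (int i = 0; i < tries; i++) {
        /* Existence alone is not enough: an empty file is the race window. */
        if (stat(path, &sb) == 0 && sb.st_size >= LABEL_BYTES)
            return 1;
        if (i + 1 < tries)
            sleep(1);
    }
    return 0;
}
```

Something like wait_for_label(archive_path, 10) would mirror the up-to-ten-second loop, assuming the check is made on file size rather than existence.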
Just looking at pmnewlog... we're talking about the chunk of code at
around line 600, with the up-to-ten-second loop. I guess that part
succeeds, and we just get these spurious warnings from early mkaf failures.
But we have done a successful pmlc connect just prior to the loop, so
it would seem to help if we could force out the log labels before we
accept any pmlc connections.
Should pmlogger be doing a fflush and/or fsync after the fopen/fwrite
to remove this race?
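
Roughly what I have in mind, as a sketch only: write_label_flushed is a made-up helper name, not an existing pmlogger or libpcp function, and the label here is just an opaque buffer. The fflush() is the call that moves the data out of the stdio buffer so other processes can see it; the fsync() only adds durability across a crash, so it may well be unnecessary for this particular race.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() under strict C modes */
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes of label to `f` and push them
 * out before returning. Returns 0 on success, -1 on any failure.
 */
int write_label_flushed(FILE *f, const void *label, size_t len)
{
    if (fwrite(label, 1, len, f) != len)
        return -1;
    /* stdio buffer -> kernel: after this, other processes see the bytes */
    if (fflush(f) != 0)
        return -1;
    /* kernel -> stable storage: durability only, not visibility */
    if (fsync(fileno(f)) != 0)
        return -1;
    return 0;
}
```

The idea would be to call this for each archive file before pmlogger starts accepting pmlc connections, so that a successful pmlc connect implies the labels are readable.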
> I am guessing this is a logger farm, with lots of pmloggers being
> turned over at the same time.
Yep... between ten and twenty hosts per farm. 5 or 6 farms atm.
> Anyway, the Latest archive folio being messed up is not going to have
> any long-term bad effects ... just as long as the archives look ok in
> the morning.
>
> Fixing this in pmnewlog is not too difficult, but testing the fix is
In pmnewlog or pmlogger/libpcp? I'd have thought the latter?
> going to be tricky, as you (Nathan) are the only one with an apparent
> non-deterministic test environment ... 8^)>
Happy to test stuff and/or add the flushing calls into pmlogger/libpcp
if you think they will help. I seem to hit it relatively frequently on
some hosts, so we should be able to build up a fair level of confidence
with any code change relatively quickly here.
thanks!
--
Nathan