
Re: [pcp] braindump on unified-context / live-logging

To: Greg Banks <gbanks@xxxxxxx>
Subject: Re: [pcp] braindump on unified-context / live-logging
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Tue, 14 Jan 2014 19:58:22 -0500 (EST)
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <52D49301.2000403@xxxxxxx>
References: <20140108013956.GG15448@xxxxxxxxxx> <21198.38090.179929.552608@xxxxxxxxxxxx> <20140110190525.GA28062@xxxxxxxxxx> <0a923e$520gar@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <52D4666E.7030601@xxxxxxx> <21204.32676.163457.438142@xxxxxxxxxxxx> <52D49301.2000403@xxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: fwPb8Tp6R1t3Jd8T+OE7fQVgNUH0RA==
Thread-topic: braindump on unified-context / live-logging
Hi Greg,

----- Original Message -----
> On 13/01/14 16:07, Max Matveev wrote:
> > On Mon, 13 Jan 2014 14:19:26 -0800, Greg Banks wrote:
> >
> >   gnb> While I designed and wrote the thing, I was never happy with any
> >   gnb> of the iterations of the architecture and I wouldn't recommend
> >   gnb> to anyone that they copy it. Some of the problems were:
> >
> >   gnb>   * it was both a client of pmcd and a PMDA, which led to
> >   gnb> interesting deadlocks with the single-threaded pmcd
> >
> > That was the "second" pass with nasavg pmda. I thought there was a
> > first version which only used archives but it had to be abandoned
> > because tailing of archive being written wasn't working reliably.
> >
> 
> Yes, the first design iteration tailed archives and was horribly
> unreliable.  Pmarchive was writing to the various files of an archive in

(pmlogger)

> such a way that there was a race window where the archive reading code
> in libpcp would see an inconsistent archive and barf. Plus, there was an
> inconvenient amount of lag, up to 30 seconds, in pmarchive and in the
> tailer.

OOC, what approaches were tried to address these reliability issues?
Given that the original libpcp design wasn't trying to service this
kind of log access, it's not really surprising it didn't work first
go.  Max's ordered log label update mechanism sounded interesting
- was that implemented and if so, did it improve reliability?

The 30 second lag was possibly down to pmlogger not fflush'ing its
buffered writes, I guess - although I see the code is sprinkled with
fflush calls nowadays.  Was that on IRIX or Linux, OOC?  Some
mechanism for coordinating access (like the pmlc flush command) may
help - was anything attempted there?  If so, do you recall anything
that did or didn't work well?
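
To make the buffering guess above concrete, here is a minimal C sketch.
It is not pmlogger code, and flush_demo() and visible_size() are
hypothetical names made up for illustration.  It assumes a POSIX system
and a fully buffered stdio stream, and shows that a small record is not
visible to another reader of the file until fflush() is called:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Size of the file as any other process (e.g. a tailer) sees it; -1 on error. */
static long visible_size(const char *path)
{
    struct stat sb;

    return stat(path, &sb) == 0 ? (long)sb.st_size : -1;
}

/*
 * Hypothetical demo: write one small record through a fully buffered
 * stdio stream and report how many bytes a reader could see before and
 * after fflush().  Returns 0 on success, -1 on error.
 */
static int flush_demo(const char *path, long *before, long *after)
{
    static char buf[BUFSIZ];            /* must outlive the stream */
    const char record[] = "sample record\n";
    size_t len = strlen(record);
    FILE *fp;

    assert(path != NULL && before != NULL && after != NULL);

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* Full buffering: data stays in user space until the buffer fills
     * or the stream is flushed or closed. */
    if (setvbuf(fp, buf, _IOFBF, sizeof(buf)) != 0 ||
        fwrite(record, 1, len, fp) != len) {
        fclose(fp);
        return -1;
    }

    *before = visible_size(path);       /* record still sits in buf */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    *after = visible_size(path);        /* record handed to the kernel */

    return fclose(fp) == 0 ? 0 : -1;
}
```

A writer that only flushes when its buffer fills, or at some fixed
interval, would show up to a tailer as lag of about that size, which is
why an explicit flush request from the reader's side looks attractive.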

thanks Greg!

--
Nathan
