
Re: [pcp] braindump on unified-context / live-logging

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: [pcp] braindump on unified-context / live-logging
From: Greg Banks <gbanks@xxxxxxx>
Date: Tue, 14 Jan 2014 18:07:08 -0800
Cc: pcp developers <pcp@xxxxxxxxxxx>
On 14/01/14 16:58, Nathan Scott wrote:
> Hi Greg,
>
> ----- Original Message -----
>> On 13/01/14 16:07, Max Matveev wrote:
>>> On Mon, 13 Jan 2014 14:19:26 -0800, Greg Banks wrote:
>>>
>>>    gnb> While I designed and wrote the thing, I was never happy with any
>>>    gnb> of the iterations of the architecture and I wouldn't recommend
>>>    gnb> to anyone that they copy it. Some of the problems were:
>>>
>>>    gnb>   * it was both a client of pmcd and a PMDA, which led to
>>>    gnb>     interesting deadlocks with the single-threaded pmcd
>>>
>>> That was the "second" pass, with the nasavg pmda. I thought there was a
>>> first version which only used archives, but it had to be abandoned
>>> because tailing of an archive being written wasn't working reliably.

>> Yes, the first design iteration tailed archives and was horribly
>> unreliable.  Pmarchive was writing to the various files of an archive in
>
> (pmlogger)

Yep.  That squeaky sound you hear is mental rusty hinges.

>> such a way that there was a race window where the archive reading code
>> in libpcp would see an inconsistent archive and barf.  Plus, there was an
>> inconvenient amount of lag, up to 30 seconds, in pmarchive and in the
>> tailer.
>
> OOC, what approaches were tried to address these reliability issues?
> Given that the original libpcp design wasn't trying to service this
> kind of log access, it's not really surprising it didn't work first
> go.  Max's ordered log label update mechanism sounded interesting
> - was that implemented and if so, did it improve reliability?

Ken did something to the guts of pmlogger which made it write the files in the correct order, with an fsync(). It worked and I think it was checked in, but I'm not sure. We couldn't use it because our design relied on a stock PCP and there was no way to ship an update.
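
(For anyone reconstructing this: the ordering idea is roughly the sketch below. It's my illustration, not Ken's actual patch, and the descriptors and helper name are made up; the point is just that the index entry is only published after the data and metadata it refers to have been fsync()'d, so a tailer that trusts the index never sees a dangling reference.)

    /* Hypothetical sketch of reader-safe archive update ordering:
     * write the new metadata and data records first, make them durable,
     * and only then publish the temporal index entry pointing at them. */
    #include <unistd.h>

    static int publish_record(int meta_fd, int data_fd, int index_fd,
                              const void *meta, size_t metalen,
                              const void *data, size_t datalen,
                              const void *index_entry, size_t indexlen)
    {
        if (write(meta_fd, meta, metalen) != (ssize_t)metalen ||
            write(data_fd, data, datalen) != (ssize_t)datalen)
            return -1;
        /* barrier: metadata and data must be on disk before the index */
        if (fsync(meta_fd) < 0 || fsync(data_fd) < 0)
            return -1;
        if (write(index_fd, index_entry, indexlen) != (ssize_t)indexlen)
            return -1;
        return fsync(index_fd);
    }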

Plus it only solved half the problem, the lag being the other half.

Plus again, it meant that any metric we wanted to present an average for had to be logged at sufficient frequency to make the average reasonably responsive; that frequency is a lot more than you really want in historical records, and it chewed up a lot of archive disk space. Disk space was always a problem; we spent a lot of time wrestling with which metrics pmlogger logged and at what frequencies.
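
To put rough numbers on that (every constant below is an illustrative assumption, not a measurement from our appliance): if the averager needs 5-second samples but the history only needs 1-minute samples, you carry roughly a 12x inflation in archive volume.

    /* back-of-envelope archive growth at two logging intervals */
    #include <stdio.h>

    int main(void)
    {
        const double values_per_sample = 1000;  /* metric values logged */
        const double bytes_per_value = 20;      /* rough on-disk cost */
        const double seconds_per_day = 86400;
        double history = values_per_sample * bytes_per_value * seconds_per_day / 60;
        double live    = values_per_sample * bytes_per_value * seconds_per_day / 5;
        printf("60s logging: %5.1f MB/day\n", history / 1e6);
        printf(" 5s logging: %5.1f MB/day (%.0fx)\n", live / 1e6, live / history);
        return 0;
    }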




> The 30 second lag will possibly be a lack of pmlogger fflush'ing its
> buffered writes I guess - although I see the code is sprinkled with
> them nowadays.

That, plus the polling time in the tailing process and the polling time in pmlogger itself.

> Was that on IRIX or Linux, OOC?

Both, IIRC.

> Some coordination mechanism (like the pmlc flush command) for
> coordinating access may help - was anything attempted there?  If so,
> did anything work or not work well that you recall?


So using pmlc flush would reduce one (the largest) of the three sources of lag, but not the other two. And see my comments above on disk space.


I took a look at the comments in the historic source code (nostalgia, heh). Some of the problems mentioned are:

* libpcp reads an archive's metadata at open time only, so if you're reading an archive before it's been finished and new metrics appear later, libpcp won't notice. We worked around this by closing and re-opening the archive if pmNameID() failed (sketched below, after this list).

* the function pmGetArchiveEnd() seems to have been broken at some point

* pmLookupDesc() is another of the calls that fails when you lose the update race with pmlogger

* when you have historical archives, it's really, really important to have a stable mapping of instance names to numbers - hence pmdaCacheOp() et al (also sketched below)

* sometimes that stability can't be had, e.g. disk minor numbers changing across reboots, and then the only way out is to trash the entire remembered instance domain

* a PMDA has to begin responding to packets from pmcd quite quickly - within 5 seconds. But it can take a lot longer than that to scan historical archives to fill the averaging buffer, so the PMDA has to multiplex reading archives with responding to pmcd, and until it has finished reading data it has to return a well-formed but empty result to FETCHes (see the last sketch below).
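
To make the first point a bit more concrete, the close-and-reopen workaround amounted to something like this (a minimal sketch from memory using the public libpcp calls; the helper names and the error handling are illustrative, not the real code):

    /* If pmNameID() fails, assume the archive has grown metadata we
     * haven't seen yet, destroy the context and open the archive again.
     * pmNewContext() makes the new context the current one. */
    #include <pcp/pmapi.h>

    static int reopen_archive(int *ctx, const char *path)
    {
        if (*ctx >= 0)
            pmDestroyContext(*ctx);
        *ctx = pmNewContext(PM_CONTEXT_ARCHIVE, path);
        return *ctx;
    }

    static char *name_for(pmID pmid, int *ctx, const char *path)
    {
        char *name;

        if (pmNameID(pmid, &name) >= 0)
            return name;                       /* caller frees */
        if (reopen_archive(ctx, path) < 0)     /* new metrics appeared? */
            return NULL;
        return pmNameID(pmid, &name) >= 0 ? name : NULL;
    }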
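
For the stable instance-numbering point, the pmdaCache calls are the ones doing the work; a minimal sketch (the indom and the name list are placeholders):

    /* Load any mapping persisted by a previous run, add the instances
     * visible now, and save the result, so a given instance name keeps
     * the same instance number across PMDA restarts. */
    #include <pcp/pmapi.h>
    #include <pcp/pmda.h>

    static void refresh_indom(pmInDom indom, char **names, int count)
    {
        int i;

        pmdaCacheOp(indom, PMDA_CACHE_LOAD);   /* prior name->id mappings */
        for (i = 0; i < count; i++)
            pmdaCacheStore(indom, PMDA_CACHE_ADD, names[i], NULL);
        pmdaCacheOp(indom, PMDA_CACHE_SAVE);   /* persist for next run */
    }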
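
And the "well-formed but empty result" trick in the last point looks roughly like this in a fetch callback (sketch only; the readiness flag and the value are placeholders):

    /* Until the historical scan has filled the averaging buffers, answer
     * fetches immediately with "no values" rather than making pmcd wait. */
    #include <pcp/pmapi.h>
    #include <pcp/pmda.h>

    static int backfill_done;   /* set once the archive scan completes */

    static int
    avg_fetchCallBack(pmdaMetric *mdesc, unsigned int inst, pmAtomValue *atom)
    {
        (void)mdesc;
        (void)inst;
        if (!backfill_done)
            return 0;           /* no values yet, but still a valid reply */
        atom->d = 0.0;          /* placeholder: real code fills in the average */
        return 1;               /* one value */
    }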

Hope this helps.

--
Greg.
