Re: [pcp] Multi-Archive Contexts: Scaling and Consistency

To: Dave Brolley <brolley@xxxxxxxxxx>
Subject: Re: [pcp] Multi-Archive Contexts: Scaling and Consistency
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Wed, 11 Nov 2015 00:03:23 -0500 (EST)
Cc: PCP Mailing List <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <564258F5.20309@xxxxxxxxxx>
References: <564258F5.20309@xxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: FIUKsOADxEfPn+oWHwpX2Aa95FQqlQ==
Thread-topic: Multi-Archive Contexts: Scaling and Consistency
Hi Dave,

----- Original Message -----
> Hello All,
> 
> Most of you are probably aware that I have been (slowly) working on
> multi-archive contexts for some time now.

It's a very hard problem, so expect it to take a long time - don't worry
about it.

> The set of archives defined using these methods is then treated as a single
> archive within the context, with no additional effort required on the part of
> the tool.
> 
> [...]
> Tools which use (opts->flags & PM_OPTFLAG_MULTI) continue to work as before.

I wonder if this flag should be propagated more (as in, almost everywhere) - to
anticipate cases where it might not be desirable for certain tools to support
multiple archives (pmdumplog?  pmlogcheck?  some PMAPI 3rd party?  not sure).
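
For illustration, a client opting in might look something like this rough
sketch (option tables abbreviated; a real tool would switch on each option
returned):

#include <pcp/pmapi.h>

static pmOptions opts = {
    .short_options = "a:D:h:?",
    .flags = PM_OPTFLAG_MULTI,  /* accept multiple -a/-h arguments */
};

int
main(int argc, char **argv)
{
    int c;

    while ((c = pmGetOptions(argc, argv, &opts)) != EOF)
        ;                       /* a real tool would switch on c here */
    if (opts.errors) {
        pmUsageMessage(&opts);
        return 1;
    }
    /* ... open context(s) from opts.archives / opts.hosts ... */
    return 0;
}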

> Single PMNS for the entire context
> This is needed for those APIs which have a need to examine the entire PMNS of
> the context. Examples include pmTraversePMNS(3) and pmLookup*(3).
> 
> I propose that this PMNS be built up as each individual archive is accessed.
> 
> The main reason is that consistency checking can then also be performed as
> each archive is accessed. In the case of a consistency issue, it is then
> possible (even probable) that useful data will have been provided to the
> client before the problem is encountered. The label of each archive still
> needs to be examined when the context is opened, in order to determine their
> ordering in the overall time line, but it is not necessary to examine the
> PMNS (.meta) of each until the metrics within them are to be examined.
> 
> In the case of an API call, like pmTraversePMNS(3), we can bite the bullet
> and complete the PMNS of the entire archive set as needed.

There might be quite a few API calls in that set - pmLookupName, pmLookupInDom
- but that may be OK as long as not every client always issues one of those
calls as the first thing it does.  :)

It might not be a problem to read all the .meta files; they tend to be fairly
small (heh, except in certain unusual situations - hi Martins!).
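
To make the lazy PMNS building concrete, something along these lines - all
of the type and helper names here are invented, purely to illustrate the
on-demand completion:

/* simplified stand-in for per-archive state */
typedef struct archive {
    char    *name;              /* archive base name */
} archive_t;

extern int load_meta(archive_t *ap);   /* read the .meta file */
extern int merge_pmns(archive_t *ap);  /* merge names into the global
                                          PMNS, checking consistency */

typedef struct {
    int         narchives;
    int         nloaded;        /* archives with .meta processed so far */
    archive_t   *archives;      /* sorted by label start time */
} multi_ctl_t;

/*
 * Called on demand from the pmTraversePMNS/pmLookupName-style paths:
 * completes the global PMNS by loading any remaining .meta files,
 * stopping at the first inconsistency.
 */
static int
complete_pmns(multi_ctl_t *mp)
{
    while (mp->nloaded < mp->narchives) {
        archive_t *ap = &mp->archives[mp->nloaded];
        int sts;

        if ((sts = load_meta(ap)) < 0 || (sts = merge_pmns(ap)) < 0)
            return sts;         /* I/O or consistency error */
        mp->nloaded++;
    }
    return 0;
}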

> From an implementation point of view the __pmLogCtl->l_pmns of each
> individual archive will reference the global PMNS instead of each
> maintaining its own PMNS as is done today.

Makes sense to me.

> The design of data structures and policies for retention could potentially
> depend on what kinds of usage scenarios we envision. We must also keep in
> mind that, in the case of directories of archives, new archives could be
> dynamically appearing via an active pmlogger or via some other means. They
> could also be dynamically disappearing; however, this is just as easy to
> detect and should probably be treated as an error situation.

These things are guaranteed, and will be the normal situation (archives both
appearing and disappearing) - i.e. in the daily log rotation situation.  So
I think these need to work seamlessly (i.e. it's not an error when an archive
is removed - it's expected, every night, in steady state operation).

> In all use scenarios, we need to maintain the entire active set of archives
> for the purpose of maintaining their order within the time line. In

*nod*

> Scaling Possibilities
> 
>     1. Keep one archive open at a time with no caching of any data from
>     previously accessed archives
>         * must re-read .index and .meta each time we return to the same
>         archive
>             * can still avoid redoing consistency checks
>         * no danger of potentially unused resource build up
>         * optimized for single direction traversal
>         * potentially slow for tools which transition back and forth between
>         archives
>             * but not slower than the initial transition or than each
>             transition in a uni-directional traversal
>     2. Keep one archive open at a time but retain all .index, .meta, caches
>     and all other __pmLogCtl data for previously accessed archives
>         * need only re-open .index and .meta files, no need to re-read
>         * optimized for traversal back and forth between archives from
>         beginning to end
>         * prone to build up of large amounts of potentially
>         never-to-be-used-again data
>     3. Keep one archive open at a time but retain limited .index, .meta,
>     caches and all other __pmLogCtl data for previously accessed archives
>         * keep a cache of this data for the most recently accessed archives
>             * 2 or 3 previous archives might be sufficient
>         * could leave archives in the cache open, including fds, OR need only
>         re-open .index and .meta files, no need to re-read
>         * optimized for traversal back and forth between recent archives but
>         would not slow down a uni-directional traversal
>         * not prone to build up of large amounts of potentially
>         never-to-be-used-again data.
> 
> My feeling is that 1) is the simplest and is optimized for what I believe is
> the most common use case, which is to read the archives in one direction
> from beginning to end or vice-versa. Changing direction across archive
> boundaries would be slower than if we were to cache some data, but no slower
> than a single direction traversal for any of the 3 suggestions. 1) could
> also be easily extended to become 3) should we discover that the performance
> of re-crossing archive boundaries is inadequate.

Makes sense also.  I suspect implementing 3 is going to be required, for doing
the metadata consistency checking in all cases ... but start with 1 and go from
there I guess.
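
If/when 3 is needed, a small MRU list over the per-archive state might be
all it takes - a rough sketch, with NCACHED and the evict callback both
invented for illustration:

/* keep state for only the most recently used archives (option 3) */
#define NCACHED 3

typedef struct {
    int used[NCACHED];  /* archive indices, MRU first; -1 == empty,
                           initialised to -1 at context open */
} lru_t;

/*
 * Note an archive as most-recently-used; on a miss the
 * least-recently-used entry is evicted first (the callback would
 * tear down that archive's cached .index/.meta/__pmLogCtl state).
 */
static void
lru_touch(lru_t *lru, int archive, void (*evict)(int))
{
    int i, j;

    for (i = 0; i < NCACHED; i++)
        if (lru->used[i] == archive)
            break;
    if (i == NCACHED) {                 /* cache miss */
        if (lru->used[NCACHED - 1] != -1)
            evict(lru->used[NCACHED - 1]);
        i = NCACHED - 1;
    }
    for (j = i; j > 0; j--)             /* shuffle to the front */
        lru->used[j] = lru->used[j - 1];
    lru->used[0] = archive;
}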

> I feel that we should cater to the possibility of new archives being created
> within directories but only at the end of the time line for each directory
> (if any). pmlogger(1) would create new archives in this way. I believe that
> handling the creation of new archives at random points in the time line,
> whenever an arbitrary archive boundary is crossed, would be a waste of time.

I'm not following what makes this case more difficult to handle than archives
just appearing at the end of the time line?  It might mean using a tree
structure, perhaps, rather than a simple list/array ... or is there more to it?

Thinking out loud - all archives will have a fixed start time, which *cannot*
ever change - so perhaps that could be the key used to index the (tree?) data
structure.  Since the end point can change (and there's no simple way to tell
if an archive is actively being written), we can't rely on that at all - but
start time may be enough to quickly find candidate archives.
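
i.e. something roughly like this, using simplified stand-ins for the
label/timestamp types, and a lower-bound search rather than an exact match
(a request time usually falls between start times):

#include <stdlib.h>
#include <sys/time.h>

/* simplified stand-in for the per-archive label data */
typedef struct {
    struct timeval  start;      /* from the archive label: never changes */
    char            *name;
} entry_t;

static int
cmp_start(const void *a, const void *b)
{
    const entry_t *x = a, *y = b;

    if (x->start.tv_sec != y->start.tv_sec)
        return (x->start.tv_sec < y->start.tv_sec) ? -1 : 1;
    if (x->start.tv_usec != y->start.tv_usec)
        return (x->start.tv_usec < y->start.tv_usec) ? -1 : 1;
    return 0;
}

/* keep the active set sorted by start time as archives arrive */
static void
sort_archives(entry_t *set, size_t n)
{
    qsort(set, n, sizeof(entry_t), cmp_start);
}

/* find the last archive starting at or before time t - the only
   candidate that could contain data for t (end times deliberately
   unused, since they may still be moving) */
static entry_t *
find_archive(entry_t *set, size_t n, struct timeval t)
{
    entry_t key = { .start = t };
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (cmp_start(&set[mid], &key) <= 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? &set[lo - 1] : NULL;
}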

Tough case to keep in mind: think of two active pmloggers, both recording for
the same host, in the same directory, one logging once a second, the other
once an hour.  At PMAPI-client startup time, the two might appear not to be
overlapping... (and the initial consistency checks might all pass), but that
would change later when the long sampling interval elapses once more.  (ow!)

IOW, it's not easy (or even possible?) to tell if any archive is "finished"
being logged to, so I'd focus away from that as a key/accessor for any data
structures you're using.

> The algorithms below will check for new archives within a directory only
> when a request for data is made for a time just after the end of the time
> line of a given directory. If archives disappear while the context is open,
> then I believe that the errors which occur if/when we attempt to read the
> files will be sufficient.

For directories, I would recommend a combination of an inotify(7) model to
pick up creates/removes and update the data structure on-the-fly, coupled
with an initial readdir(3) scan to find the starting set of archives (labels,
meta, etc).
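
A bare-bones version of that watch loop might look like this - the
add_archive()/drop_archive() helpers are hypothetical and most error
handling is elided:

#include <string.h>
#include <unistd.h>
#include <sys/inotify.h>

/* hypothetical helpers: read the label and insert into the active
   set, or remove an archive from the active set */
extern void add_archive(const char *name);
extern void drop_archive(const char *name);

void
watch_archive_dir(const char *dir)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    int fd = inotify_init1(IN_CLOEXEC);

    inotify_add_watch(fd, dir, IN_CREATE | IN_MOVED_TO |
                               IN_DELETE | IN_MOVED_FROM);
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        char *p;

        if (len <= 0)
            break;
        for (p = buf; p < buf + len;
             p += sizeof(struct inotify_event)
                  + ((struct inotify_event *)p)->len) {
            struct inotify_event *ev = (struct inotify_event *)p;

            /* crude test for an .index volume coming or going - a
               real version would check the name suffix properly */
            if (ev->len == 0 || strstr(ev->name, ".index") == NULL)
                continue;
            if (ev->mask & (IN_CREATE | IN_MOVED_TO))
                add_archive(ev->name);
            else if (ev->mask & (IN_DELETE | IN_MOVED_FROM))
                drop_archive(ev->name);
        }
    }
    close(fd);
}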

> Here are some algorithms for handling various events associated with a
> multi-archive context:
> [...]
> 
> If the request is for a time just beyond the end of the final archive within
> a directory (marked above)
> re-check the directory for new archives and add them to the active set (see

(with inotify, this re-check-the-directory operation would not be needed,
because we'd be notified of new arrivals/removals as they happen, for any
directories)

> check the consistency of the PMNS of the new archive with the
> existing global PMNS
> unmanageable differences are an error

And the above part would happen as any new arrivals/removals happen.

> If a request requires the entire PMNS of the context:

As mentioned earlier, this might be very common, as pmLookupName may need
to be satisfied by a metric (name) being present in any of the archives.

> These algorithms should minimize and even eliminate multi-archive overhead
> [...]
> These are accomplished by using the fact that each archive in the
> set is temporally distinct and that we only need to check for new archives
> when traversing past the end of the final archive in any given directory.
> 

With the inotify addition, I think it gets better (less overhead) than you
are expecting, too.  Ideally, I would not want to scan an entire directory
more than once - well, hopefully.  Often that initial expensive scan happens
at program startup time too, so for long-running processes it's often OK to
take that hit once at startup, but not later when servicing requests or
interacting with a user via a GUI.

> Questions, concerns, ideas, comments ..... please!

It's all sounding promising to me - nothing sticks out as a problem
with those algorithms beyond the minor tweaks suggested above.

cheers.

--
Nathan
