Re: PR1073 - pmlogger PID lifetime-matched logging

To: "Frank Ch. Eigler" <fche@xxxxxxxxxx>, Lukas Berk <lberk@xxxxxxxxxx>
Subject: Re: PR1073 - pmlogger PID lifetime-matched logging
From: Martins Innus <minnus@xxxxxxxxxxx>
Date: Thu, 26 Feb 2015 16:55:05 -0500
Cc: pcp@xxxxxxxxxxx, jpwhite4@xxxxxxxxxxx, tyearke@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <y0ma900csfc.fsf@xxxxxxxx>
References: <87sids32ao.fsf@xxxxxxxxxx> <y0ma900csfc.fsf@xxxxxxxx>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

Lukas and Frank,

On 2/26/2015 3:46 PM, Frank Ch. Eigler wrote:
>> Please see commit 7347927a67849a74b67d8b25fb58c033ee79042d on
>> git://sourceware.org/git/pcpfans.git lberk/dev
>
> Thanks, nice work!  Included some Buffalo folks on cc:, because the
> idea for this PR [1] came from their site needs at CCR, to have
> job-specific pmlogger data.  It would be nice to know whether, with
> this facility, a "pmlc one-shot METRIC" widget would be helpful or
> redundant.

Thanks for this! It will be useful for tracking PIDs on nodes that run a single job. I think a "pmlc one-shot set-of-metrics/configfile" would still be useful, since it would interleave the results into a currently running primary logger. For now we plan to set up separate "once" loggers that we can fire off as needed, then merge the files afterwards. This may be good enough, but we haven't implemented it yet, so we can't be sure there are no issues.

The use case is to annotate, as precisely as possible, times that may be "interesting" in some way. For instance, a single job may consist of preprocessing, compute, and post-processing phases, and we may want to mark the time and collect stats at each phase boundary. We want a record at the exact moment each boundary occurs, regardless of the default sampling interval. We are able to run an arbitrary shell script at these times.
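
A minimal, untested sketch of the separate "once" logger workflow described above. It only builds the command lines that a boundary script might run; it does not execute them. The config contents, metric choice, archive naming scheme, and helper names are all hypothetical. The sketch assumes that `pmlogger -s 1` stops after one sample and that `pmlogextract` can merge the resulting archives afterwards.

```python
import time
from pathlib import Path
from typing import List, Optional

# Hypothetical pmlogger config: record each listed metric a single time.
ONCE_CONFIG = """log mandatory on once {
    kernel.all.load
    mem.util.used
}
"""

def one_shot_command(config: Path, archive_dir: Path, label: str,
                     now: Optional[float] = None) -> List[str]:
    """Build a pmlogger command that takes one sample and then exits.

    label names the boundary (for example "preproc-end"); the UTC timestamp
    in the archive name is an illustrative convention, not a PCP requirement.
    """
    stamp = time.strftime("%Y%m%d.%H.%M.%S", time.gmtime(now))
    archive = archive_dir / f"{label}-{stamp}"
    # -c: config file, -s 1: stop after one sample (assumed behaviour).
    return ["pmlogger", "-c", str(config), "-s", "1", str(archive)]

def merge_command(inputs: List[Path], output: Path) -> List[str]:
    """Build a pmlogextract command: input archives first, output last."""
    return ["pmlogextract", *[str(p) for p in inputs], str(output)]
```

A boundary script could pass the first list to `subprocess.run` at each phase change, and the second could run once after the job ends. Whether merged one-shot archives line up cleanly with the primary logger's archive is exactly the open question mentioned above.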

We can't just raise the default logging frequency, since the same nodes run a mix of jobs, some lasting less than a minute and others lasting days or weeks, and logging frequently enough for the short jobs would generate too much data overall. We already log every 30 sec and still miss some information. With these one-shot type events, we could probably log less often by default. If we didn't have shared resources, this would be much easier, but we could have 10-20 jobs per node, and running a separate logger for each job at its own resolution would create too much data.
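
To make the trade-off concrete, here is a small arithmetic sketch. Only the 30 sec interval comes from the text above; the job durations and the 1 sec alternative are illustrative assumptions, and the function name is hypothetical.

```python
def guaranteed_samples(duration_s: float, interval_s: float) -> int:
    """Samples certain to land inside a job, wherever it starts
    relative to the fixed sampling schedule (one more may land by luck)."""
    return int(duration_s // interval_s)

SHORT_JOB_S = 20            # hypothetical sub-minute job
LONG_JOB_S = 14 * 86400     # hypothetical two-week job

short_at_30s = guaranteed_samples(SHORT_JOB_S, 30)  # short job can be missed entirely
long_at_30s = guaranteed_samples(LONG_JOB_S, 30)
long_at_1s = guaranteed_samples(LONG_JOB_S, 1)      # 30x the data of the 30 sec case
```

Under these assumed numbers, the short job is not guaranteed a single sample at 30 sec, while sampling every second to catch it would multiply the long job's record count by 30.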

I think this solution is very useful, but we would also use the one-shot facility if it existed.

Thanks.

Martins
