
Re: PMDAs for lm_sensors, HDD SMART monitoring

To: "David O'Shea" <dcoshea@xxxxxxxxx>
Subject: Re: PMDAs for lm_sensors, HDD SMART monitoring
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Sat, 2 Jan 2016 10:34:16 -0500
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <CAN0DjM0VEmykF15F=ZfmRGjpog0knCyvnx_YrmuAik979huW5w@xxxxxxxxxxxxxx>
References: <CAN0DjM1GGZJ2MOdDohbaf7WZ25j3g_7CxzfWxVvKH=a2pKcLAw@xxxxxxxxxxxxxx> <y0md1tpqmoe.fsf@xxxxxxxx> <CAN0DjM0VEmykF15F=ZfmRGjpog0knCyvnx_YrmuAik979huW5w@xxxxxxxxxxxxxx>
User-agent: Mutt/1.4.2.2i
Hi, David -


> [...]  If I had a thread which fetches the values based on a timer,
> which is configurable but has a default (say 5 minutes), and then
> stores these in memory, so that requests from PMCD return the
> in-memory values, would that be appropriate/consistent with other
> PMDAs?

That would be a reasonable approach.  One thing to watch out for
though is the load this machinery imposes on the system, even if no
client is actually collecting that data and/or no underlying
conditions are changing.  It's worth measuring the impact to see
whether it could be a problem.
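
A rough sketch of that timer-driven shape in Python might look like the
following; the SmartCollector name, the 300-second default and the
_scrape_smartctl() helper are all placeholders here, and the actual PMDA
metric/fetch-callback wiring is left out:

import threading
import time

class SmartCollector:
    """Background thread refreshing an in-memory snapshot on a timer."""

    def __init__(self, interval=300):             # assumed 5-minute default
        self.interval = interval
        self.lock = threading.Lock()
        self.values = {}                           # e.g. {(device, attr_id): raw}
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            snapshot = self._scrape_smartctl()     # placeholder for the real scrape
            with self.lock:
                self.values = snapshot
            time.sleep(self.interval)

    def _scrape_smartctl(self):
        return {}                                  # see the smartctl sketch further down

    def lookup(self, device, attr_id):
        # called from the PMDA fetch callback; reads only the cached snapshot
        with self.lock:
            return self.values.get((device, attr_id))

Note that this loop runs whether or not any client is fetching the
metrics, which is exactly the cost worth measuring.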

For comparison, the pmdarpm collector thread also runs in the
background, but it is triggered by *notify events to rescan a changed
rpm database rather than running on a timer.  OTOH it does that
whether or not a client is currently interested in the rpm metrics.
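
Very roughly, that change-driven shape looks like this (a Linux-only
Python sketch using inotify via ctypes; pmdarpm itself is written in C
and its details differ, and the /var/lib/rpm path and rescan() helper
are just stand-ins):

import ctypes
import os

IN_CLOSE_WRITE = 0x008
IN_MOVED_TO    = 0x080
IN_CREATE      = 0x100
IN_DELETE      = 0x200

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def rescan():
    pass                    # placeholder for re-reading the database

def watch_and_rescan(path=b"/var/lib/rpm"):
    fd = libc.inotify_init()
    libc.inotify_add_watch(fd, path,
                           IN_CLOSE_WRITE | IN_MOVED_TO | IN_CREATE | IN_DELETE)
    while True:
        os.read(fd, 4096)   # blocks until something under the path changes
        rescan()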

For another comparison, pmdapapi starts collecting perfctr data only
after/while clients fetch those specific metrics.  It continues
collecting those as long as clients keep polling (with a timeout),
then releases the perfctr resources back to the kernel.
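
The fetch-triggered pattern can be approximated like this (illustrative
Python only; the class name, the 10-second idle timeout and the
_acquire/_release/_read helpers are not taken from pmdapapi):

import threading
import time

class OnDemandCollector:
    def __init__(self, idle_timeout=10):          # assumed idle timeout
        self.idle_timeout = idle_timeout
        self.lock = threading.Lock()
        self.last_fetch = 0.0
        self.active = False

    def fetch(self):
        # called from the PMDA fetch callback for these specific metrics
        with self.lock:
            self.last_fetch = time.time()
            if not self.active:
                self._acquire()                   # e.g. claim perfctr-like resources
                self.active = True
        return self._read()

    def reap(self):
        # call periodically; releases resources once clients stop polling
        with self.lock:
            if self.active and time.time() - self.last_fetch > self.idle_timeout:
                self._release()
                self.active = False

    def _acquire(self): pass                      # placeholders for the real work
    def _release(self): pass
    def _read(self): return {}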

(For another comparison, see the manual pmstore-style control also in
pmdapapi, but IMHO that's not a model that should be followed.)


> > [...]  At the minimum, of course, we need persistence during a
> > single connection.  The common level of effort seems to be
> > persistence across restarts of the PMDA on the same system/uptime.

> So not necessarily persistent across system restarts?

Well, the more persistent, the better, in the sense that it allows
client-side tooling the most opportunity to be simpleminded. :-) 
Using pmdacache is one way to make that more likely.

(Consider also not just system restarts, but system software updates
and hardware changes (=> different SMART variables/state).)
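
As far as I recall, in the Python PMDA bindings handing a dict to
replace_indom() drives the underlying instance-domain cache, so a given
drive keeps its instance number across PMDA restarts.  A minimal sketch,
with the domain/indom numbers and the drive-naming scheme invented here
(keying instances on model+serial rather than sda/sdb helps them survive
hardware reshuffles):

from pcp.pmda import PMDA, pmdaIndom

class SmartPMDA(PMDA):
    def __init__(self, name, domain):
        PMDA.__init__(self, name, domain)
        self.drives = {}                          # instance name -> per-drive state
        self.drive_indom = self.indom(0)
        self.add_indom(pmdaIndom(self.drive_indom, self.drives))

    def refresh_drives(self, discovered):
        # discovered: e.g. {'WDC_WD20EFRX-serialnumber': state, ...}
        self.drives.clear()
        self.drives.update(discovered)
        self.replace_indom(self.drive_indom, self.drives)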


> [...]  So, just to be clear, given that for example
> Reallocated_Sector_Ct is attribute #5, then for a given drive, both
> .reallocated_sector_ct and .number_5 would give the same value,
> i.e. there would be different ways (aliases) to get to the same
> actual metric (although from PCP's point of view they would be
> different metrics)?

Yup, that would be fine.  For configuration convenience, it may be
helpful to separate the low-level numbered attributes from the aliases
by PMNS nesting, so that a generic pmlogger configuration can choose
one set or the other (and so reduce stored data duplication).
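
Concretely, the two subtrees might hang off the same fetch path, along
these lines (metric names, cluster/item numbers, types and the
lookup_attribute() helper are all illustrative, not a worked-out
numbering scheme):

from pcp.pmda import PMDA, pmdaMetric
from pcp.pmapi import pmUnits
import cpmapi as c_api

class SmartPMDA(PMDA):
    def __init__(self, name, domain):
        PMDA.__init__(self, name, domain)
        nounits = pmUnits(0, 0, 0, 0, 0, 0)
        # the named alias lives in one subtree...
        self.add_metric('smart.attr.reallocated_sector_ct',
            pmdaMetric(self.pmid(0, 5), c_api.PM_TYPE_U64,
                       self.indom(0), c_api.PM_SEM_INSTANT, nounits))
        # ...and the raw attribute id in another, so a pmlogger config can
        # choose to log smart.attr or smart.id but not both
        self.add_metric('smart.id.number_5',
            pmdaMetric(self.pmid(1, 5), c_api.PM_TYPE_U64,
                       self.indom(0), c_api.PM_SEM_INSTANT, nounits))
        self.set_fetch_callback(self.fetch_callback)

    def fetch_callback(self, cluster, item, inst):
        # both clusters resolve to the same cached attribute value
        value = self.lookup_attribute(item, inst)  # hypothetical helper
        if value is None:
            return [c_api.PM_ERR_PMID, 0]
        return [value, 1]

    def lookup_attribute(self, item, inst):
        return None                                # placeholder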


> [...]  I think that so long as I can determine the available metrics
> at runtime (which sounds like it is possible, but I haven't tried
> it), I don't need to parse the drivedb.h, I can just parse the
> output of 'smartctl' to work out those mappings.  

Yeah, if you're planning to do it by running smartctl and scraping
its output, sure.
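
For ATA drives that scrape could look roughly like this, assuming the
usual "ID# ATTRIBUTE_NAME ... RAW_VALUE" table printed by 'smartctl -A'
(the exact output varies by drive and smartmontools version, so treat
the regex as a starting point):

import re
import subprocess

ATTR_LINE = re.compile(r'^\s*(\d+)\s+(\S+)\s+.*\s(\S+)\s*$')

def scrape_attributes(device):
    out = subprocess.run(['smartctl', '-A', device],
                         capture_output=True, text=True).stdout
    attrs = {}
    in_table = False
    for line in out.splitlines():
        if line.startswith('ID#'):
            in_table = True
            continue
        if in_table:
            m = ATTR_LINE.match(line)
            if not m:
                break                  # a non-matching line ends the table
            attr_id, name, raw = m.groups()
            attrs[int(attr_id)] = (name, raw)
    return attrs

# e.g. scrape_attributes('/dev/sda')
#   -> {5: ('Reallocated_Sector_Ct', '0'), 194: ('Temperature_Celsius', '31'), ...}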


> My problem is giving them unique item numbers, which I don't think
> drivedb.h will help with.

(Well, drivedb.h could give you a unique ordinal number for the
attribute name string.  'course drivedb.h itself may change over
time!)


> I don't suppose there's another "cache" to help with this, is there?  

You can use multiple caches if you need them.

> If not, maybe I could hash the attribute name [...]

(Let's hope that heuristics like that are not necessary.)


> [...]
> Incidentally, since you are from Red Hat, I have hit some issues with PCP
> on CentOS 7.2: [...]
>
> What is the most effective thing I could do about these issues - is posting
> about them here useful?

Sure; outright bugs might as well go to bugzilla.redhat.com (or
perhaps bugs.centos.org, though I've never been there and don't know
the details of that information flow).


- FChE
