
Re: PMDAs for lm_sensors, HDD SMART monitoring…

To: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Subject: Re: PMDAs for lm_sensors, HDD SMART monitoring…
From: "David O'Shea" <dcoshea@xxxxxxxxx>
Date: Sun, 3 Jan 2016 20:45:57 +1030
Cc: pcp developers <pcp@xxxxxxxxxxx>
Hi Frank,

On Sun, Jan 3, 2016 at 2:04 AM, Frank Ch. Eigler <fche@xxxxxxxxxx> wrote:
Hi, David -


> [...] If I had a thread which fetches the values based on a timer,
> which is configurable but has a default (say 5 minutes), and then
> stores these in memory, and then requests from PMCD return the
> in-memory values, would that be appropriate/consistent with other
> PMDAs?

That would be a reasonable approach. One thing to watch out for
though is the load imposed on the system by this machinery, even if no
client is actually collecting that data, and/or if no underlying
conditions are changing. It's worth measuring impact to see if it
could possibly be a problem.

Will do.
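For my own reference, the cached-collector loop I have in mind looks roughly like this (a minimal sketch only; `fetch_all_attributes` and the 5-minute default are placeholders of mine, not pySMART or PCP API calls):

```python
import threading

REFRESH_SECS = 300  # proposed default refresh interval (5 minutes)

class CachedCollector:
    """Background thread that refreshes an in-memory attribute cache."""

    def __init__(self, fetch_all_attributes, interval=REFRESH_SECS):
        self._fetch = fetch_all_attributes
        self._interval = interval
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._cache = {}
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            values = self._fetch()           # e.g. {('sda', 5): 0, ...}
            with self._lock:
                self._cache = values
            self._stop.wait(self._interval)  # sleep, but wake early on stop()

    def lookup(self, key):
        # Called from the PMCD fetch path: never touches the disks.
        with self._lock:
            return self._cache.get(key)
```

So fetch requests only ever read the cache, and all the slow smartctl work happens off to the side on the timer.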
For comparison, the pmdarpm collector thread also runs in the
background, and is triggered by *notify events to rescan a changed rpm
database - it doesn't do it by timer. OTOH it does that whether or
not a client is currently interested in the rpm metrics.

By "*notify events" you mean e.g. inotify, rather than any kind of PCP thing, right? Unfortunately I can't find any evidence of a scheme for finding out about changes in SMART attributes via notifications from the disk itself.

Perhaps I could use a scheme like this to find out when disks are added to or removed from the system, so I could avoid running 'smartctl --scan-open' all the time to detect that, but I was also hoping for this PMDA to be usable on Windows.
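If I do end up polling, the discovery side is just scraping the scan output, something like this (the line format here is an assumption based on what 'smartctl --scan-open' prints on my machine, e.g. "/dev/sda -d sat # /dev/sda [SAT], ATA device"):

```python
def parse_scan_open(output):
    """Pull device paths out of 'smartctl --scan-open' output.

    Assumes each line is 'device options # comment', as seen locally.
    """
    devices = []
    for line in output.splitlines():
        line = line.split('#', 1)[0].strip()  # drop the trailing comment
        if line:
            devices.append(line.split()[0])   # first token is the device path
    return devices
```

Comparing the parsed set against the previous scan would then tell me which disks appeared or disappeared.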
For another comparison, pmdapapi starts collecting perfctr data only
after/while clients fetch those specific metrics. It continues
collecting those as long as clients keep polling (with a timeout),
then releases the perfctr resources back to the kernel.

If I were to use such a scheme for SMART, I assume I'd still need to read the attributes once at startup, and again whenever a new disk appears, since I need to be able to report all the available metrics correctly? That sounds a bit more complex.

Also, wouldn't it be silly to install a PMDA you're never going to retrieve metrics from anyway?
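Just to make sure I understand the pmdapapi-style scheme, I imagine the shape is something like this (placeholder names of my own, not the papi PMDA's actual code): start collecting on the first fetch, and release the resources once no client has fetched for a while.

```python
import time

IDLE_TIMEOUT = 60.0  # release resources after this much fetch silence

class OnDemandCollector:
    """Collect only while clients are actively fetching, with a timeout."""

    def __init__(self, start_collecting, stop_collecting,
                 timeout=IDLE_TIMEOUT, clock=time.monotonic):
        self._start = start_collecting
        self._stop = stop_collecting
        self._timeout = timeout
        self._clock = clock
        self._active = False
        self._last_fetch = 0.0

    def on_fetch(self):
        # Called from the PMDA fetch callback for these metrics.
        self._last_fetch = self._clock()
        if not self._active:
            self._start()
            self._active = True

    def on_timer(self):
        # Called periodically; releases resources once clients go quiet.
        if self._active and self._clock() - self._last_fetch > self._timeout:
            self._stop()
            self._active = False
```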

> [...] So, just to be clear, given that for example
> Reallocated_Sector_Ct is attribute #5, then for a given drive, both
> .reallocated_sector_ct and .number_5 would give the same value,
> i.e. there would be different ways (aliases) to get to the same
> actual metric (although from PCP's point of view they would be
> different metrics)?

Yup, that would be fine. For configuration convenience, it may be
helpful to separate the low level numbered attributes from the aliases
by PMNS nesting, so that a generic pmlogger configuration can choose
one set or the other (so as to reduce stored data duplication).

Something like:

smart.attr.by_number.99
smart.attr.by_name.reallocated_sector_ct

(ignoring how I'll deal with vendor-specific attributes for now)?
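Concretely, I imagine generating both subtrees from the same attribute table, roughly like this (the table entries here are just illustrative, not a claim about which attributes I'll support):

```python
# Illustrative attribute table: SMART attribute number -> alias name.
ATTRS = {
    5: 'reallocated_sector_ct',
    9: 'power_on_hours',
    194: 'temperature_celsius',
}

def build_pmns(attrs):
    """Emit both the by_number and by_name PMNS leaf names."""
    names = []
    for number, name in sorted(attrs.items()):
        names.append('smart.attr.by_number.%d' % number)
        names.append('smart.attr.by_name.%s' % name)
    return names
```

That way a generic pmlogger configuration can log smart.attr.by_name (or by_number) wholesale without duplicating data.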


> [...] I think that so long as I can determine the available metrics
> at runtime (which sounds like it is possible, but I haven't tried
> it), I don't need to parse the drivedb.h, I can just parse the
> output of 'smartctl' to work out those mappings.

Yeah, if you're planning to do it by running smartctl and scraping
its output, sure.

Yeah, I figured the other option is to use libatasmart, but that would make it harder to port to Windows, whereas I'm using pySMART to scrape the 'smartctl' output, and it already claims to support Windows to some extent.
> My problem is giving them unique item numbers, which I don't think
> drivedb.h will help with.

(Well, drivedb.h could give you a unique ordinal number for the
attribute name string. 'course drivedb.h itself may change over
time!)

Oh I see, you're suggesting that if I see attribute 009 "Power_On_Minutes", I could look in this file and work out that these are all the unique names for attribute 009 in the order they appear in the file:

(1) Power_On_Hours_and_Msec
(2) Power_On_Hours
(3) Proprietary_9
(4) Power_On_Seconds
(5) Power_On_Minutes
(6) Power_On_Half_Minutes

(I hope that is the worst example :) ) so I can use ordinal 5 to make a unique number for the metric? This sounds good.
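In code that would be something like the sketch below; packing the attribute id and the ordinal into PCP's cluster/item split is my own assumption about how to turn this into a PMID:

```python
def number_attributes(names_by_id):
    """Given the distinct names seen for each attribute id, in drivedb.h
    order, derive a stable (cluster, item) pair per (id, name).

    Uses the SMART attribute id as the PCP cluster and the ordinal of
    the name within that id as the item (an encoding I'm assuming, not
    anything PCP mandates).
    """
    ids = {}
    for attr_id, names in names_by_id.items():
        for ordinal, name in enumerate(names, start=1):
            ids[(attr_id, name)] = (attr_id, ordinal)
    return ids
```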

A comment in the file says:

 * The table will be searched from the start to end or until the first match,
 * so the order in the table is important for distinct entries that could match
 * the same drive.

so I guess we can't hope that new entries are always added at the end of the file, but hopefully the opportunities for ambiguity - where the order really matters - arise only between drives from the same manufacturer, so in general this scheme would generate stable metric numbers.

I also just found that the 'smartctl --presets=showall' command gives a dump of this information which is much easier to parse, and which is based not just on /usr/share/smartmontools/drivedb.h but also /etc/smartmontools/smart_drivedb.h (or some alternative file(s) specified on the command line).

I take it that this solution, whilst not guaranteeing that metric numbers will be stable forever, is probably good enough?

> [...]
> Incidentally, since you are from Red Hat, I have hit some issues with PCP
> on CentOS 7.2: [...]
>
> What is the most effective thing I could do about these issues - is posting
> about them here useful?

Sure; outright bugs might as well go to bugzilla.redhat.com (or
perhaps bugs.centos.org, though I've never been there, and don't
know the details of that information flow.)

Thanks! I gather that sometimes people aren't happy about bugs filed against RHEL that have only been found in CentOS, but I'll give it a try.

Regards,
David