
Fwd: PMDAs for lm_sensors, HDD SMART monitoring

To: pcp@xxxxxxxxxxx
Subject: Fwd: PMDAs for lm_sensors, HDD SMART monitoring
From: "David O'Shea" <dcoshea@xxxxxxxxx>
Date: Sat, 2 Jan 2016 19:46:14 +1030
In-reply-to: <CAN0DjM0VEmykF15F=ZfmRGjpog0knCyvnx_YrmuAik979huW5w@xxxxxxxxxxxxxx>
References: <CAN0DjM1GGZJ2MOdDohbaf7WZ25j3g_7CxzfWxVvKH=a2pKcLAw@xxxxxxxxxxxxxx> <y0md1tpqmoe.fsf@xxxxxxxx> <CAN0DjM0VEmykF15F=ZfmRGjpog0knCyvnx_YrmuAik979huW5w@xxxxxxxxxxxxxx>
Oops, I failed to send this to the list:

---------- Forwarded message ----------
From: David O'Shea <dcoshea@xxxxxxxxx>
Date: Wed, Dec 30, 2015 at 12:08 PM
Subject: Re: PMDAs for lm_sensors, HDD SMART monitoring
To: "Frank Ch. Eigler" <fche@xxxxxxxxxx>


Hi Frank,

Thanks for your mail, please see below:

On Wed, Dec 30, 2015 at 2:20 AM, Frank Ch. Eigler <fche@xxxxxxxxxx> wrote:

>> As for HDD SMART, I managed to get a Python PMDA working which can collect a
>> few metrics, but I have a lot of questions and comments (but I'll save some
>> for later):
>>
>> - When I use dbpmda's timer, it takes 500 milliseconds for a response to be
>> returned - is that too long?

> Not too long individually, but longer than PCP clients like to wait.
> We may need to use a background-thread kind of processing where
> smartctl latency does not need to be paid by the clients.

Okay, thanks - I figured that would probably be the case, so I have already given this some thought. Every time I run 'smartctl' it seems to access the drive, so I don't want to run it once a second just because pmchart happens to be on its default settings and pointed at these attributes. Suppose I have a thread that fetches the values on a timer - configurable, but with a default of, say, 5 minutes - and stores them in memory, and requests from PMCD then return the in-memory values. Would that be appropriate and consistent with other PMDAs?
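
Something like the following rough sketch is what I had in mind - all the names (REFRESH_SECONDS, fetch_value, _parse_attributes) are made up, and I'm assuming a plain Python daemon thread is acceptable inside a PMDA. One wrinkle is that a fetch could arrive before the first refresh completes, so the callback would need to cope with a missing value:

    # Sketch only: refresh smartctl data in the background so that PMCD
    # fetch requests never pay smartctl's latency. All names hypothetical.
    import subprocess
    import threading
    import time

    REFRESH_SECONDS = 300        # configurable; default five minutes
    _values = {}                 # device -> {attribute name: raw value}
    _lock = threading.Lock()

    def _parse_attributes(text):
        # Simplified stand-in for real parsing of 'smartctl -A' output:
        # keep the ATTRIBUTE_NAME and RAW_VALUE columns of each data row.
        attrs = {}
        for line in text.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
                attrs[fields[1]] = int(fields[9])
        return attrs

    def _refresh_loop():
        while True:
            fresh = {}
            for device in ("/dev/sda",):   # real code would discover devices
                # NB: smartctl can exit non-zero even when it produces
                # output, so real code should tolerate that.
                out = subprocess.check_output(["smartctl", "-A", device])
                fresh[device] = _parse_attributes(out.decode())
            with _lock:
                _values.update(fresh)
            time.sleep(REFRESH_SECONDS)

    _thread = threading.Thread(target=_refresh_loop)
    _thread.daemon = True
    _thread.start()

    def fetch_value(device, attribute):
        # Called from the PMDA fetch callback: return the cached value
        # immediately instead of running smartctl per request.
        with _lock:
            return _values.get(device, {}).get(attribute)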

>> - http://www.pcp.io/books/PCP_PG/html/id5190481.html (pcp-programmers-guide
>> Section 2.3.4.1 "Instance Identification") says "It is preferable, although
>> not mandatory, for the association between an external instance name (string)
>> and internal instance identifier (numeric) to be persistent." Does this mean
>> persistent while the PMDA is running, or persistent across restarts of the
>> PMDA or the machine it is running on?

> Yeah, the documentation should be clearer in its terminology. We
> have not been clear about what sort of persistence a client is
> entitled to assume. (Thus e.g. see SGI PCP PR 1131.) At the minimum,
> of course, we need persistence during a single connection. The common
> level of effort seems to be persistence across restarts of the PMDA on
> the same system/uptime.

So not necessarily persistent across system restarts?
>> If it means persistent across restarts, does pmdaCache help with
>> that?

> Yes, that's what it's for, but even that cannot provide indefinite
> persistence, as the cache is a cache, and may be flushed.

Thanks - I have since found the pmdaCache man page, which points to where the cache is stored.
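
For my own notes, the idea boils down to persisting the external-name-to-internal-id mapping so the same drive keeps the same instance number across restarts. A hand-rolled illustration of just that idea - emphatically not the real pmdaCache API, and the file path is made up:

    # Illustration of the concept only; a real PMDA should use
    # pmdaCache(3) instead. MAP_FILE is a made-up path.
    import json
    import os

    MAP_FILE = "/tmp/smart-pmda-instmap.json"

    def load_map():
        # Reload the external-name -> internal-id mapping, if any.
        if os.path.exists(MAP_FILE):
            with open(MAP_FILE) as f:
                return json.load(f)
        return {}

    def instance_id(name, mapping):
        # Assign the next free internal id to a new external name and
        # persist the mapping so it survives PMDA restarts.
        if name not in mapping:
            mapping[name] = max(mapping.values()) + 1 if mapping else 0
            with open(MAP_FILE, "w") as f:
                json.dump(mapping, f)
        return mapping[name]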

>> I assume I should have a configuration file for creating metrics
>> from attributes so users can choose to map them both to
>> "Unknown_Attribute_16" or perhaps have model-specific attributes
>> "Unknown_Attribute_16_WD..." and "Unknown_Attribute_16_HGST...".
>> Does this sound reasonable?

> IMHO we should do whatever we can to avoid having to have a
> configuration file, and instead have the pmda do a Sensible Thing
> automatically if at all possible. In this case, for example we could
> have
>
>     smartd["device"].attribute.number_1{,.max,.threshold,.etc.?}
>     ...
>     smartd["device"].attribute.number_255
>     smartd["device"].health
>
> for low-level portable access, and
>
>     smartd["device"].attribute.seek_error_rate
>
> for general ones, and per-device specialized ones:
>
>     smartd["device"].attribute.wd_power_off_retract_count

So, just to be clear: given that, for example, Reallocated_Sector_Ct is attribute #5, then for a given drive both .reallocated_sector_ct and .number_5 would give the same value? In other words, there would be different ways (aliases) to reach the same underlying metric, even though from PCP's point of view they would be distinct metrics?
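
In code I imagine the aliasing would just be two PMNS names registered against one PMID, something like this sketch modelled on pmdasimple.python (the metric names and numbers are invented, and I don't know yet whether the bindings object to a duplicated PMID):

    # Hypothetical sketch: two PMNS names registered against one PMID, so
    # .number_5 and .reallocated_sector_ct fetch the identical value.
    import cpmapi as c_api
    from pcp.pmapi import pmUnits
    from pcp.pmda import PMDA, pmdaMetric

    class SmartPMDA(PMDA):
        def __init__(self, name, domain):
            PMDA.__init__(self, name, domain)
            pmid = self.pmid(0, 5)    # cluster 0, item 5 = attribute #5
            for pmns_name in ('smart.attribute.number_5',
                              'smart.attribute.reallocated_sector_ct'):
                self.add_metric(pmns_name, pmdaMetric(
                    pmid, c_api.PM_TYPE_U64, c_api.PM_INDOM_NULL,
                    c_api.PM_SEM_INSTANT, pmUnits(0, 0, 0, 0, 0, 0)))
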
> Looking closer at how smartctl does it, they reference a
> centrally-distributed header file to compute the equivalent of the
> latter. See [man update-smart-drivedb] and
> /usr/share/smartmontools/drivedb.h - the new pmda could use that same
> header file. (If the pmda were written in C, the header could be
> compiled in; if it were python it could parse it.) So maybe a
> configuration file is not that bad - especially if we can offload it
> to another package instead of to a pcp sysadmin.

I think that as long as I can determine the available metrics at runtime (which sounds possible, though I haven't tried it), I don't need to parse drivedb.h - I can just parse the output of 'smartctl' to work out those mappings. My problem is giving them unique item numbers, which I don't think drivedb.h will help with.
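
To make that concrete, here is roughly how I'd pull the attribute-number-to-name mapping for one drive straight out of 'smartctl -A' (simplified, and the column positions are just what I see in the output here):

    # Sketch: derive the {attribute number: name} mapping at runtime from
    # 'smartctl -A' output, with no need to parse drivedb.h.
    import subprocess

    def attribute_names(device):
        out = subprocess.check_output(["smartctl", "-A", device]).decode()
        mapping = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 2 and fields[0].isdigit():
                # e.g. 5 -> 'reallocated_sector_ct'
                mapping[int(fields[0])] = fields[1].lower()
        return mapping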

I don't suppose there's another "cache" to help with this, is there? If not, maybe I could hash the attribute name to come up with an item number that is less likely to change as disks are added or removed. I assume it isn't critical for the item numbers to be persistent forever, given that pmchart at least saves the metric names rather than the numbers.
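
The hashing idea might look like the sketch below - the 0x3FF mask assumes I'm right that the PMID item field is only 10 bits wide. Probing on a collision makes the assignments order-dependent again, though, so some persisted record would probably still be wanted:

    # Sketch: derive a stable-ish item number from the attribute name.
    # The 0x3FF mask assumes the PMID item field is 10 bits wide.
    import zlib

    def item_number(name, taken):
        item = zlib.crc32(name.encode()) & 0x3FF
        while item in taken and taken[item] != name:
            item = (item + 1) & 0x3FF     # linear probe on a collision
        taken[item] = name
        return item
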
>> I assume the second element in the array - 0 or 1 in these examples -
>> corresponds to [PMDA_FETCH_*] definitions from pmda.h?
>> [...]
>> If so, it'd be nice if pmda.py defined those constants itself (or possibly
>> they could be extracted using something like SWIG but I have never tried
>> using that myself), as I struggled to work this out.

> Yeah - they're already in at least one dictionary in the
> src/python/pmda.c binding; we're just not using it.

Thanks, I'll see if I can find them.
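
In the meantime I'll probably just mirror them locally. The values below are what I see in my copy of pmda.h (worth double-checking), and lookup() is a made-up helper standing in for wherever the values come from:

    # Local mirror of the fetch-callback status codes until pmda.py
    # exports them; values copied from my pmda.h, so worth re-checking.
    import cpmapi as c_api

    PMDA_FETCH_NOVALUES = 0
    PMDA_FETCH_STATIC = 1

    def fetch_callback(cluster, item, inst):
        value = lookup(cluster, item, inst)    # hypothetical helper
        if value is None:
            return [c_api.PM_ERR_VALUE, PMDA_FETCH_NOVALUES]
        return [value, PMDA_FETCH_STATIC]
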
>> - It would be nice if there were a sequence diagram (generated using e.g.
>> http://www.mcternan.me.uk/mscgen/) showing how PDUs being sent to the PMDA
>> get translated into various calls, and what order they are in. I think I
>> know how this works, but I'm not totally sure yet!

> FWIW, I've used systemtap in the past to trace dynamic call graphs
> related to pmda/pdu processing.

Thanks, that sounds good - I'll put learning how to do that on my to-do list :)


Incidentally, since you are from Red Hat, I have hit some issues with PCP on CentOS 7.2:

- SELinux problems with the nVidia PMDA: I sent a few emails to the CentOS list, no response so far: https://lists.centos.org/pipermail/centos/2015-December/156952.html
- If I recall correctly, I found a missing package dependency for the SNMP PMDA (I think it was lacking perl(Net::SNMP)).

What is the most effective thing I could do about these issues - is posting about them here useful?

Thanks in advance,
David
