
Re: PMDAs for lm_sensors, HDD SMART monitoring

To: "David O'Shea" <dcoshea@xxxxxxxxx>
Subject: Re: PMDAs for lm_sensors, HDD SMART monitoring
From: fche@xxxxxxxxxx (Frank Ch. Eigler)
Date: Tue, 29 Dec 2015 10:50:09 -0500
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <CAN0DjM1GGZJ2MOdDohbaf7WZ25j3g_7CxzfWxVvKH=a2pKcLAw@xxxxxxxxxxxxxx> (David O'Shea's message of "Tue, 29 Dec 2015 11:30:48 +1030")
References: <CAN0DjM1GGZJ2MOdDohbaf7WZ25j3g_7CxzfWxVvKH=a2pKcLAw@xxxxxxxxxxxxxx>
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)
Hi, David -

> I'm new to PCP

Welcome.  You're asking all the right questions.


> I assume the existing lmsensors PMDA must be [... old].  I'm
> thinking about doing a re-write in Python that dynamically creates
> the metrics based on the sensors that are available.  Perhaps if it
> uses libsensors it will work with various versions of the kernel.
> Any comments on this?

That would make a lot of sense. 


> As for HDD SMART, I managed to get a Python PMDA working which can collect a
> few metrics, but I have a lot of questions and comments (but I'll save some
> for later):
>
> - When I use dbpmda's timer, it takes 500 milliseconds for a response to be
> returned, is that too long?

Not too long for a single request, but longer than PCP clients like to
wait.  We may need some kind of background-thread processing, so that
clients do not have to pay the smartctl latency.
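
A minimal sketch of that background-thread idea, in plain Python and
not tied to any PCP API: a worker thread pays the collection latency
on its own schedule, and the fetch path only reads the last cached
sample.  The names BackgroundPoller and slow_collect are hypothetical,
and slow_collect merely simulates a slow smartctl run.

```python
import threading
import time


class BackgroundPoller:
    """Refresh a cached value on a worker thread so reads never block."""

    def __init__(self, collect, interval=60.0):
        self._collect = collect      # slow function, e.g. one that runs smartctl
        self._interval = interval    # seconds between refreshes
        self._lock = threading.Lock()
        self._latest = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            value = self._collect()          # latency is paid here only
            with self._lock:
                self._latest = value
            self._stop.wait(self._interval)  # sleep, but wake early on stop()

    def latest(self):
        """Return the most recent sample, or None before the first refresh."""
        with self._lock:
            return self._latest


def slow_collect():
    # Hypothetical stand-in for running and parsing smartctl.
    time.sleep(0.05)
    return {"sda": {"temperature_celsius": 35}}
```

A fetch callback would then call latest() and report "no values" while
it is still None.  The trade-off is that clients see data up to one
interval old.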


> - http://www.pcp.io/books/PCP_PG/html/id5190481.html (pcp-programmers-guide
> Section 2.3.4.1 "Instance Identification") says "It is preferable, although
> not mandatory, for the association between an external instance name (string)
> and internal instance identifier (numeric) to be persistent."  Does this mean
> persistent while the PMDA is running or persistent across restarts of the PMDA
> or the machine it is running on?

Yeah, the documentation should be clearer in its terminology.  We
have not been clear as to what sort of persistence a client is
entitled to assume.  (See e.g. SGI PCP PR 1131.)  At the minimum,
of course, we need persistence during a single connection.  The common
level of effort seems to be persistence across restarts of the PMDA on
the same system during the same uptime.

> If it means persistent across restarts, does pmdaCache help with
> that?

Yes, that's what it's for, but even that cannot provide indefinite
persistence, as the cache is a cache, and may be flushed.
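
To illustrate the idea only (this is not the pmdaCache API, just a
plain-Python sketch of the same concept), a name-to-identifier map can
be saved to a file so that a restarted PMDA hands out the same
internal identifiers.  The class name InstanceIdMap and the JSON file
format are assumptions made for this example.

```python
import json
import os


class InstanceIdMap:
    """Map external instance names to stable numeric ids, saved on disk."""

    def __init__(self, path):
        self._path = path
        self._ids = {}
        if os.path.exists(path):
            with open(path) as f:
                self._ids = json.load(f)   # reuse ids from a previous run

    def lookup(self, name):
        """Return the id for name, allocating and saving a new one if needed."""
        if name not in self._ids:
            self._ids[name] = max(self._ids.values(), default=-1) + 1
            self._save()
        return self._ids[name]

    def _save(self):
        # Write to a temporary file first so a crash cannot truncate the map.
        tmp = self._path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self._ids, f)
        os.replace(tmp, self._path)
```

As with the real cache, the guarantee holds only as long as the file
survives: delete it and the identifiers start again from zero.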


> - I'm most interested in SMART attributes, e.g. reallocated sectors,
> temperatures.  A drive might have around 20 of these [...]
> Is there such a thing as exposing too many metrics?

Not in the hundreds range - go for it.  If you're talking millions, yes.
It will take some thought to come up with a future-proof nomenclature
though:

> - A further complication is that if I have two drives with different
> models, they might not have the same attributes.  [...]

Exactly.

> I assume I should have a configuration file for creating metrics
> from attributes so users can choose to map them both to
> "Unknown_Attribute_16" or perhaps have model-specific attributes
> "Unknown_Attribute_16_WD..." and "Unknown_Attribute_16_HGST...".
> Does this sound reasonable?

IMHO we should do whatever we can to avoid having to have a
configuration file, and instead have the pmda do a Sensible Thing
automatically if at all possible.  In this case, for example we could
have

   smartd["device"].attribute.number_1{,.max,.threshold,.etc.?}
   ...
   smartd["device"].attribute.number_255
   smartd["device"].health

for low-level portable access, and

   smartd["device"].attribute.seek_error_rate

for general ones, and per-device specialized ones

   smartd["device"].attribute.wd_power_off_retract_count
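
As a rough sketch of how such names could be derived automatically,
the hypothetical helper below turns an attribute number, an optional
smartctl-style attribute name, and an optional vendor prefix into
metric names following the scheme above.  The function name and the
slug rules are assumptions for illustration; the device would be
carried by the instance domain rather than by the metric name.

```python
import re


def metric_names(attr_id, attr_name=None, vendor=None):
    """Return the metric names under which one SMART attribute is exported."""
    # The numbered name always exists, even for attributes nobody has named.
    names = ["smartd.attribute.number_%d" % attr_id]
    if attr_name:
        # Lower-case the name and collapse anything non-alphanumeric to "_".
        slug = re.sub(r"[^a-z0-9]+", "_", attr_name.lower()).strip("_")
        if vendor:
            slug = vendor.lower() + "_" + slug   # per-device specialized name
        names.append("smartd.attribute." + slug)
    return names
```

For example, metric_names(7, "Seek_Error_Rate") yields the numbered
name plus smartd.attribute.seek_error_rate, while passing vendor="WD"
with "Power-Off_Retract_Count" yields the wd_power_off_retract_count
form.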


Looking closer at how smartctl does it, they reference a
centrally-distributed header file to compute the equivalent of the
latter.  See [man update-smart-drivedb] and
/usr/share/smartmontools/drivedb.h - the new pmda could use that same
header file.  (If the pmda were written in C, the header could be
compiled-in; if it were python it could parse it.)  So maybe a
configuration file is not that bad - especially if we can offload it
to another package instead of to a pcp sysadmin.
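
A sketch of the Python side, under a stated assumption: that the
per-drive presets in drivedb.h are strings of smartctl "-v" options in
the form "-v ID,FORMAT[,NAME]".  I have not checked this against a
particular drivedb.h; a real parser would also have to locate those
strings inside the C initializers first.  parse_presets is a
hypothetical helper name.

```python
import re

# One "-v ID,FORMAT[,NAME]" option; NAME is optional.
_V_OPTION = re.compile(r"-v\s+(\d+),([^,\s]+)(?:,(\S+))?")


def parse_presets(presets):
    """Map attribute id -> (format, name or None) for one presets string."""
    result = {}
    for attr_id, fmt, name in _V_OPTION.findall(presets):
        result[int(attr_id)] = (fmt, name or None)
    return result
```

Given "-v 9,minutes -v 192,raw48,Power-Off_Retract_Count", this returns
an entry for attribute 9 with no name override and one for attribute
192 carrying the vendor-specific name, which is the information needed
to build the per-device metric names.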


> - In pmdasimple.python, simple_fetch_times_callback() for example includes
> this code:
>             return [valuep.contents.value, 1]
>         return [c_api.PM_ERR_PMID, 0]
>
>   I assume the second element in the array - 0 or 1 in these examples -
> corresponds to [PMDA_FETCH_*] definitions from pmda.h?
> [...]
>   If so, it'd be nice if pmda.py defined those constants itself (or possibly
> they could be extracted using something like SWIG but I have never tried using
> that myself), as I struggled to work this out.

Yeah - they're already in at least one dictionary in the
src/python/pmda.c binding; we're just not using it.


> - It would be nice if there was a sequence diagram (generated using e.g.
> http://www.mcternan.me.uk/mscgen/ ) showing how PDUs being sent to the PMDA get
> translated into various calls, and what order they are in.  I think I know how
> this works but I'm not totally sure yet!

FWIW, I've used systemtap in the past to trace dynamic call graphs
related to pmda/pdu processing.


- FChE
