
Re: [pcp] nvidia/nvml pmda

To: Martins Innus <minnus@xxxxxxxxxxx>
Subject: Re: [pcp] nvidia/nvml pmda
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Thu, 26 Jun 2014 03:36:16 -0400 (EDT)
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <53A995C8.5020904@xxxxxxxxxxx>
References: <53A995C8.5020904@xxxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: Jc6EhR7087fI/5ASr7p7uGeSfuliew==
Thread-topic: nvidia/nvml pmda
Hi Martins,

----- Original Message -----
> Hi,
>      Attached is a nvidia/nvml pmda for general review and suggestions.
> I still need to do some error checking on metrics that may not be
> available on all cards, but we have been using it for quite a while and
> it seems to be working fine.
> 
> I'd appreciate any general feedback.
> 
> I've also included the rpm spec file.  This pmda depends on both the
> nvidia driver being installed as well as the NVIDIA GPU deployment kit
> (which provides NVML).  As these are not available as RPMs, let me know
> if you prefer a different way of generating the spec file from what we
> have done with respect to specifying the dependencies.
> 

Hmmm, this one's a tricky case.  I can see one possible way to proceed (no
doubt, one of many) ... but where would you like this to go?  There are a
few files missing here (PMDA Install/Remove/help files, makefile etc), so
it's not clear what you'd like to do yet...

Here's one possible path, which would be one way we could include this
PMDA into the master PCP.  For the dependence on the kinda-oddball shared
library, we could use shared-library function replacement (LD_LIBRARY_PATH
kind of deal) to provide a no-op implementation of the nvidia library (as
there's not a huge number of function calls in the PMDA, this should be a
fairly straightforward undertaking).  This would allow a pcp-pmda-nvidia
package to be produced as part of the regular PCP builds, with a PMDA that
functions in the presence of no nvidia code.  If someone then chooses to
install the nvidia drivers and GPU kit, at PMDA ./Install time they could
specify the location of the "official" nvidia-supplied shared library,
which would then get swapped in to replace the no-op variant.  So instead
of reporting "No values available" for every metric, the PMDA could then
report the instances and metric values available via the nvml interfaces.
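To make the no-op idea concrete, here is a minimal sketch of what such a stub might look like, assuming the PMDA only needs a handful of NVML entry points.  Everything here is illustrative: the stub mirrors the nvml.h names nvmlReturn_t, nvmlInit, nvmlShutdown and nvmlDeviceGetCount, the numeric error values are believed to match nvml.h but should be checked against the real header, and pmda_gpu_count() is a hypothetical PMDA-side helper, not code from the attached PMDA.  A real drop-in would have to export exactly the symbols and types the PMDA links against.

```c
/*
 * Hypothetical no-op stand-in for the NVIDIA NVML library.
 * Assumption: names mirror nvml.h; the numeric values below are believed
 * to match it, but check them against the real header before relying on
 * binary compatibility.
 */
#include <stddef.h>

typedef enum {
    NVML_SUCCESS                 = 0,
    NVML_ERROR_UNINITIALIZED     = 1,
    NVML_ERROR_INVALID_ARGUMENT  = 2,
    NVML_ERROR_LIBRARY_NOT_FOUND = 12
} nvmlReturn_t;

/* Stub: initialisation always fails, signalling "no real library here". */
nvmlReturn_t nvmlInit(void)
{
    return NVML_ERROR_LIBRARY_NOT_FOUND;
}

/* Stub: nothing was ever initialised, so there is nothing to shut down. */
nvmlReturn_t nvmlShutdown(void)
{
    return NVML_ERROR_UNINITIALIZED;
}

/* Stub: reports zero devices and an uninitialised library. */
nvmlReturn_t nvmlDeviceGetCount(unsigned int *count)
{
    if (count == NULL)
        return NVML_ERROR_INVALID_ARGUMENT;
    *count = 0;
    return NVML_ERROR_UNINITIALIZED;
}

/*
 * Hypothetical PMDA-side helper: returns the number of GPUs, or -1 when
 * NVML could not be initialised (the case where the PMDA would answer
 * fetch requests with "No values available").
 */
int pmda_gpu_count(void)
{
    unsigned int count = 0;

    if (nvmlInit() != NVML_SUCCESS)
        return -1;
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS)
        count = 0;
    nvmlShutdown();
    return (int)count;
}
```

Built as a shared library under the same soname as the vendor library (libnvidia-ml.so.1 is assumed here), the PMDA would resolve to this stub by default, and the ./Install step could point the runtime loader at the nvidia-supplied copy instead, e.g. via LD_LIBRARY_PATH.  The PMDA code itself stays identical in both cases; only the library that gets loaded changes.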

Lemme know if you'd like this to become part of PCP (you might have meant
this just for general review, and not want it included? - that's fine too,
and up to you) - happy to work with you to make the above happen if it'll
suit your needs.  Or something else...?

(the code looks fine BTW - could use the newer pmdaGetOptions interface,
but the general approach is sound AFAICT).

cheers.

--
Nathan
