Re: [pcp] nvidia/nvml pmda

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: [pcp] nvidia/nvml pmda
From: Martins Innus <minnus@xxxxxxxxxxx>
Date: Tue, 01 Jul 2014 13:38:22 -0400
Cc: PCP <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <1626126016.1041444.1404199831629.JavaMail.zimbra@xxxxxxxxxx>
References: <53A995C8.5020904@xxxxxxxxxxx> <1015758147.33977940.1403768176429.JavaMail.zimbra@xxxxxxxxxx> <1626126016.1041444.1404199831629.JavaMail.zimbra@xxxxxxxxxx>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
Nathan,

On 7/1/14, 3:30 AM, Nathan Scott wrote:
> I added the error checking, long option support, some QA testing, and
> the dso independence discussed earlier.  Oh - also, on a closer look,
> I added a fetch PDU handler to accompany your existing fetch callback.
> This means when we get a fetch request, we only refresh the values at
> the start (once), not once for each metric/instance pair.  This is a
> small efficiency improvement in how we interact with the cards; the
> functionality is exactly the same as before.
Wow! Looks great.
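
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the refresh-once-per-fetch idea described in the quoted paragraph. It does not use the PCP headers and is not the code from nvidia.c; the names `refresh_cards`, `fetch_handler`, `fetch_callback` and the sample values are hypothetical. In a real PMDA the two hooks would typically be registered through libpcp_pmda's `pmdaSetFetch` and `pmdaSetFetchCallBack`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-card state read from the driver. */
typedef struct {
    int temperature;
    int fanspeed;
} card_stats_t;

#define NUM_CARDS 2
static card_stats_t cards[NUM_CARDS];
static int refresh_count;   /* how many times we touched the "hardware" */

/* Expensive step: query every card once and cache the results. */
static void refresh_cards(void)
{
    for (int i = 0; i < NUM_CARDS; i++) {
        cards[i].temperature = 40 + i;  /* made-up values */
        cards[i].fanspeed = 30 + i;
    }
    refresh_count++;
}

/* Per-value callback: reads only from the cache, never the hardware. */
static int fetch_callback(int item, int inst, int *value)
{
    assert(value != NULL);
    if (inst < 0 || inst >= NUM_CARDS)
        return -1;
    switch (item) {
    case 0: *value = cards[inst].temperature; return 0;
    case 1: *value = cards[inst].fanspeed;    return 0;
    default: return -1;
    }
}

/* Per-request handler: refresh once, then run the callback per pair. */
static int fetch_handler(const int *items, int nitems, int *out)
{
    int n = 0;

    refresh_cards();
    for (int m = 0; m < nitems; m++)
        for (int inst = 0; inst < NUM_CARDS; inst++)
            if (fetch_callback(items[m], inst, &out[n]) == 0)
                n++;
    return n;
}
```

Calling `fetch_handler` for two metrics on two cards yields four values while `refresh_cards` runs only once; without the handler, the refresh would sit inside the callback and run four times.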

> Not essential, but it'd also be nice to have a man page as well, if you
> wouldn't mind writing one?  I didn't tackle that side of things, though
> I did update the help text a bit based on my readings of the API docs
> (please review?  that, and all the code changes :) - heh, thanks!).
OK, attached.

> I've tested this only on "faked-out" hardware - see qa/744 - and not
> using the real nvidia libraries (I have neither).  Any chance you could
> hack on those aspects before the next release Martins?  (after the IB
> updates you mentioned privately would be perfect, if you can).  We aim
> for a point release in the middle of each month, so there's a couple of
> weeks until then.  Especially the "using real hardware" side of testing
> would be really good, since I can't do that here.  :)
OK, included are new files with changes to make this work on real hardware: a typo fix in the dlopen call, and a change to the error handling. I'm not entirely happy with my error-handling changes, but I think it's a step toward getting it working. Cards may not necessarily support all the possible metrics, so the PMDA shouldn't fail to return any metrics just because some of them are unsupported. For instance, we have older Tesla GPU cards that don't return anything for the temp and fan metrics, but everything else is OK.
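
As an illustration of the error handling described here, the following is a minimal sketch in which each metric is queried and validated on its own, so an unsupported metric only hides that one value. It is not the code from the attached nvidia.c; the status codes, metric names and `query_metric` stand-in are all hypothetical and only model a card that lacks temperature and fan readings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes, standing in for the vendor library's. */
typedef enum { GPU_OK = 0, GPU_NOT_SUPPORTED, GPU_FAILED } gpu_status_t;

/* Hypothetical metric identifiers. */
enum { M_TEMP, M_FAN, M_MEMUSED, M_COUNT };

typedef struct {
    int value[M_COUNT];
    int valid[M_COUNT];     /* 1 if value[] holds a usable reading */
} gpu_sample_t;

/* Stand-in for an older card: no temperature or fan readings. */
static gpu_status_t query_metric(int metric, int *value)
{
    switch (metric) {
    case M_MEMUSED: *value = 1024; return GPU_OK;   /* made-up value */
    case M_TEMP:
    case M_FAN:     return GPU_NOT_SUPPORTED;
    default:        return GPU_FAILED;
    }
}

/* Query every metric independently; one failure does not stop the rest.
 * Returns the number of metrics that produced a value. */
static int refresh_sample(gpu_sample_t *s)
{
    int nvalid = 0;

    assert(s != NULL);
    for (int m = 0; m < M_COUNT; m++) {
        s->value[m] = 0;
        s->valid[m] = (query_metric(m, &s->value[m]) == GPU_OK);
        nvalid += s->valid[m];
    }
    return nvalid;
}

/* Returns 1 and fills *value, or 0 if this metric has no value. */
static int fetch_metric(const gpu_sample_t *s, int metric, int *value)
{
    if (metric < 0 || metric >= M_COUNT || !s->valid[metric])
        return 0;
    *value = s->value[metric];
    return 1;
}
```

With this shape, a fetch for the temperature or fan metric reports "no value" on the older card, while the memory metric is still returned normally.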

> The other aspect I could look at is the spec file - you sent through an
> initial version, which would provide a separate pcp-pmda-nvidia package.
> However with the way the code is now, we could include this PMDA in the
> main pcp package if you'd prefer that?  (I would, it makes life easier
> for users - this can be done since the PMDA now functions gracefully
> with and without the nvidia library; even if the library appears while
> it's already running, it should handle that nicely too.)
Yeah, I think including it in the main package would be preferable. I just sent along our existing spec file for completeness.
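
To make the "works with and without the library" behaviour concrete, here is a small sketch of lazy loading with retry: the library is looked up on each refresh until it is found, and the handle is kept afterwards. This is an assumption about the general technique, not the code in localnvml.c. The loader is passed in as a function pointer so the example stays self-contained; in practice it would be a thin wrapper around `dlopen`. The names `lazy_lib_t`, `lazy_lib_ready` and `demo_loader` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Loader hook; in a real program this would wrap dlopen(). */
typedef void *(*loader_fn)(const char *name);

typedef struct {
    const char *libname;
    loader_fn   load;
    void       *handle;     /* NULL until the library has been found */
    int         attempts;   /* number of load attempts, for illustration */
} lazy_lib_t;

/* Safe to call on every refresh: retries while the library is missing,
 * and does nothing once a handle has been obtained. */
static int lazy_lib_ready(lazy_lib_t *lib)
{
    assert(lib != NULL && lib->load != NULL);
    if (lib->handle == NULL) {
        lib->attempts++;
        lib->handle = lib->load(lib->libname);
    }
    return lib->handle != NULL;
}

/* Demo loader: pretends the library is absent until the flag is set. */
static int library_installed;
static int dummy_handle;

static void *demo_loader(const char *name)
{
    (void)name;
    return library_installed ? (void *)&dummy_handle : NULL;
}
```

While `lazy_lib_ready` returns 0 the caller simply reports no cards and no values; once the library shows up, the next refresh picks it up without a restart.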

> The only change we would need would be a one-line spec update for your
> existing installations, to ensure existing pcp-pmda-nvidia RPMs you have
> are replaced by pcp-3.9.7 or later (I'll go and hack on that when I hear
> back re your packaging preference).

Sounds good, thanks!

Martins

Attachment: localnvml.c
Description: Text document

Attachment: nvidia.c
Description: Text document

Attachment: pmdanvidia.1
Description: Text document
