On Thu, 2009-02-26 at 15:17 -0600, Martin Hicks wrote:
I've pushed changes to the "master" branch in my git repository on oss.
> This is my first crack at issuing a stable 2.7.9 release.
>
Great!
> I'd still like to merge in the lustre PMDA and the cluster PMDA that
> are still on my "dev" branch.
>
> If I can get a quick review on the two other PMDAs I think I'll be
> ready to merge them into the dev branch and cherry pick them over
> to master.
I've looked over the Lustre PMDA - here's some notes and a patch
to cleanup some, but further work is in order here I think before
merging:
- lustrecomm/help needs some love - no text for any metrics yet?
- theres no indom, so that first help line can go [patched]
- the main() line is a bit dated now - should use __pmSetProgname,
the OS-independent path separator stuff (even though Linux-only,
someone else may copy from here someday), the DSO side of things
is missing a bit of code [patched]
- when installing on my laptop, SIGSEGV received from the PMDA
during "pminfo -v lustrecomm" - doesn't handle no-procfs files?
[patched]
- error handling should use the __pmNotify API for consistent and
timestamped messages [one example patched, more remain]
- last four metrics in metrictab listed as 64 bits, but when we
extract values via file_indexed() we pass in PM_TYPE_32.
- the fetch routine comment about PCP values not matching kernel
values is worrying - more details there? these sometimes need
to be matching. If size changes depending on 32/64 bit kernel
for counter metrics, then the agent needs to deal with that
(e.g. as pmdalinux does)
- "aux" in fetch routine is always 10 - eh?
- fetch_callback routine malloc has no error guard.
- file_indexed malloc has no error guard
- would be better if we had no mallocs on the fetch_callback code
path - see last paragraph of "design notes" below.
- file_indexed SIGSEGV on the strcpy if refresh_file succeeds but
f_s->datap is still NULL (observed on my laptop) [patched]
- filestatetab - only first entry used? (v4?)
- the fetch callback initial check for invalid PMIDs should be
done differently - pass back PM_ERR_PMID after each switch(item)
and in the final else branch at the end of going through the
recognised idp->cluster values.
- when no Lustre is installed, we should get PM_ERR_APPVERSION
instead of zero for the metric values. A few of the values
returned aren't zero, but look like uninitialised data too
(timeout, and the 64 bit ones). Lustre can probably be loaded
(and unloaded) as a kernel module, so this should get fixed.
- General design notes:
- pmns.v4 - should be just one pmns file, and metrics that dont
exist for earlier versions should return PM_ERR_APPVERSION,
once the v4 support is done.
- the type abstraction in file_indexed/single seems like extra
complexity for no reason? esp. since we only ever pass 32 in
here as the type... (and should be doing 64 too I think, but
no more).
- the use of sub-second file_time_offset to guard against fast
repeated lookups (I guess?) adds a fair bit of code ... is
this needed really? The kernel agent doesn't even do this.
It was the underlying cause of the SIGSEGV, ... so complexity
is reducing resilience there. Consider removing it?
- oh, wait - are you doing this because you don't want to read
the files multiple times per fetch? If so, other PMDAs handle
this differently, using an (initial) fetch routine separate to
the fetch_callback - do the read once in the initial fetch,
into a global struct holding correctly typed values, and then
just copy those values into the pmAtomValue in the callback.
That might simplify things alot here. I'd recommend using
the Linux kernel PMDA as a reference here - it uses lots of
procfs files of course and has the read-once-per-fetch trick
(see need_refresh[] in there).
cheers.
--
Nathan
|