pcp


To: Marko Myllynen <myllynen@xxxxxxxxxx>
Subject: Re: Oracle connection debugging (was Re: [pcp] Handling Oracle PMDA Latencies)
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Wed, 20 Apr 2016 23:21:37 -0400 (EDT)
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <57175FC8.2000600@xxxxxxxxxx>
References: <56F25541.9020602@xxxxxxxxxx> <570D1333.2040109@xxxxxxxxxx> <899654573.39808794.1460523158800.JavaMail.zimbra@xxxxxxxxxx> <570F511E.5000605@xxxxxxxxxx> <1512930308.40394593.1460673441009.JavaMail.zimbra@xxxxxxxxxx> <57108708.3080906@xxxxxxxxxx> <571092DF.8050409@xxxxxxxxxx> <57175FC8.2000600@xxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: QeYYWUl4ojbZDz1QqgTQJNBcYsDmHg==
Thread-topic: Oracle connection debugging (was Re: [pcp] Handling Oracle PMDA Latencies)
Hi Marko,

----- Original Message -----
> On 2016-04-15 10:06, Marko Myllynen wrote:
> > On 2016-04-15 09:15, Marko Myllynen wrote:
> > [...]
> > To follow-up our IRC discussion:
> > 
> >> And finally this:
> >>
> >> [Fri Apr 15 09:08:48] pmdaoracle(125624) Error: pmdaFetch: Unavailable
> >> metric PMID 32.12.4[1]
> >> [Fri Apr 15 09:08:48] pmdaoracle(125624) Error: pmdaFetch: Unavailable
> >> metric PMID 32.12.4[3]
> >> [Fri Apr 15 09:08:48] pmdaoracle(125624) Error: pmdaFetch: Unavailable
> >> metric PMID 32.12.4[7]

Cluster 12 is v$librarycache, but I think these messages are lesser issues,
possibly unrelated to the fetch timeout.
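To map the PMIDs in these log messages (domain.cluster.item[instance]) back to metric names, pminfo's -m option prints each metric's PMID next to its name. Below is a minimal sketch of a filter over that output. The pmid_names helper is hypothetical, and it assumes pminfo -m lines of the form "name PMID: domain.cluster.item".

```shell
#!/bin/sh
# pmid_names ARG: read "pminfo -m" output on stdin and print the names of
# metrics whose PMID equals ARG (e.g. 32.12.4) or, when ARG is a
# domain.cluster pair (e.g. 32.12), every metric in that cluster.
# Hypothetical helper; assumes lines like "some.metric.name PMID: 32.12.4".
pmid_names() {
    awk -v want="$1" '
        $2 == "PMID:" && ($3 == want || index($3, want ".") == 1) {
            print $1
        }'
}

# Usage (requires PCP's pminfo on PATH):
#   pminfo -m oracle | pmid_names 32.12
```

Matching on "want." rather than a bare prefix keeps 32.1 from also selecting cluster 12.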

> >> [Fri Apr 15 09:08:48] pmdaoracle(125624) Error: pmdaFetch: Unavailable
> >> metric PMID 32.0.73[0]
> >> [Fri Apr 15 09:08:48] pmdaoracle(125624) Error: pmdaFetch: Unavailable
> >> metric PMID 32.0.79[0]

And those are misc. missing v$sysstat metrics (cluster 0) - also probably
benign at this stage.
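With a steady flow of these errors, a per-cluster tally can show which clusters dominate. This is a minimal sketch that assumes each message occupies a single line in the PMDA's log (the quoting above wraps them) and that the log lives at the usual $PCP_LOG_DIR/pmcd/oracle.log. The count_unavailable name is hypothetical.

```shell
#!/bin/sh
# count_unavailable: read pmdaoracle log text on stdin and print how many
# "Unavailable metric PMID d.c.i[inst]" errors were logged per
# domain.cluster, most frequent first.
# Assumes each error message sits on a single log line.
count_unavailable() {
    awk '/Unavailable metric PMID/ {
        for (i = 1; i < NF; i++)
            if ($i == "PMID") {
                split($(i + 1), part, ".")   # "32.12.4[1]" -> 32, 12, 4[1]
                count[part[1] "." part[2]]++
            }
    }
    END { for (c in count) print count[c], c }' | sort -rn
}

# Usage (log path is an assumption; adjust for your install):
#   count_unavailable < "$PCP_LOG_DIR/pmcd/oracle.log"
```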

> > After "service pmcd restart" (I'm on RHEL 6.8 Beta so HUPing pmcd is not
> > an option) I see some metrics being available, however there's steady
> > flow of the above kind of errors printed in the log (with occasional
> > errors from the line 430 - DBI->connect()).

I was suggesting trying to pinpoint specific problem clusters. Each cluster
has its own SQL statements associated with it, and it's likely that one of
them is problematic in your Oracle version/setup - e.g. the Intel folk found
v$filestat to have occasional extreme (multiple-minute) latencies, depending
on various factors.

To do that, you could use a scripted probe like the one below. If it stalls,
the last cluster reported as "Probing" (without a matching "Checked") is
likely the problematic one. This assumes there's no pathological issue
affecting all SQL statements there; historically, that's less likely to be
the case.

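# List every oracle.* metric name, cut each down to its first two name
# components (oracle.<group>), de-duplicate, then fetch one group at a time.
# pminfo -v fetches values but reports only errors.
pminfo oracle | \
    awk -F. '{ printf "%s.%s\n", $1, $2 }' | \
    sort -u | \
    while read -r cluster
    do
        echo "Probing $cluster"
        pminfo -v "$cluster"
        echo "Checked $cluster"
    done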
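If the probe sits on one cluster for many minutes, a bounded variant can report the stall and move on. This sketch assumes GNU coreutils timeout(1), which exits with status 124 when the limit is hit. probe_clusters and the 60 second limit in the usage line are hypothetical choices, not part of PCP.

```shell
#!/bin/sh
# probe_clusters LIMIT COMMAND [ARGS...]
# Reads metric group names on stdin and runs "COMMAND ARGS... <name>" for
# each under timeout(1), allowing at most LIMIT seconds per group.
# The command's own output goes to stderr; one status line per group is
# printed on stdout.
probe_clusters() {
    limit=$1
    shift
    while read -r cluster; do
        start=$(date +%s)
        # </dev/null stops the command from consuming the list on stdin
        timeout "$limit" "$@" "$cluster" </dev/null >&2
        rc=$?
        elapsed=$(( $(date +%s) - start ))
        case $rc in
            0)   echo "$cluster OK (${elapsed}s)" ;;
            124) echo "$cluster TIMEOUT after ${limit}s" ;;
            *)   echo "$cluster FAILED (exit $rc, ${elapsed}s)" ;;
        esac
    done
}

# Example (requires PCP's pminfo on PATH):
#   pminfo oracle | awk -F. '{ printf "%s.%s\n", $1, $2 }' | sort -u |
#       probe_clusters 60 pminfo -v
```

Unlike the original loop, a slow group here costs at most LIMIT seconds, so a single run can flag more than one problem cluster.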

If you find a specific cluster, check its SQL in pmdaoracle.pl and see if
the query is long-running when invoked from sqlplus too.
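One way to get a timing out of sqlplus is SQL*Plus's SET TIMING ON, which prints the elapsed time after each statement. This is a minimal sketch. timed_query_script and ORACLE_CONN are hypothetical names, and the v$filestat query in the usage line is only a placeholder for the actual SQL copied from pmdaoracle.pl.

```shell
#!/bin/sh
# timed_query_script STATEMENT: print a small SQL*Plus script that turns on
# timing, runs STATEMENT once and exits. A trailing ";" on STATEMENT is
# stripped and re-added so it is not doubled.
timed_query_script() {
    stmt=${1%;}
    printf '%s\n' 'set timing on' "$stmt;" 'exit'
}

# Usage (ORACLE_CONN is a hypothetical user/password@service string; keep
# the query in single quotes so the shell does not expand "$filestat"):
#   timed_query_script 'select count(*) from v$filestat' |
#       sqlplus -s "$ORACLE_CONN"
```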


> FWIW, the test lab I used is now being reinstalled so I can test again
> next week or so if there's any ideas what to test next. But I think I
> should mention that the Oracle instances in the lab are not that big
> (few hundred GBs), outside of these test labs we'd be talking about DB
> instances of several TBs. (It's unclear to me at this point does / how
> much the size of the DB affects to the Oracle PMDA.)

In theory, not at all, and I've seen a number of quite large Oracle installs
where the PMDA functions perfectly. At those sizes (if using many db files)
you might run into the v$filestat query problem too, though (if so, please
let Oracle know - there's nothing we'll be able to do about that one,
unfortunately).

cheers.

--
Nathan
