

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: Oracle connection debugging (was Re: [pcp] Handling Oracle PMDA Latencies)
From: Marko Myllynen <myllynen@xxxxxxxxxx>
Date: Fri, 20 May 2016 12:56:08 +0300
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <626822210.48972762.1463726815586.JavaMail.zimbra@xxxxxxxxxx>
Organization: Red Hat
References: <56F25541.9020602@xxxxxxxxxx> <57108708.3080906@xxxxxxxxxx> <571092DF.8050409@xxxxxxxxxx> <57175FC8.2000600@xxxxxxxxxx> <1558022602.42320984.1461208897951.JavaMail.zimbra@xxxxxxxxxx> <57395F04.2090909@xxxxxxxxxx> <1695396289.47966126.1463381940778.JavaMail.zimbra@xxxxxxxxxx> <573D897A.5070804@xxxxxxxxxx> <626822210.48972762.1463726815586.JavaMail.zimbra@xxxxxxxxxx>
Reply-to: Marko Myllynen <myllynen@xxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0
Hi,

On 2016-05-20 09:46, Nathan Scott wrote:
> ----- Original Message -----
>> [...]
>>> I wonder if the best we can do here is something like:
>>> - disable these two clusters by default
>>> - add oracle.control metrics for each
>>> - add pmstore support to allow people to opt-in to these clusters.
>>
>> But if opting in to these means that the timeout is pretty much
>> guaranteed to be hit, I'm not sure what the point is then?
> 
> The point was to give you a working agent (with all the other metrics).
> 
> The agent is working fine for "everyone" else ... (although that's a small
> set at this stage, I suspect).

I think this is one of the larger setups tested so far. I mentioned that
e.g. for oracle.object_cache the select query returned over 220k rows
here; how many rows does it return on your test setup?

> The not-yet-understood root cause of this particular problematic platform
> or Oracle version combination is the reason we're contemplating these
> (quite horrible) workarounds.
> 
> For my system (Fedora 23, Oracle 12.1), the Red Hat perf folks system and
> the Intel folks who have been hacking on this PMDA too - we haven't seen
> these problems you are seeing, so we'd just continue on with everything
> enabled.  But it would give you (& anyone else who hits this) a way
> to get up and running with all the other Oracle metrics even with this
> problematic Oracle version (or platform, or host setup, or whatever it is).

It might just as well be the DB layout, size or load. In general the
setup as a whole here is very much focused on performance: it processes
hundreds of millions of events per hour (where an event is something
more complex than an individual counter) and database performance is not
the main worry.

> For your system, yep, it's guaranteed.  So unless we get to the bottom of
> it, and fix the real cause, we'll need a workaround - hence, the earlier
> suggestions.
> 
>>> It's not ideal but I don't think there's much else we're going to be able to
>>> do to improve things on our end of the connection, and this would stabilize
>>> things for you at least.  Thoughts?
>>
>> I checked with some local DB folks - they haven't used the object_cache
>> metrics anywhere, so for them those are in the nice-to-have category. But
>> the file metrics are important.
>>
>> The above timings are with an almost completely unloaded DB instance, so
>> I'm not sure what they would look like under extreme load; I wouldn't be
>> surprised if they were higher then. But that would be the time when the
>> metrics are needed the most to see what was going on.
>>
>> So we're back to the initial question of the thread: can we, for example,
>> adjust the 5 second timer for the Oracle PMDA to be more forgiving or
>> come up with some other approach here? It seems that we can't affect how
>> long it takes for Oracle to respond, and on the PMDA side the actual
>> select query seems to be as efficient as it can be.
> 
> Adjusting the timeout isn't great - that introduces other, nasty problems.
> What we'd need is a background thread that fetches these metrics on a timer
> and serves up cached values (but, that's also quite a horrible solution).
> 
> I'd really prefer to understand what it is about your system/setup that has
> this pathologically slow query behaviour & fix that instead of doing any of
> these workarounds TBH.  Could you try different hosts, operating systems &|
> Oracle versions?  (so we can try to isolate which might be causing it).

Unfortunately there is no chance of that. This is part of a real
enterprise setup where dozens of applications on several VMs use the
database, and setting everything up (even with most steps completely
automated) requires dedicated personnel and preparations (and reserving
the needed hardware, which is unlikely to be approved for this kind of
external development work).

How about trying the opposite: make your test system, where things seem
to behave better, more like the setup here? For example, create a large
enough database that the oracle.object_cache results are comparable and
see whether it causes any issues then.
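
For what it's worth, the background-thread idea quoted above (fetch on a
timer, serve up cached values) could look roughly like the Python sketch
below. It only illustrates the pattern and is not the Oracle PMDA's code:
CachedCluster, query and interval are made-up names, and query stands for
whatever callable runs the slow select.

```python
import threading
import time


class CachedCluster:
    """Refresh one slow metric cluster in the background; serve cached rows."""

    def __init__(self, query, interval):
        self._query = query          # hypothetical callable running the slow select
        self._interval = interval    # seconds to wait between refreshes
        self._lock = threading.Lock()
        self._rows = None            # last successful result, None until first refresh
        self._stamp = None           # time.monotonic() of the last successful refresh
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Waits for a query that is already in flight to finish.
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            try:
                rows = self._query()   # may take far longer than the fetch timeout
            except Exception:
                rows = None            # on failure keep serving the previous values
            if rows is not None:
                with self._lock:
                    self._rows = rows
                    self._stamp = time.monotonic()
            self._stop.wait(self._interval)

    def fetch(self):
        """Return (rows, age_in_seconds) without waiting on the database."""
        with self._lock:
            if self._stamp is None:
                return None, None
            return self._rows, time.monotonic() - self._stamp
```

A fetch would then always return right away with the last rows and their
age, so the 5 second limit no longer depends on how fast Oracle answers.
The price is that the values can be as stale as the slowest query plus the
refresh interval, which is part of why this is not a pretty solution.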

Thanks,

-- 
Marko Myllynen
