Re: [pcp] Handling Oracle PMDA Latencies

To: Marko Myllynen <myllynen@xxxxxxxxxx>
Subject: Re: [pcp] Handling Oracle PMDA Latencies
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Wed, 23 Mar 2016 21:06:10 -0400 (EDT)
Cc: pcp developers <pcp@xxxxxxxxxxx>

Heya Marko,

----- Original Message -----
> Hi Nathan,
> 
> I might have a chance to test the Oracle PMDA in the near future, then I
> can probably do some real hands-on experiments in a realistic
> environment but one question I already have is related to latencies: the
> Oracle node is often heavily loaded and sometimes when things don't work
> perfectly load might be extreme, there might be swapping, and so forth.

Great!  Yes, Nandhita and Doug from Intel, myself, and Red Hat performance
team folk have had a lot of fun getting pmdaoracle metrics under some quite
full-on benchmark conditions - with good success.

> So requests for Oracle performance stats can take several seconds.

OK - let's observe that happening first, and then analyze any underlying
problem with the domain/PMDA if/when it is actually observed to fail.

> How does the Oracle PMDA cope with this so that PMCD won't kill it?

It's a bit of a myth that pmcd "kills PMDAs".  PMDAs may exit(2) of their own
free will once they realize pmcd is not listening to them anymore (after they
did not respond in a timely fashion), but pmcd does not actively "kill" them.
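
To make that distinction concrete, here is a toy sketch in Python.  It is not
pmcd or PMDA code: the function names, messages and timings are all made up,
and "pmcd stops listening" is modelled (as an assumption) by the daemon side
closing its end of a socket pair.  The daemon never signals the agent; the
agent notices the closed connection and returns by itself.

```python
import socket
import threading
import time


def agent(conn, slow_for, log):
    """Hypothetical agent loop: answer requests until the peer hangs up."""
    try:
        while True:
            request = conn.recv(64)
            if not request:        # EOF: the daemon closed its end
                break
            time.sleep(slow_for)   # pretend the backend query is slow
            conn.sendall(b"value")
    except OSError:
        pass                       # write or read on a closed connection
    log.append("agent: peer gone, exiting on my own")


def daemon(conn, timeout, log):
    """Hypothetical daemon side: one request, bounded wait, then hang up."""
    conn.settimeout(timeout)
    conn.sendall(b"fetch")
    try:
        conn.recv(64)
        log.append("daemon: got a reply in time")
    except socket.timeout:
        log.append("daemon: timed out, closing connection")
    finally:
        conn.close()               # stop listening; no signal is ever sent


def run(slow_for, timeout):
    """Run one daemon/agent exchange and return the ordered event log."""
    log = []
    daemon_end, agent_end = socket.socketpair()
    worker = threading.Thread(target=agent, args=(agent_end, slow_for, log))
    worker.start()
    daemon(daemon_end, timeout, log)
    worker.join()
    agent_end.close()
    return log


if __name__ == "__main__":
    for line in run(slow_for=0.3, timeout=0.05):
        print(line)
```

In both the slow and the fast case the agent thread ends because it sees the
closed connection, not because anything terminated it.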

There are many possible root causes for this domain instability.  We need to
do root cause analysis and understand the issues properly to know how best to
proceed in each case.

It's not helpful to paper over this kind of problem with long timeouts or "use
more threads" or add code that returns PM_ERR_SORRY_I_CANT_HELP_YOU_RIGHT_NOW
for the duration of the problem.  What people need is actual metric values and
especially so at those difficult times.

For example, the Intel folk found a v$filestat query that could block for many
*minutes* with certain, erm, extreme database configurations.  This turned out
to be an issue in Oracle itself, and not anything to do with machine load.

As another example, see the pmchart window at time offset 8:45 in this video:
https://www.youtube.com/watch?v=zrAjevr8_Ds

... that CPU utilization view shows a 24-CPU system with 23/24 CPUs spinning
on a VM spinlock, with the kernel unable to allocate memory, and all going on
for multiple minutes.  This is a whole new level of "extreme" but pmcd and all
of the PMDAs there continued to provide timely kernel and application metrics
for the duration of that production system meltdown.  This is not just by good
fortune.

FWIW, Parfait was in use on that system - some of the pmchart plots there are
from Parfait & MMV metrics.  Some of us have the scars from "battle proving"
that code.  ;)

cheers.

--
Nathan
