Re: [pcp] QA Status

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: [pcp] QA Status
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Tue, 19 Jul 2016 02:44:00 -0400 (EDT)
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <578D6698.2020606@xxxxxxxxxxxxxxxx>
References: <578D6698.2020606@xxxxxxxxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: zcay23fIejnJZUnLzMaH+pqZVtn61A==
Thread-topic: QA Status

Hi Ken,

----- Original Message -----
> Things are looking better, but not yet back to where they were 6 months ago.
> 
> 1108 is a mystery ... we get 2 primary pmloggers started from pmlogger_check
> (this is not supposed to happen, ever!).  The failure is non-deterministic.
> I've been unable to track it down ... most likely it will be a race
> triggered by some earlier QA test (could be a long time before 1108 I think)
> and no one else notices until 1108 stumbles along.

The only clue I've come across so far is that the second logger always seems
to be started 6 minutes after the first.  The search continues, though.
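
For anyone who wants to poke at this, something along these lines (just a
sketch - it assumes the primary logger is the one started with -P, and that
pgrep is available) should show when we've ended up with two:

    # count pmlogger processes claiming to be primary (-P on the
    # command line); more than one means pmlogger_check has misfired
    nprimary=`pgrep -f 'pmlogger.*-P' | wc -l`
    if [ "$nprimary" -gt 1 ]
    then
        echo "Botch: $nprimary primary pmloggers running"
        pgrep -af 'pmlogger.*-P'
    fi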

> 361 has gone a bit under the radar ... it is not passing _anywhere_: it is
> either not run or skipped (-) in the (new) full report on all but 4 hosts,
> and it fails on the 4 hosts on which it is run.  Note %fail is the
> percentage of all hosts, not just the percentage of hosts on which the test
> was run, which is why %fail for 361 is 11% and not 100%.

Fixed now.

> Apart from that, there are odd failures all over the landscape which make it
> very hard to progress any of this in a dramatic fashion ... if you really
> care about any of the failing tests below, I'd appreciate any assistance you
> could offer to smack 'em into submission.

381 is possibly due to pmlogger now being more resilient to pmcd and/or pmda
restarts ... but in that case I'd have expected to see the same failure
signature everywhere?

That 581 failure we've talked about before too, I think - it seems to be
sensitive to the number of open fds in pmcd, and I wonder if it is related to
that timeout change from a while back where we open multiple connections at
once?  I think the right fix is for the test to accept a range of fds, on the
order of 12-20 or so (the maximum number of fds observable depends on network
config, if that theory is correct).
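
i.e. rather than matching an exact count, the test could filter the observed
count into the expected output - a rough sketch only, assuming /proc is
available and pmcd's pid is in $PCP_RUN_DIR/pmcd.pid:

    # hypothetical filter: accept any fd count in the 12-20 range
    # rather than an exact number, since the maximum observable
    # depends on network configuration
    pmcd_pid=`cat $PCP_RUN_DIR/pmcd.pid`
    nfds=`ls /proc/$pmcd_pid/fd | wc -l`
    if [ "$nfds" -ge 12 -a "$nfds" -le 20 ]
    then
        echo "pmcd fd count in expected range"
    else
        echo "unexpected pmcd fd count: $nfds"
    fi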

823 I'm certain is also a _notrun candidate - some versions of SASL seem buggy,
and a newly created user becomes oddly invisible.  It may be worth collecting
"pmconfig -L sasl_version" output from the failing machines and looking for a
pattern that could be squashed by _notrun?  It certainly passes reliably for me
on recent SASL library versions anyway.
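
If a pattern does emerge, the guard near the top of the test could be as
simple as this (the version pattern below is only a placeholder until we see
the output from the failing machines, and I'm assuming pmconfig reports
name=value pairs):

    # hypothetical guard: skip the test on SASL libraries that lose
    # newly created users; the version pattern is a placeholder
    sasl_version=`pmconfig -L sasl_version 2>/dev/null | sed -e 's/^.*=//'`
    case "$sasl_version" in
        2.1.2[0-5]*)
            _notrun "SASL library $sasl_version mishandles new users"
            ;;
    esac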

cheers.

--
Nathan
