Re: [pcp] pmcd core dumping

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: [pcp] pmcd core dumping
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Tue, 13 Aug 2013 20:01:25 -0400 (EDT)
Cc: PCP Mailing List <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <520AC01F.7080201@xxxxxxxxxxxxxxxx>
References: <520AC01F.7080201@xxxxxxxxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: smwq4hG/sm72yDJBzGPMbF3rdbWxOw==
Thread-topic: pmcd core dumping

----- Original Message -----
> Here it is.
> 
> 1. reproducible for me in qa/323 but only if some other tests are run first,
> e.g. a full run and I've made it happen again with check 101-200 323

Good stuff.

> 2. traceback
> 
> Procedure call traceback ...
>   0x7fe42c8924a0 [/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fe42c8924a0]]
>   0x7fe42ce6a904 [/usr/lib/libpcp.so.3(__pmSockAddrGetFamily+0x4)
>   [0x7fe42ce6a904]]
>   0x7fe42ce55dae [/usr/lib/libpcp.so.3(__pmSockAddrIsLoopBack+0x1e)
>   [0x7fe42ce55dae]]
>   0x7fe42ce6287f [/usr/lib/libpcp.so.3(+0x4387f) [0x7fe42ce6287f]]
>   0x7fe42ce644f8 [/usr/lib/libpcp.so.3(__pmAccAddClient+0x48)
>   [0x7fe42ce644f8]]
>   0x7fe42d2b6104 [/usr/lib/pcp/bin/pmcd(ParseRestartAgents+0x884)
>   [0x7fe42d2b6104]]
>   0x7fe42d2b0494 [/usr/lib/pcp/bin/pmcd(SignalRestart+0x74) [0x7fe42d2b0494]]
>   0x7fe42d2b116f [/usr/lib/pcp/bin/pmcd(+0x716f) [0x7fe42d2b116f]]
>   0x7fe42d2af9bb [/usr/lib/pcp/bin/pmcd(main+0x5bb) [0x7fe42d2af9bb]]
>   0x7fe42c87d76d [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
>   [0x7fe42c87d76d]]
>   0x7fe42d2afb99 [/usr/lib/pcp/bin/pmcd(+0x5b99) [0x7fe42d2afb99]]
> 
> Dumping to core ...
> 

So ... we're in the delayed responding-to-SIGHUP code in the main loop, we've
gone through all of the config re-parsing, agent terminating, restarting, and
general fiddling about, and have arrived at re-evaluating client access.  We
do the __pmAccAddClient call on every(*) client in the client array and we're
peeking inside the sockaddr pointer.  That recently became a pointer to an aux
data structure; IIRC it used to be inline in the client structure (iow no
SIGSEGV used to be likely, even from a dodgy array entry access).  If we
continue to follow the call path it would appear that __pmSockAddrIsLoopBack
and its call to __pmSockAddrGetFamily are the very first places where we
attempt to crack open the now-not-inline sock-addr-pointer, and a fine
explosion ensues.
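
To make the inline-versus-pointer difference concrete, here is a rough sketch.
It is not the libpcp code: the struct and function names (old_client,
new_client, old_family, new_family) are made up for illustration, and the
-1 return for "no address" is my own convention.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical "old" layout: the address is stored inline in the slot. */
struct old_client {
    int connected;
    struct sockaddr addr;
};

/* Hypothetical "new" layout: the address lives in separate storage. */
struct new_client {
    int connected;
    struct sockaddr *addr;      /* NULL when the slot holds no live client */
};

/* Inline storage: a stale slot may hold junk, but the memory is always valid. */
static int old_family(const struct old_client *cp)
{
    return cp->addr.sa_family;
}

/*
 * Pointer storage: dereferencing a NULL addr is what produces the SIGSEGV,
 * so this illustrative accessor checks first and returns -1 instead.
 */
static int new_family(const struct new_client *cp)
{
    if (cp->addr == NULL)
        return -1;
    return cp->addr->sa_family;
}
```

With the old layout a zeroed, unused slot just reports family 0; with the new
layout the same slot has nothing to point at, so an unguarded read faults.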

(*) Every client?  That doesn't look right - does the patch below help at all?
Theory: some client has been disconnected, at some point earlier, and its slot
in the client array has not yet been reclaimed.  As a result, its addr pointer
is still null, and the CheckClientAccess call will fall over such an entry.
[ There are several opportunities for race conditions here, depending on order
of client disconnection, timing of arrival of SIGHUP, etc... seems promising ]

diff --git a/src/pmcd/src/config.c b/src/pmcd/src/config.c
index c666889..472db7f 100644
--- a/src/pmcd/src/config.c
+++ b/src/pmcd/src/config.c
@@ -2511,6 +2511,8 @@ ParseRestartAgents(char *fileName)
     for (i = 0; i < nClients; i++) {
        ClientInfo      *cp = &client[i];
 
+       if (cp->status.connected == 0)
+           continue;
        if ((sts = CheckClientAccess(cp)) >= 0)
            sts = CheckAccountAccess(cp);
        if (sts < 0) {
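
For anyone wanting to poke at the idea outside pmcd, the same guard can be
modelled standalone.  This is only a sketch under assumptions: ClientInfo here
is a cut-down stand-in with a string in place of the socket address, and
check_access and recheck_clients are hypothetical names, not pmcd functions.

```c
#include <stddef.h>

/* Hypothetical cut-down stand-in for pmcd's ClientInfo. */
typedef struct {
    struct { unsigned int connected : 1; } status;
    const char *addr;           /* stand-in for the socket address pointer */
} ClientInfo;

/*
 * Hypothetical access check.  Like the real call path it dereferences the
 * address, so it must never be handed a slot whose addr is NULL.
 * Addresses starting with 'l' (think "loopback") are allowed.
 */
static int check_access(const ClientInfo *cp)
{
    return cp->addr[0] == 'l' ? 0 : -1;
}

/*
 * Re-check only the connected clients, skipping unused slots.
 * Returns the number of clients checked; *denied counts the failures.
 */
static int recheck_clients(const ClientInfo *client, int nClients, int *denied)
{
    int i, checked = 0;

    *denied = 0;
    for (i = 0; i < nClients; i++) {
        const ClientInfo *cp = &client[i];

        if (cp->status.connected == 0)
            continue;           /* free slot: addr may be NULL */
        checked++;
        if (check_access(cp) < 0)
            (*denied)++;
    }
    return checked;
}
```

Fed an array with a disconnected slot in the middle, the loop steps over it
rather than dereferencing its NULL address.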


cheers.

--
Nathan
