netdev
[Top] [All Lists]

Re: select says I can read, but recvfrom hangs

To: Andi Kleen <ak@xxxxxx>
Subject: Re: select says I can read, but recvfrom hangs
From: "D. Hugh Redelmeier" <hugh@xxxxxxxxxx>
Date: Fri, 22 Jun 2001 22:16:36 -0400 (EDT)
Cc: <netdev@xxxxxxxxxxx>, Marco Berizzi <pupilla@xxxxxxxxxxx>
In-reply-to: <Pine.LNX.4.33.0106221109310.8383-100000@redshift.mimosa.com>
Reply-to: <hugh@xxxxxxxxxx>
Sender: owner-netdev@xxxxxxxxxxx
| | From: Andi Kleen <ak@xxxxxx>
|
| | Is the affected machine a SMP box?

Marco says:  No, only one processor. P100 with F0 0F bug.

Marco also answered another question that I asked:

| From: Marco Berizzi <pupilla@xxxxxxxxxxx>
| > I have a question: does the hang occur every time you go through the
| > procedure you describe (I infer that from what you've said), or just
| > too often?
| >
|
| every time I go through the procedure


I also think it might be helpful if I showed some of the logging done
by Pluto.  This was captured by Marco (thanks!):

Remember that this machine (darkstar) is talking to a "Road Warrior",
a Windows box running PGPnet IPSEC.

The windows machine had IP address 151.20.110.219 and had negotiated a
tunnel.  It was then disconnected and reconnected with IP address
151.25.5.30.

Jun 20 17:21:43 darkstar Pluto[121]: | *received 196 bytes from 151.25.5.30:500 
on eth0
Jun 20 17:21:43 darkstar Pluto[121]: | state object #3 found, in STATE_MAIN_R1
Jun 20 17:21:43 darkstar Pluto[121]: | sending:
        message from and response to 151.25.5.30:500.
        Included for context (timestamp, in particular)

Jun 20 17:21:44 darkstar Pluto[121]: | *received 1732 bytes from 
151.25.5.30:500 on eth0
Jun 20 17:21:44 darkstar Pluto[121]: | state object #3 found, in STATE_MAIN_R2
Jun 20 17:21:45 darkstar Pluto[121]: "Berizzi" #3: deleting connection 
"Berizzi" instance with peer 151.20.110.219
Jun 20 17:21:45 darkstar Pluto[121]: "Berizzi" #2: deleting state 
(STATE_QUICK_R2)
Jun 20 17:21:45 darkstar Pluto[121]: | ***emit ISAKMP Delete Payload:
Jun 20 17:21:45 darkstar Pluto[121]: | sending:
        This is not FreeS/WAN code.  It must be part of the x.509 patch
        sends about 68 bytes, probably to 151.20.110.219:500

Jun 20 17:21:45 darkstar Pluto[121]: | ***emit ISAKMP Delete Payload:
Jun 20 17:21:45 darkstar Pluto[121]: | sending:
        This is not FreeS/WAN code.  It must be part of the x.509 patch
        sends about 68 bytes, probably to 151.20.110.219:500

Jun 20 17:21:46 darkstar Pluto[121]: | sending:
        sends about 1652 bytes to 151.25.5.30:500

Jun 20 17:21:46 darkstar Pluto[121]: ERROR: "Berizzi" #3: sendto() on eth0 to 
151.25.5.30:500 failed in STF_REPLY. Errno 113: No route to host

        I have no idea why a sendto to 151.25.5.30:500 would fail.  It
        is more likely that the sendtos to 151.20.110.219:500 (the
        former address) would fail.  The error report is just based on
        what Pluto tried to do: it tried to sendto 151.25.5.30:500,
        and got an EHOSTUNREACH.  Perhaps that error return was caused
        by an earlier sendto.

        There is no attempt to read MSG_ERRQUE for a write failure.
        Perhaps there should be.  But I think that the next time
        through the master select call, if there is a message on the
        MSG_ERRQUEUE, a recvfrom will be triggered, get an error
        return, and a recvmsg(,, MSG_ERRQUEUE) will then read
        the error report.

Jun 20 17:21:46 darkstar Pluto[121]: | *received 1732 bytes from 
151.25.5.30:500 on eth0
Jun 20 17:21:46 darkstar Pluto[121]: | state object #3 found, in STATE_MAIN_R3
Jun 20 17:21:46 darkstar Pluto[121]: "Berizzi" #3: retransmitting in response 
to duplicate packet; already STATE_MAIN_R3

        Interestingly, the next read actually got a retransmitted
        packet, not an error indication.

        It isn't clear why the other side retransmitted.  It is
        *possible* that we were hung in a read.  But the  timestamps
        don't indicate an excessive delay.

        Could it be that the ERROR quoted above supressed the message
        actually being sent?  Makes some sense.

Jun 20 17:21:46 darkstar Pluto[121]: | sending:
        sends about 1652 bytes to 151.25.5.30:500

        This time we don't get an error, even though the destination
        is identical?!?!

Jun 20 17:21:49 darkstar Pluto[121]: | *received 52 bytes from 151.25.5.30:500 
on eth0
Jun 20 17:21:49 darkstar Pluto[121]: | state object #4 found, in STATE_QUICK_R1

        Pluto receives the last message of an IKE Quick Mode exchange.
        This would be the end of negotiation with the Road Warrior
        (mobile IPSEC host, which in this case is running PGPnet IPSEC
        software, not ours).

        Pluto then proceeds to install a new tunnel.

Jun 20 17:21:50 darkstar Pluto[121]: "Berizzi" #4: STATE_QUICK_R2: IPsec SA 
established
Jun 20 17:21:50 darkstar Pluto[121]: | next event EVENT_SHUNT_SCAN in 47 seconds

        Pluto now goes into select, waiting for next event.
        select tells it that a message is ready.
        It hangs in the recvfrom.

Jun 20 17:24:42 darkstar Pluto[121]: ERROR: recvfrom() on eth0 failed in 
comm_handle (Pluto cannot decode source sockaddr in rejection: unknown source). 
Errno 4: Interrupted system call

        Marco does a SIGHUP to Pluto.  Pluto catches the signal, sets
        a variable, and returns.  Pluto is stupid enough (thankfully)
        to treat the EINTR as any other read error: it goes on to look
        at the MSG_ERRQUEUE.

        Notice the almost 3 minute delay between this message and the
        preceding one.  Fascinating, since there was supposed to be
        a timer event sooner than that.  Somewhere within the first
        47 seconds of that select, the select said a message was
        ready (the timer is implemented as part of the select).

Jun 20 17:24:42 darkstar Pluto[121]: | rejected packet:
Jun 20 17:24:42 darkstar Pluto[121]: |
Jun 20 17:24:42 darkstar Pluto[121]: | control:
Jun 20 17:24:42 darkstar Pluto[121]: |   2c 00 00 00  00 00 00 00  0b 00 00 00  
71 00 00 00
Jun 20 17:24:42 darkstar Pluto[121]: |   02 03 01 00  00 00 00 00  00 00 00 00  
02 00 96 c3
Jun 20 17:24:42 darkstar Pluto[121]: |   97 05 b8 0a  a0 07 ec c2  c0 00 00 00
Jun 20 17:24:42 darkstar Pluto[121]: | name:
Jun 20 17:24:42 darkstar Pluto[121]: |
Jun 20 17:24:42 darkstar Pluto[121]: extended network error info for message to 
unknown: compainant 151.5.184.10, errno 113 No route to host, origin ICMP (not 
authenticated) 2, type 3, code 1

        This is what was found from the MSG_ERRQUEUE.  There was no
        rejected packet, nor name.  97 05 b8 0a: 151.5.184.a0 I
        haven't looked at what the other parts of the control portion
        mean.

So there is an interpretation of the log (with lots deleted) around
the time of the recvfrom hang.  Other than KLIPS messages, no kernel
messages were logged during this period.

I hope that this additional information helps.

Hugh Redelmeier
hugh@xxxxxxxxxx  voice: +1 416 482-8253


<Prev in Thread] Current Thread [Next in Thread>