| | From: Andi Kleen <ak@xxxxxx>
|
| | Is the affected machine a SMP box?
Marco says: No, only one processor. P100 with F0 0F bug.
Marco also answered another question that I asked:
| From: Marco Berizzi <pupilla@xxxxxxxxxxx>
| > I have a question: does the hang occur every time you go through the
| > procedure you describe (I infer that from what you've said), or just
| > too often?
| >
|
| every time I go through the procedure
I also think it might be helpful if I showed some of the logging done
by Pluto. This was captured by Marco (thanks!):
Remember that this machine (darkstar) is talking to a "Road Warrior",
a Windows box running PGPnet IPSEC.
The windows machine had IP address 151.20.110.219 and had negotiated a
tunnel. It was then disconnected and reconnected with IP address
151.25.5.30.
Jun 20 17:21:43 darkstar Pluto[121]: | *received 196 bytes from 151.25.5.30:500
on eth0
Jun 20 17:21:43 darkstar Pluto[121]: | state object #3 found, in STATE_MAIN_R1
Jun 20 17:21:43 darkstar Pluto[121]: | sending:
message from and response to 151.25.5.30:500.
Included for context (timestamp, in particular)
Jun 20 17:21:44 darkstar Pluto[121]: | *received 1732 bytes from
151.25.5.30:500 on eth0
Jun 20 17:21:44 darkstar Pluto[121]: | state object #3 found, in STATE_MAIN_R2
Jun 20 17:21:45 darkstar Pluto[121]: "Berizzi" #3: deleting connection
"Berizzi" instance with peer 151.20.110.219
Jun 20 17:21:45 darkstar Pluto[121]: "Berizzi" #2: deleting state
(STATE_QUICK_R2)
Jun 20 17:21:45 darkstar Pluto[121]: | ***emit ISAKMP Delete Payload:
Jun 20 17:21:45 darkstar Pluto[121]: | sending:
This is not FreeS/WAN code; it must be part of the X.509 patch.
It sends about 68 bytes, probably to 151.20.110.219:500.
Jun 20 17:21:45 darkstar Pluto[121]: | ***emit ISAKMP Delete Payload:
Jun 20 17:21:45 darkstar Pluto[121]: | sending:
This is not FreeS/WAN code; it must be part of the X.509 patch.
It sends about 68 bytes, probably to 151.20.110.219:500.
Jun 20 17:21:46 darkstar Pluto[121]: | sending:
sends about 1652 bytes to 151.25.5.30:500
Jun 20 17:21:46 darkstar Pluto[121]: ERROR: "Berizzi" #3: sendto() on eth0 to
151.25.5.30:500 failed in STF_REPLY. Errno 113: No route to host
I have no idea why a sendto to 151.25.5.30:500 would fail. It
is more likely that the sendtos to 151.20.110.219:500 (the
former address) would fail. The error report is just based on
what Pluto tried to do: it tried to sendto 151.25.5.30:500,
and got an EHOSTUNREACH. Perhaps that error return was caused
by an earlier sendto.
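This matches how Linux UDP sockets behave: an asynchronous ICMP error is
queued against the socket and surfaces on a *later* system call, which may
name a perfectly deliverable destination. A minimal sketch of the effect
(my own illustration, not Pluto code; it uses a loopback port-unreachable
in place of a host-unreachable so it is reproducible without a network):

```python
import errno
import socket
import time

# Find a UDP port on loopback that nothing is listening on:
# bind a socket, note its port, and close it again.
probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("127.0.0.1", dead_port))

s.send(b"first")      # succeeds: the ICMP error has not arrived yet
time.sleep(0.2)       # let the ICMP port-unreachable come back

deferred = None
try:
    s.send(b"second") # the *earlier* packet's error is reported here
except OSError as e:
    deferred = e.errno

s.close()
```

The first send reports success even though its packet provoked the error;
the errno only appears on the next call, against a packet that may have
been fine — which is exactly the confusion in the log above.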
There is no attempt to read MSG_ERRQUEUE for a write failure.
Perhaps there should be. But I think that the next time
through the master select call, if there is a message on the
MSG_ERRQUEUE, a recvfrom will be triggered, get an error
return, and a recvmsg(,, MSG_ERRQUEUE) will then read
the error report.
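For reference, the queued error can be drained explicitly with recvmsg(2)
and MSG_ERRQUEUE once IP_RECVERR is enabled; the ancillary data carries a
struct sock_extended_err followed by the offending address. A hedged,
Linux-only sketch (IP_RECVERR is option 11 on SOL_IP; the 16-byte
sock_extended_err layout is taken from linux/errqueue.h; again a loopback
port-unreachable stands in for the host-unreachable in the log):

```python
import errno
import socket
import struct
import time

IP_RECVERR = 11  # from <linux/in.h>; not exposed by every Python build

# Provoke a queued error: send to a loopback port with no listener.
probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
s.connect(("127.0.0.1", dead_port))
s.send(b"ping")
time.sleep(0.2)  # let the ICMP port-unreachable be queued

# Drain the error queue instead of waiting to trip over it in sendto().
data, ancdata, flags, addr = s.recvmsg(512, 512, socket.MSG_ERRQUEUE)
level, ctype, payload = ancdata[0]

# struct sock_extended_err: u32 ee_errno; u8 origin, type, code, pad;
# u32 ee_info; u32 ee_data; the offender sockaddr follows it.
ee_errno, origin, ee_type, code, _pad, _info, _data = struct.unpack_from(
    "=IBBBBII", payload)
s.close()
```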
Jun 20 17:21:46 darkstar Pluto[121]: | *received 1732 bytes from
151.25.5.30:500 on eth0
Jun 20 17:21:46 darkstar Pluto[121]: | state object #3 found, in STATE_MAIN_R3
Jun 20 17:21:46 darkstar Pluto[121]: "Berizzi" #3: retransmitting in response
to duplicate packet; already STATE_MAIN_R3
Interestingly, the next read actually got a retransmitted
packet, not an error indication.
It isn't clear why the other side retransmitted. It is
*possible* that we were hung in a read. But the timestamps
don't indicate an excessive delay.
Could it be that the ERROR quoted above suppressed the message
actually being sent?  That makes some sense.
Jun 20 17:21:46 darkstar Pluto[121]: | sending:
sends about 1652 bytes to 151.25.5.30:500
This time we don't get an error, even though the destination
is identical?!?!
Jun 20 17:21:49 darkstar Pluto[121]: | *received 52 bytes from 151.25.5.30:500
on eth0
Jun 20 17:21:49 darkstar Pluto[121]: | state object #4 found, in STATE_QUICK_R1
Pluto receives the last message of an IKE Quick Mode exchange.
This would be the end of negotiation with the Road Warrior
(mobile IPSEC host, which in this case is running PGPnet IPSEC
software, not ours).
Pluto then proceeds to install a new tunnel.
Jun 20 17:21:50 darkstar Pluto[121]: "Berizzi" #4: STATE_QUICK_R2: IPsec SA
established
Jun 20 17:21:50 darkstar Pluto[121]: | next event EVENT_SHUNT_SCAN in 47 seconds
Pluto now goes into select, waiting for next event.
select tells it that a message is ready.
It hangs in the recvfrom.
Jun 20 17:24:42 darkstar Pluto[121]: ERROR: recvfrom() on eth0 failed in
comm_handle (Pluto cannot decode source sockaddr in rejection: unknown source).
Errno 4: Interrupted system call
Marco does a SIGHUP to Pluto. Pluto catches the signal, sets
a variable, and returns. Pluto is stupid enough (thankfully)
to treat the EINTR as any other read error: it goes on to look
at the MSG_ERRQUEUE.
Notice the almost 3 minute delay between this message and the
preceding one. Fascinating, since there was supposed to be
a timer event sooner than that. Somewhere within the first
47 seconds of that select, the select said a message was
ready (the timer is implemented as part of the select).
Jun 20 17:24:42 darkstar Pluto[121]: | rejected packet:
Jun 20 17:24:42 darkstar Pluto[121]: |
Jun 20 17:24:42 darkstar Pluto[121]: | control:
Jun 20 17:24:42 darkstar Pluto[121]: | 2c 00 00 00 00 00 00 00 0b 00 00 00
71 00 00 00
Jun 20 17:24:42 darkstar Pluto[121]: | 02 03 01 00 00 00 00 00 00 00 00 00
02 00 96 c3
Jun 20 17:24:42 darkstar Pluto[121]: | 97 05 b8 0a a0 07 ec c2 c0 00 00 00
Jun 20 17:24:42 darkstar Pluto[121]: | name:
Jun 20 17:24:42 darkstar Pluto[121]: |
Jun 20 17:24:42 darkstar Pluto[121]: extended network error info for message to
unknown: compainant 151.5.184.10, errno 113 No route to host, origin ICMP (not
authenticated) 2, type 3, code 1
This is what was found from the MSG_ERRQUEUE. There was no
rejected packet, nor name.  97 05 b8 0a is 151.5.184.10.  I
haven't looked at what the other parts of the control portion
mean.
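For what it is worth, the rest of the control portion does decode cleanly
if one assumes the i386 layout of cmsghdr + sock_extended_err + offender
sockaddr_in (the offsets below are my assumption from that layout, not
anything the log states; the port bytes look uninitialized):

```python
import struct

# The 44 control bytes from the log above.
ctl = bytes.fromhex(
    "2c000000" "00000000" "0b000000"              # cmsghdr
    "71000000" "02030100" "00000000" "00000000"   # sock_extended_err
    "020096c3" "9705b80a" "a007ecc2" "c0000000"   # offender sockaddr_in
)

# cmsghdr on i386: u32 cmsg_len; int cmsg_level; int cmsg_type.
# Expect len=44, level=0 (SOL_IP), type=11 (IP_RECVERR).
cmsg_len, cmsg_level, cmsg_type = struct.unpack_from("<III", ctl, 0)

# struct sock_extended_err at offset 12: u32 ee_errno; u8 origin, type,
# code, pad; u32 ee_info; u32 ee_data.
ee_errno, origin, ee_type, code, _pad, _info, _data = struct.unpack_from(
    "<IBBBBII", ctl, 12)

# SO_EE_OFFENDER sockaddr_in at offset 28: family, then the address
# at offset 32 (the intervening port bytes appear to be garbage).
family, = struct.unpack_from("<H", ctl, 28)
offender = ".".join(str(b) for b in ctl[32:36])
```

That yields errno 113 (EHOSTUNREACH), origin 2 (ICMP), type 3, code 1
(host unreachable) from AF_INET offender 151.5.184.10 -- i.e. exactly the
fields Pluto printed in its "extended network error info" line.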
So that is my interpretation of the log (with lots deleted) around
the time of the recvfrom hang. Other than KLIPS messages, no kernel
messages were logged during this period.
I hope that this additional information helps.
Hugh Redelmeier
hugh@xxxxxxxxxx voice: +1 416 482-8253
|