"D. Hugh Redelmeier" wrote:
>
> I'm the maintainer of Pluto, the IKE daemon for the LINUX FreeS/WAN
> project.
>
> Some of our users have experienced situations where the Pluto process
> becomes unresponsive because it is waiting in a recvfrom. The thing
> that puzzles me is that recvfrom will not be executed unless select
> has indicated that there is something to read on that socket. I have
> no idea how that could happen.
>
> Do any of you have any ideas about what could be happening?
>
> Details:
>
> - A couple of people have noticed it happening, but not often. It may
> be happening without being noticed, but not at great frequency.
>
> - one user has been able to reproduce it fairly consistently. I've
> mutated the code to try to narrow down what is going on. Right
> now, there are three selects that say a message is ready, but the
> recvfrom still hangs.
>
> - this user's system is Slackware 7.1, with a kernel.org 2.2.19,
> patched by FreeS/WAN 1.91. Richard Briggs, our kernel guy, doesn't
> see a way that FreeS/WAN affects the input path for messages that
> are UDP (i.e. not ESP and not AH)
>
> - the socket in question is bound to UDP, Port 500 with the IP address
> of the public interface. The RFCs dictate this. Socket options:
> SO_REUSEADDR and IP_RECVERR. Hmm, I wonder if IP_RECVERR could be
> the problem -- I have evidence that not many folks have used it.
>
> - I would not ask you to read the whole of Pluto to help me. But if
> you wish to, it can be found through www.freeswan.org. Here is the
> recvfrom that is hanging, and the preceding just-to-be-safe select:
>
> {
> fd_set nreadfds;
> int nndes;
> struct timeval tm;
>
> tm.tv_sec = 0; /* don't wait at all */
> tm.tv_usec = 0;
>
> FD_ZERO(&nreadfds);
> FD_SET(ifp->fd, &nreadfds);
> do {
> nndes = select(ifp->fd + 1, &nreadfds, NULL, NULL, &tm);
> } while (nndes == -1 && errno == EINTR);
> if (nndes < 0)
> {
> log_errno((e, "re-select() failed in comm_handle"));
> return;
> }
> if (nndes == 0)
> {
> log("SURPRISE: re-select() in comm_handle finds %s no longer ready for input"
> , ifp->rname);
> return;
> }
> passert(nndes == 1 && FD_ISSET(ifp->fd, &nreadfds));
> }
>
> passert(select_found == ifp->fd);
> zero(&from.sa);
> packet_len = recvfrom(ifp->fd, bigbuffer, sizeof(bigbuffer), 0
> , &from.sa, &from_len);
> passert(select_found == ifp->fd); /* true paranoia */
> select_found = NULL_FD;
>
> - the only signal handlers simply set a sig_atomic_t variable and
> return (SIGHUP, SIGTERM). They are not firing.
>
> - The file descriptor in question is not shared with another process.
> Locking prevents two copies of Pluto from running at once.
>
> - the scenario that provokes the problem for the user goes as follows:
>
> + Pluto is running on a security gateway, with a Windows NT box
> behind it
>
> + he connects a second windows box, running PGPnet (an IPSEC
> implementation), through the internet, to the public interface
> of the security gateway. This box negotiates a tunnel with
> the security gateway.
>
> + he disconnects the second windows box, and reconnects the same way
> but with a different IP address (the IP address is dynamically
> assigned whenever he connects this box to the internet).
>
> + the second box starts and completes IKE negotiation.
>
> + Pluto is tricked into hanging on a recvfrom.
>
> Is there any way to tell from the system whether the select is wrong
> (i.e. there is no message) or the recvfrom is wrong (i.e. there is a
> message, but it still hangs reading it)?
Make your socket non-blocking (O_NONBLOCK), and you don't have to worry about
that kind of thing. Just be sure you handle all the error cases correctly
(i.e. a read that returns no data).
I always just consider select() a hint, not the Truth :)
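
A minimal sketch of what I mean, assuming a POSIX system. The helper names
(set_nonblocking, try_recvfrom) and the RECV_* return codes are made up for
illustration and are not anything from Pluto. With the socket marked
non-blocking, a recvfrom() that has nothing to read fails with EAGAIN (or
EWOULDBLOCK) instead of hanging, so a wrong "ready" from select() costs you
one wasted call rather than a stuck daemon:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical return codes for try_recvfrom(); lengths are >= 0. */
#define RECV_NOTHING_READY (-1)   /* select() said ready, but no data was there */
#define RECV_FAILED        (-2)   /* a real error; errno is left set */

/* Hypothetical helper: put fd into non-blocking mode.
 * Returns 0 on success, -1 on failure (errno set by fcntl). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: try to read one datagram without ever blocking.
 * Returns the datagram length (0 is a valid, empty UDP datagram),
 * RECV_NOTHING_READY if there was nothing to read, or RECV_FAILED. */
static ssize_t try_recvfrom(int fd, void *buf, size_t len,
                            struct sockaddr *from, socklen_t *from_len)
{
    ssize_t n;

    do {
        n = recvfrom(fd, buf, len, 0, from, from_len);
    } while (n == -1 && errno == EINTR);   /* retry if a signal interrupted us */

    if (n >= 0)
        return n;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return RECV_NOTHING_READY;         /* treat select() as a false hint */
    return RECV_FAILED;
}
```

You would call set_nonblocking() once after creating the socket, then call
try_recvfrom() whenever select() reports the descriptor readable. On
RECV_NOTHING_READY, log it if you like and go back to the select loop. If you
pass a non-NULL from, remember to set *from_len to the size of the address
buffer before each call.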
>
> Thanks,
>
> Hugh Redelmeier
> hugh@xxxxxxxxxx voice: +1 416 482-8253
--
Ben Greear <greearb@xxxxxxxxxxxxxxx> <Ben_Greear@xxxxxxxxxx>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear