netdev

Re: select says I can read, but recvfrom hangs

To: hugh@xxxxxxxxxx
Subject: Re: select says I can read, but recvfrom hangs
From: Ben Greear <greearb@xxxxxxxxxxxxxxx>
Date: Wed, 20 Jun 2001 20:13:25 -0700
Cc: netdev@xxxxxxxxxxx, Marco Berizzi <pupilla@xxxxxxxxxxx>
Organization: Candela Technologies
References: <Pine.LNX.4.33.0106202214500.8383-100000@redshift.mimosa.com>
Sender: owner-netdev@xxxxxxxxxxx

"D. Hugh Redelmeier" wrote:
> 
> I'm the maintainer of Pluto, the IKE daemon for the LINUX FreeS/WAN
> project.
> 
> Some of our users have experienced situations where the Pluto process
> becomes unresponsive because it is waiting in a recvfrom.  The thing
> that puzzles me is that recvfrom will not be executed unless select
> has indicated that there is something to read on that socket.  I have
> no idea how that could happen.
> 
> Do any of you have any ideas about what could be happening?
> 
> Details:
> 
> - A couple of people have noticed it happening, but not often.  It may
>   be happening without being noticed, but not at great frequency.
> 
> - one user has been able to reproduce it fairly consistently.  I've
>   mutated the code to try to narrow down what is going on.  Right
>   now, there are three selects that say a message is ready, but the
>   recvfrom still hangs.
> 
> - this user's system is Slackware 7.1, with a kernel.org 2.2.19,
>   patched by FreeS/WAN 1.91.  Richard Briggs, our kernel guy, doesn't
>   see a way that FreeS/WAN affects the input path for messages that
>   are UDP (i.e. not ESP and not AH).
> 
> - the socket in question is bound to UDP, Port 500 with the IP address
>   of the public interface.  The RFCs dictate this.  Socket options:
>   SO_REUSEADDR and IP_RECVERR.  Hmm, I wonder if IP_RECVERR could be
>   the problem -- I have evidence that not many folks have used it.
> 
> - I would not ask you to read the whole of Pluto to help me.  But if
>   you wish to, it can be found through www.freeswan.org.  Here is the
>   recvfrom that is hanging, and the preceding just-to-be-safe select:
> 
>         {
>             fd_set nreadfds;
>             int nndes;
>             struct timeval tm;
> 
>             tm.tv_sec = 0;      /* don't wait at all */
>             tm.tv_usec = 0;
> 
>             FD_ZERO(&nreadfds);
>             FD_SET(ifp->fd, &nreadfds);
>             do {
>                 nndes = select(ifp->fd + 1, &nreadfds, NULL, NULL, &tm);
>             } while (nndes == -1 && errno == EINTR);
>             if (nndes < 0)
>             {
>                 log_errno((e, "re-select() failed in comm_handle"));
>                 return;
>             }
>             if (nndes == 0)
>             {
>                 log("SURPRISE: re-select() in comm_handle finds %s no longer ready for input"
>                     , ifp->rname);
>                 return;
>             }
>             passert(nndes == 1 && FD_ISSET(ifp->fd, &nreadfds));
>         }
> 
>         passert(select_found == ifp->fd);
>         zero(&from.sa);
>         packet_len = recvfrom(ifp->fd, bigbuffer, sizeof(bigbuffer), 0
>             , &from.sa, &from_len);
>         passert(select_found == ifp->fd);       /* true paranoia */
>         select_found = NULL_FD;
> 
> - the only signal handlers simply set a sigatomic_t variable and
>   return (SIGHUP, SIGTERM).  They are not firing.
> 
> - The file descriptor in question is not shared with another process.
>   Locking prevents two copies of Pluto from running at once.
> 
> - the scenario that provokes the problem for the user goes as follows:
> 
>   + Pluto is running on a security gateway, with a Windows NT box
>     behind it
> 
>   + he connects a second windows box, running PGPnet (an IPSEC
>     implementation), through the internet, to the public interface
>     of the security gateway.  This box negotiates a tunnel with
>     the security gateway.
> 
>   + he disconnects the second windows box, and reconnects the same way
>     but with a different IP address (the IP address is dynamically
>     assigned whenever he connects this box to the internet).
> 
>   + the second box starts and completes IKE negotiation.
> 
>   + Pluto is tricked into hanging on a recvfrom.
> 
> Is there any way to tell from the system whether the select is wrong
> (i.e. there is no message) or the recvfrom is wrong (i.e. there is a
> message, but it still hangs reading it)?

Make your socket non-blocking (O_NONBLOCK), and you don't have to worry about
that kind of thing.  Just be sure you correctly handle all the error cases,
i.e. recvfrom returning no data (EAGAIN/EWOULDBLOCK).

I always just consider select() a hint, not the Truth :)

> 
> Thanks,
> 
> Hugh Redelmeier
> hugh@xxxxxxxxxx  voice: +1 416 482-8253

-- 
Ben Greear <greearb@xxxxxxxxxxxxxxx>          <Ben_Greear@xxxxxxxxxx>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear
