Kernel 2.4.19
glibc 2.2.5
I am running a parallel program on 170 nodes, with 2 processes on each
node, so 340 processes in total. Each process has a TCP connection
established with every other process, so each process has 339 sockets in
the ESTABLISHED state. The problem occurs when I try to write() on these
sockets. For a few processes, the TCP connection on some of their sockets
is dropped as soon as they try to write to those sockets. The problem
does not occur, however, if I reduce the number of processes to fewer than
306 (305 TCP sockets/connections per process).
Any ideas why connections are getting dropped?
Hassan
P.S. All sockets are non-blocking and Nagle's algorithm is disabled.
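
In case it helps narrow this down, below is a minimal sketch (not the actual program code) of how the error from a failing write() on a non-blocking socket could be captured and classified. The names try_write, pending_socket_error and the write_status values are hypothetical, made up for this illustration. It assumes SIGPIPE is ignored or handled elsewhere, since otherwise a write() to a closed connection kills the process before errno can be inspected.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result codes used only by this sketch. */
enum write_status {
    WRITE_OK,          /* some or all bytes were accepted by the kernel */
    WRITE_WOULD_BLOCK, /* send buffer full; retry after select()/poll() */
    WRITE_PEER_GONE,   /* EPIPE or ECONNRESET: connection was dropped   */
    WRITE_OTHER_ERROR  /* anything else; errno is logged to stderr      */
};

/* Attempt a single write() on a non-blocking socket and classify the
 * outcome. On success, *written holds the number of bytes accepted,
 * which may be fewer than len. */
static enum write_status try_write(int fd, const void *buf, size_t len,
                                   ssize_t *written)
{
    ssize_t n;
    int err;

    n = write(fd, buf, len);
    if (n >= 0) {
        *written = n;
        return WRITE_OK;
    }

    *written = 0;
    err = errno; /* save before any other call can overwrite it */

    /* Not an error on a non-blocking socket: the buffer is just full. */
    if (err == EAGAIN || err == EWOULDBLOCK)
        return WRITE_WOULD_BLOCK;

    fprintf(stderr, "write(fd=%d) failed: %s (errno %d)\n",
            fd, strerror(err), err);

    if (err == EPIPE || err == ECONNRESET)
        return WRITE_PEER_GONE;
    return WRITE_OTHER_ERROR;
}

/* Return the error currently pending on the socket (0 if none), as
 * reported by SO_ERROR. Reading SO_ERROR clears the pending error. */
static int pending_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}
```

Logging the errno value on the processes that lose connections (for example ECONNRESET versus EPIPE versus ETIMEDOUT) would show whether the peer reset the connection or the local side gave up, which is a guess at a useful first step rather than a diagnosis.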