xfs
[Top] [All Lists]

Re: oss.sgi.com ftp down?

To: Russell Cattelan <cattelan@xxxxxxxxxxx>
Subject: Re: oss.sgi.com ftp down?
From: Keith Owens <kaos@xxxxxxxxxxxxxxxxx>
Date: Sat, 01 Dec 2001 14:21:15 +1100
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: Your message of "Thu, 22 Nov 2001 10:17:21 +1100." <7951.1006384641@ocs3.intra.ocs.com.au>
Sender: owner-linux-xfs@xxxxxxxxxxx
On Thu, 22 Nov 2001 10:17:21 +1100, 
Keith Owens <kaos@xxxxxxxxxxxxxxxxx> wrote:
>On 21 Nov 2001 16:46:40 -0600, 
>Russell Cattelan <cattelan@xxxxxxxxxxx> wrote:
>>On Wed, 2001-11-21 at 16:36, Keith Owens wrote:
>>Yes this is quite annoying... I do go in every once in a while
>>and whack all the rysncd, just to get things moving again.
>>
>>Do you have any references to this bug? I might be nice
>>to figure out a way around the problem.
>
>It was supposed to have been fixed in rsync 2.4.[56] but I still get
>the problem in RH rsync 2.4.6-2.  I will fetch the clean rsync sources
>and see if I can reproduce the bug.

I have come to the conclusion that this is probably not rsync, it looks
more like the kernel.  However the hang is almost always triggered by
rsync sending a chunk of data about hardlinks, so it may still be rsync
related.

Doing an rsync over ssh of 3 kernel directories with lots of hard links
(created by cp -al then patched), it hung.  On the sending machine -

...---rsync(32301)---ssh(32302)

# ll /proc/{32301,32302}/fd
/proc/32301/fd:
total 0
dr-x------    2 kaos     ocs             0 Dec  1 12:41 ./
dr-xr-xr-x    3 kaos     ocs             0 Dec  1 12:41 ../
lrwx------    1 kaos     ocs            64 Dec  1 12:41 0 -> /dev/pts/0
lrwx------    1 kaos     ocs            64 Dec  1 12:41 1 -> /dev/pts/0
lrwx------    1 kaos     ocs            64 Dec  1 12:41 2 -> /dev/pts/0
lrwx------    1 kaos     ocs            64 Dec  1 12:41 4 -> socket:[210429]
lrwx------    1 kaos     ocs            64 Dec  1 12:41 5 -> socket:[210430]

/proc/32302/fd:
total 0
dr-x------    2 root     root            0 Dec  1 12:41 ./
dr-xr-xr-x    3 kaos     ocs             0 Dec  1 12:41 ../
lrwx------    1 root     root           64 Dec  1 12:41 0 -> socket:[210428]
lrwx------    1 root     root           64 Dec  1 12:41 1 -> socket:[210431]
lrwx------    1 root     root           64 Dec  1 12:41 2 -> /dev/pts/0
lrwx------    1 root     root           64 Dec  1 12:41 3 -> socket:[210434]
lrwx------    1 root     root           64 Dec  1 12:41 4 -> socket:[210428]
lrwx------    1 root     root           64 Dec  1 12:41 5 -> socket:[210431]
lrwx------    1 root     root           64 Dec  1 12:41 6 -> /dev/pts/0

# lsof -p 32301,32302
COMMAND   PID USER   FD   TYPE     DEVICE    SIZE    NODE NAME
rsync   32301 kaos  cwd    DIR       8,18    4096     133 /build/kaos
rsync   32301 kaos  rtd    DIR        8,2    1024       2 /
rsync   32301 kaos  txt    REG       8,17  149180 6993748 /usr/bin/rsync
rsync   32301 kaos  mem    REG        8,2  471781   45634 /lib/ld-2.2.2.so
rsync   32301 kaos  mem    REG        8,2  275531   45668 
/lib/libnss_files-2.2.2.so
rsync   32301 kaos  mem    REG        8,2 5634864   27779 
/lib/i686/libc-2.2.2.so
rsync   32301 kaos    0u   CHR      136,0               2 /dev/pts/0
rsync   32301 kaos    1u   CHR      136,0               2 /dev/pts/0
rsync   32301 kaos    2u   CHR      136,0               2 /dev/pts/0
rsync   32301 kaos    4u  unix 0xc85177c0          210429 socket
rsync   32301 kaos    5u  unix 0xc8517480          210430 socket

ssh     32302 root  cwd    DIR       8,18    4096     133 /build/kaos
ssh     32302 root  rtd    DIR        8,2    1024       2 /
ssh     32302 root  txt    REG       8,17  195472 6547129 /usr/bin/ssh
ssh     32302 root  mem    REG        8,2  471781   45634 /lib/ld-2.2.2.so
ssh     32302 root  mem    REG        8,2   35352   45707 /lib/libpam.so.0.74
ssh     32302 root  mem    REG        8,2   79024   45647 /lib/libdl-2.2.2.so
ssh     32302 root  mem    REG       8,17   60112 8641101 /usr/lib/libz.so.1.1.3
ssh     32302 root  mem    REG        8,2  445289   45652 /lib/libnsl-2.2.2.so
ssh     32302 root  mem    REG        8,2   61451   45687 /lib/libutil-2.2.2.so
ssh     32302 root  mem    REG       8,17  869430 8526806 
/usr/lib/libcrypto.so.0.9.6
ssh     32302 root  mem    REG        8,2 5634864   27779 
/lib/i686/libc-2.2.2.so
ssh     32302 root  mem    REG        8,2  275531   45668 
/lib/libnss_files-2.2.2.so
ssh     32302 root    0u  unix 0xc8517b00          210428 socket
ssh     32302 root    1u  unix 0xc8517140          210431 socket
ssh     32302 root    2u   CHR      136,0               2 /dev/pts/0
ssh     32302 root    3u  IPv4     210434             TCP 
ocs4.intra.ocs.com.au:32804->ocs3.intra.ocs.com.au:ssh (ESTABLISHED)
ssh     32302 root    4u  unix 0xc8517b00          210428 socket
ssh     32302 root    5u  unix 0xc8517140          210431 socket
ssh     32302 root    6u   CHR      136,0               2 /dev/pts/0

# ltrace -p 32301
gettimeofday(0xbffff7a8, 0, 0xbffff7c8, 0x0804f067, 0) = 0
waitpid(32302, 0xbffff808, 1, 0x0807b9f8, 0x0807b9f8) = 0
gettimeofday(0xbffff7b0, 0, 0xbffff7c8, 0x0804f067, 0) = 0
gettimeofday(0xbffff7a8, 0, 0xbffff7c8, 0x0804f067, 0) = 0
__errno_location()                                = 0x4014f600

According to ltrace, rsync is spinning in that loop, waiting for its
child (ssh) to exit.  strace says the same thing, kdb confirms it.  I
needed the confirmation because of the effect below.

# ltrace -p 32302
write(5, "q", 65536)                              = 65406
memcpy(0x08094238, "", 6)                         = 0x08094238
.....
select(7, 0x0808cd00, 0x0808cf40, 0, 0)           = 1
write(3, 
"\033\370\331\014(\340\306\310\r\317\274\225\n\017YE\376\375\275U\257\211.\363\310\236\224\233\326\233S\260"...,
 48) = 48
memset(0x0808cd00, '\000', 4)                     = 0x0808cd00
memset(0x0808cf40, '\000', 4)                     = 0x0808cf40
select(7, 0x0808cd00, 0x0808cf40, 0, 0)           = 1
read(3, 
"\366\340\313\360Ba\237\206\214*\231\231d\275mo.\217\374}\222\354h\225\342b4\217\210\330w\374"...,
 8192) = 48

The first thing of note is that ltrace on ssh brings the process out of
hung state, rsync then completes!  strace on ssh has the same effect.

select(7, [3 4], [5], NULL, NULL)       = 1 (out [5])
write(5, "q\0\0\tbuild/kaos/2.4.17-pre1/arch/"..., 65536) = 65406

Both traces point to ssh hanging trying to write a chunk of 65536 bytes
to fd 5.  fd 5 is a loopback to ssh fd 1 (yes, ssh reads from stdout).
For reasons unknown, the write to fd 5 blocks so ssh hangs.  [ls]trace
brings ssh out of the select with a valid response so it recovers.

Running ssh under strace perturbs the environment enough that the
problem does not occur.  In particular all the writes to fd 5 are for
much smaller chunks.  Connecting to ssh with gdb then disconnecting
also restarts ssh.  I could not find any signals that had the same
effect, sending different signals to ssh either killed it (HUP, USR1,
ALRM) or had no effect (STOP, TSTP, CONT).

This workaround might be worth running once a minute on oss.sgi.com.

for i in $(ps xa | grep '[s]h.*rsync' | awk '{print $1}')
do
        strace -p $i 2>&1 | head -1 > /dev/null
done

AFAICT it has no effect on working connections.

I can reproduce the hang using just rsync from one disk to another on
the same machine, no ssh.  Again doing strace on the first child of the
rsync command restarts the transfer.

bash(30493)---rsync(32597)---rsync(32598)---rsync(32599)

# strace -p 32598 2>&1 | head -2
select(2, NULL, [1], NULL, {9, 150000}) = 1 (out [1], left {9, 140000})
write(1, "1\0\0\tbuild/kaos/2.4.16-pristine/a"..., 53) = 53

I am going to follow this up on l-k after upgrading to 2.4.16, my
testing was on 2.4.14-pre7-xfs.  It definitely looks like a kernel bug
that rysnc just happens to trip over.


<Prev in Thread] Current Thread [Next in Thread>