On Thu, 22 Nov 2001 10:17:21 +1100,
Keith Owens <kaos@xxxxxxxxxxxxxxxxx> wrote:
>On 21 Nov 2001 16:46:40 -0600,
>Russell Cattelan <cattelan@xxxxxxxxxxx> wrote:
>>On Wed, 2001-11-21 at 16:36, Keith Owens wrote:
>>Yes, this is quite annoying... I do go in every once in a while
>>and whack all the rsyncd processes, just to get things moving again.
>>
>>Do you have any references to this bug? It might be nice
>>to figure out a way around the problem.
>
>It was supposed to have been fixed in rsync 2.4.[56] but I still get
>the problem in RH rsync 2.4.6-2. I will fetch the clean rsync sources
>and see if I can reproduce the bug.
I have come to the conclusion that this is probably not rsync; it looks
more like the kernel. However, the hang is almost always triggered by
rsync sending a chunk of data about hard links, so it may still be
rsync related.
While doing an rsync over ssh of 3 kernel directories with lots of hard
links (created by cp -al, then patched), it hung. On the sending machine -
...---rsync(32301)---ssh(32302)
# ll /proc/{32301,32302}/fd
/proc/32301/fd:
total 0
dr-x------ 2 kaos ocs 0 Dec 1 12:41 ./
dr-xr-xr-x 3 kaos ocs 0 Dec 1 12:41 ../
lrwx------ 1 kaos ocs 64 Dec 1 12:41 0 -> /dev/pts/0
lrwx------ 1 kaos ocs 64 Dec 1 12:41 1 -> /dev/pts/0
lrwx------ 1 kaos ocs 64 Dec 1 12:41 2 -> /dev/pts/0
lrwx------ 1 kaos ocs 64 Dec 1 12:41 4 -> socket:[210429]
lrwx------ 1 kaos ocs 64 Dec 1 12:41 5 -> socket:[210430]
/proc/32302/fd:
total 0
dr-x------ 2 root root 0 Dec 1 12:41 ./
dr-xr-xr-x 3 kaos ocs 0 Dec 1 12:41 ../
lrwx------ 1 root root 64 Dec 1 12:41 0 -> socket:[210428]
lrwx------ 1 root root 64 Dec 1 12:41 1 -> socket:[210431]
lrwx------ 1 root root 64 Dec 1 12:41 2 -> /dev/pts/0
lrwx------ 1 root root 64 Dec 1 12:41 3 -> socket:[210434]
lrwx------ 1 root root 64 Dec 1 12:41 4 -> socket:[210428]
lrwx------ 1 root root 64 Dec 1 12:41 5 -> socket:[210431]
lrwx------ 1 root root 64 Dec 1 12:41 6 -> /dev/pts/0
# lsof -p 32301,32302
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
rsync 32301 kaos cwd DIR 8,18 4096 133 /build/kaos
rsync 32301 kaos rtd DIR 8,2 1024 2 /
rsync 32301 kaos txt REG 8,17 149180 6993748 /usr/bin/rsync
rsync 32301 kaos mem REG 8,2 471781 45634 /lib/ld-2.2.2.so
rsync 32301 kaos mem REG 8,2 275531 45668 /lib/libnss_files-2.2.2.so
rsync 32301 kaos mem REG 8,2 5634864 27779 /lib/i686/libc-2.2.2.so
rsync 32301 kaos 0u CHR 136,0 2 /dev/pts/0
rsync 32301 kaos 1u CHR 136,0 2 /dev/pts/0
rsync 32301 kaos 2u CHR 136,0 2 /dev/pts/0
rsync 32301 kaos 4u unix 0xc85177c0 210429 socket
rsync 32301 kaos 5u unix 0xc8517480 210430 socket
ssh 32302 root cwd DIR 8,18 4096 133 /build/kaos
ssh 32302 root rtd DIR 8,2 1024 2 /
ssh 32302 root txt REG 8,17 195472 6547129 /usr/bin/ssh
ssh 32302 root mem REG 8,2 471781 45634 /lib/ld-2.2.2.so
ssh 32302 root mem REG 8,2 35352 45707 /lib/libpam.so.0.74
ssh 32302 root mem REG 8,2 79024 45647 /lib/libdl-2.2.2.so
ssh 32302 root mem REG 8,17 60112 8641101 /usr/lib/libz.so.1.1.3
ssh 32302 root mem REG 8,2 445289 45652 /lib/libnsl-2.2.2.so
ssh 32302 root mem REG 8,2 61451 45687 /lib/libutil-2.2.2.so
ssh 32302 root mem REG 8,17 869430 8526806 /usr/lib/libcrypto.so.0.9.6
ssh 32302 root mem REG 8,2 5634864 27779 /lib/i686/libc-2.2.2.so
ssh 32302 root mem REG 8,2 275531 45668 /lib/libnss_files-2.2.2.so
ssh 32302 root 0u unix 0xc8517b00 210428 socket
ssh 32302 root 1u unix 0xc8517140 210431 socket
ssh 32302 root 2u CHR 136,0 2 /dev/pts/0
ssh 32302 root 3u IPv4 210434 TCP ocs4.intra.ocs.com.au:32804->ocs3.intra.ocs.com.au:ssh (ESTABLISHED)
ssh 32302 root 4u unix 0xc8517b00 210428 socket
ssh 32302 root 5u unix 0xc8517140 210431 socket
ssh 32302 root 6u CHR 136,0 2 /dev/pts/0
# ltrace -p 32301
gettimeofday(0xbffff7a8, 0, 0xbffff7c8, 0x0804f067, 0) = 0
waitpid(32302, 0xbffff808, 1, 0x0807b9f8, 0x0807b9f8) = 0
gettimeofday(0xbffff7b0, 0, 0xbffff7c8, 0x0804f067, 0) = 0
gettimeofday(0xbffff7a8, 0, 0xbffff7c8, 0x0804f067, 0) = 0
__errno_location() = 0x4014f600
According to ltrace, rsync is spinning in that loop, waiting for its
child (ssh) to exit. strace says the same thing and kdb confirms it. I
needed the confirmation because of the effect described below.
# ltrace -p 32302
write(5, "q", 65536) = 65406
memcpy(0x08094238, "", 6) = 0x08094238
.....
select(7, 0x0808cd00, 0x0808cf40, 0, 0) = 1
write(3, "\033\370\331\014(\340\306\310\r\317\274\225\n\017YE\376\375\275U\257\211.\363\310\236\224\233\326\233S\260"..., 48) = 48
memset(0x0808cd00, '\000', 4) = 0x0808cd00
memset(0x0808cf40, '\000', 4) = 0x0808cf40
select(7, 0x0808cd00, 0x0808cf40, 0, 0) = 1
read(3, "\366\340\313\360Ba\237\206\214*\231\231d\275mo.\217\374}\222\354h\225\342b4\217\210\330w\374"..., 8192) = 48
The first thing of note is that ltrace on ssh brings the process out of
its hung state, and rsync then completes! strace on ssh has the same effect.
select(7, [3 4], [5], NULL, NULL) = 1 (out [5])
write(5, "q\0\0\tbuild/kaos/2.4.17-pre1/arch/"..., 65536) = 65406
Both traces point to ssh hanging while trying to write a chunk of 65536
bytes to fd 5. fd 5 is a loopback to ssh fd 1 (yes, ssh reads from
stdout). For reasons unknown, the write to fd 5 blocks, so ssh hangs.
ltrace or strace brings ssh out of the select with a valid response, so
it recovers.
Running ssh under strace perturbs the environment enough that the
problem does not occur; in particular, all the writes to fd 5 are for
much smaller chunks. Connecting to ssh with gdb and then disconnecting
also restarts ssh. I could not find any signal that had the same
effect; sending different signals to ssh either killed it (HUP, USR1,
ALRM) or had no effect (STOP, TSTP, CONT).
This workaround might be worth running once a minute on oss.sgi.com.
# match the ssh processes that are carrying an rsync; the [s]h in the
# pattern stops grep from matching its own command line
for i in $(ps xa | grep '[s]h.*rsync' | awk '{print $1}')
do
	strace -p $i 2>&1 | head -1 > /dev/null
done
AFAICT it has no effect on working connections.
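If running strace from cron is too intrusive, the same poke can
presumably be done with gdb, since attaching and then detaching gdb
also wakes ssh up. Something along these lines should do it; the gdb
invocation here is only illustrative, I have only done the
attach/detach interactively:
# same poke via gdb attach/detach instead of strace; /usr/bin/ssh is
# the binary path shown in the lsof output above
for i in $(ps xa | grep '[s]h.*rsync' | awk '{print $1}')
do
	echo detach | gdb -q /usr/bin/ssh $i > /dev/null 2>&1
done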
I can reproduce the hang using just rsync from one disk to another on
the same machine, with no ssh involved. Again, running strace on the
first child of the rsync command restarts the transfer.
bash(30493)---rsync(32597)---rsync(32598)---rsync(32599)
# strace -p 32598 2>&1 | head -2
select(2, NULL, [1], NULL, {9, 150000}) = 1 (out [1], left {9, 140000})
write(1, "1\0\0\tbuild/kaos/2.4.16-pristine/a"..., 53) = 53
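In case anyone wants to try the local reproduction, the setup was along
these lines. Only two of the three trees are shown, and the paths and
rsync options are illustrative rather than my exact commands; the key
ingredients are the cp -al trees full of hard links and rsync's -H
(preserve hard links) option.
# two hard-linked copies of a kernel tree, the second then patched
cp -al /usr/src/linux-2.4.16 /build/kaos/2.4.16-pristine
cp -al /build/kaos/2.4.16-pristine /build/kaos/2.4.17-pre1
(cd /build/kaos/2.4.17-pre1 && patch -p1 < /usr/src/patch-2.4.17-pre1)
# copy the trees to a filesystem on another disk, preserving hard links
rsync -aH /build/kaos/2.4.16-pristine /build/kaos/2.4.17-pre1 /otherdisk/kaos/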
I am going to follow this up on l-k after upgrading to 2.4.16; my
testing was on 2.4.14-pre7-xfs. It definitely looks like a kernel bug
that rsync just happens to trip over.