On the next big SGI machines we ran into a problem where PMCD runs at
100%, all of that time spent collecting metrics from the linux pmda.
Eventually we tracked the problem to PMIE failing to nanosleep().
1272757517.709705 select(5, [4], NULL, NULL, {9, 997965}) = 1 (in [4], left {9,
661665})
1272757518.046516 recvfrom(4, "\0\0\0H\0\0p\1\0\0015\365", 12, 0, NULL, NULL) =
12
1272757518.046600 select(5, [4], NULL, NULL, {10, 0}) = 1 (in [4], left {9,
999997})
1272757518.046644 recvfrom(4,
"K\334\275\16\0\0\264\326\0\0\0\2\17\0\10\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\20"...,
60, 0, NULL, NULL) = 60
1272757518.046714 wait4(-1, 0x7fff22cbafbc, WNOHANG, NULL) = -1 ECHILD (No
child processes)
1272757518.046808 gettimeofday({1272757518, 46817}, NULL) = 0
1272757518.046844 nanosleep({18446744073709013837, 18446744073410341545},
0x7fff22cbaf90) = -1 EINVAL (Invalid argument)
1272757518.046886 sendto(4,
"\0\0\0$\0\0p\3\0\0017\253\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\17\0\10\0"..., 36,
0, NULL, 0) = 36
1272757518.047014 select(5, [4], NULL, NULL, {10, 0}) = 1 (in [4], left {9,
660755})
1272757518.386349 recvfrom(4, "\0\0\0H\0\0p\1\0\0015\365", 12, 0, NULL, NULL) =
12
1272757518.386398 select(5, [4], NULL, NULL, {10, 0}) = 1 (in [4], left {9,
999998})
1272757518.386442 recvfrom(4,
"K\334\275\16\0\5\344s\0\0\0\2\17\0\10\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\20"...,
60, 0, NULL, NULL) = 60
1272757518.386496 wait4(-1, 0x7fff22cbafbc, WNOHANG, NULL) = -1 ECHILD (No
child processes)
1272757518.386530 gettimeofday({1272757518, 386539}, NULL) = 0
1272757518.386562 nanosleep({18446744073709013957, 18446744073070619627},
0x7fff22cbaf90) = -1 EINVAL (Invalid argument)
1272757518.386599 sendto(4,
"\0\0\0$\0\0p\3\0\0017\253\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\17\0\10\0"..., 36,
0, NULL, 0) = 36
1272757518.386724 select(5, [4], NULL, NULL, {10, 0}) = 1 (in [4], left {9,
660103})
I believe what has happened is:
- Machine's clock is set far in the past
- pmie starts and notes the epoch for polling the pmcd. It then polls at even
intervals where: next poll == epoch + poll_count * poll_delay
- ntp starts, and warps time forward by a couple years.
- pmie spins, failing nanosleep() because the "next poll" is still a long way in
the past and advancing by poll_delay at each subsequent call until it gets to
the present time.
Attached is my first crack at fixing it, but I don't love it because:
1) It doesn't handle warping into the past. I think PMIE fails in
similar ways, in that the nanosleep() is a few years in length.
2) small nit: we still do the waitpid() call with the way I have it
setup now. I'd rather it not do any unnecessary work if its going to
spin/fast-forward itself.
I suspect if we fix #1 properly we may not need to worry about #2.
Any suggestions about how we proceed to fix this? I am mostly worried
about archives where the delta between samples may be very large. Are
we concerned about this in practice?
mh
pmie.patch
Description: Text Data
|