Hi Team,
This is a long report. I know some of you may be busy with the Kernel
Summit, but we just can't believe XFS has such a serious bug. For
reasons of our own, we must stick with 2.4.5-xfs-1.0.1.
We have the super-pcserver connected to 5 Mac clients through a gigabit
switch; each client also has a gigabit Ethernet NIC. After about 2-3
hours of stress testing, the load average climbs as high as 21 and
keeps increasing, and all of the clients lose their connections.
During the tests, there are many error messages:
__alloc_pages: 0-order allocation failed
__alloc_pages: 0-order allocation failed
__alloc_pages: 0-order allocation failed
............
Bursts of these messages appear nearly every 10 minutes, and about 2
hours later the system is dead.
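In case it helps to see what is failing: an order-0 request is the
smallest one the buddy allocator handles, 2^0 = 1 page (4 KiB on i386).
Here is a minimal sketch, against the stock 2.4 allocator API as I
understand it (grab_one_page is just a made-up name for illustration):

    #include <linux/mm.h>      /* __get_free_pages, free_pages */
    #include <linux/errno.h>

    /* Sketch only: request a single page, the same class of
     * allocation that is failing in our logs. */
    static int grab_one_page(void)
    {
            unsigned long addr = __get_free_pages(GFP_KERNEL, 0);
            if (!addr)
                    return -ENOMEM;   /* "0-order allocation failed" */
            free_pages(addr, 0);
            return 0;
    }

So even the cheapest possible requests are being rejected.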
The server is a dual 2.0 GHz Xeon with 2 GB of Rambus RAM, ADTX 500 GB
RAID storage, and a BCM gigabit Ethernet card. Hyper-Threading is
enabled, the kernel is booted with the "noapic" option, and it is
2.4.5-xfs-1.0.1 compiled *without* XFS_DEBUG. The partition size is
500 GB.
When the system died, kdb showed the following zone information:
DMA(4096 pages): free 559, inactive_clean 0, inactive_dirty 119
watermark is (128, 256, 384)
Normal(225280 pages): free 21332, inactive_clean 92, inactive_dirty 4495
watermark is (255, 510, 765)
High(294784 pages): free 516, inactive_clean 189311, inactive_dirty 23466
Overall, the system has 204281 active pages and 28080 inactive_dirty pages.
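As a sanity check on those numbers: 4096 + 225280 + 294784 = 524160
pages, and 524160 x 4 KiB is roughly 2047 MB, which matches our 2 GB of
RAM. If I am reading the kdb output right, what strikes me is that the
High zone holds 189311 inactive_clean pages but only 516 free ones, so
there is plenty of reclaimable memory that is apparently not being
reclaimed.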
bt shows kswapd waiting on xfs_ilock, and atalkd's backtrace is:
fsync_dev
sys_sync
panic
kmem_zone_zalloc
xfs_btree_init_cursor
xfs_alloc_ag_vextent_near
xfs_alloc_ag_vextent
xfs_alloc_vextent
xfs_bmap_alloc
xfs_bmapi
pagebuf_delalloc_convert
pagebuf_write_full_page
linvfs_write_full_page_nounlock
_write_buffer
sync_buffers
fsync_dev
sys_sync
panic
kmem_zone_zalloc
xfs_btree_init_cursor
xfs_alloc_ag_vextent_near
xfs_alloc_ag_vextent
xfs_alloc_vextent
xfs_bmap_alloc
xfs_bmapi
xfs_strategy
linvfs_pb_bmap
pagebuf_delalloc_convert
pagebuf_write_full_page
linvfs_write_full_page_nounlock
try_to_free_buffers
page_launder
do_try_to_free_buffers
try_to_free_pages
__alloc_pages
__get_free_pages
__pollwait
datagram_poll
....
do_select
sys_select
At that time, no other processes were in XFS or VM code.
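If I read the trace correctly (kdb prints the innermost frame first),
atalkd's select() tried to allocate a page for its poll table,
__alloc_pages fell into try_to_free_pages, reclaim pushed a
delayed-allocate XFS page out through pagebuf_delalloc_convert, and the
kmem_zone_zalloc for the btree cursor failed and panicked. The repeated
sys_sync/fsync_dev frames above the panic would then come from panic()
itself, which in 2.4 tries to sync before halting; roughly, from my
reading of kernel/panic.c:

    /* 2.4 panic(), paraphrased: the emergency sync re-enters the
     * same starved XFS allocation path, so the frames repeat */
    if (in_interrupt())
            printk(KERN_EMERG "In interrupt handler - not syncing\n");
    else
            sys_sync();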
During the test, I found that inactive_clean + inactive_dirty in the
Normal zone keeps decreasing. Is that correct VM behaviour? Is XFS
leaking pages somewhere?
I would appreciate anyone's help, and I'd be glad to provide any
further information that is needed.
Thanks,
Eric