HP's mail servers and SGI's don't seem to get along. Can someone look into it?
----- Forwarded message from "GOBEILLE,BOB (HP-FtCollins,ex1)" <bob.gobeille@xxxxxx> -----
Envelope-to: willy@xxxxxxxxxxxxxxxx
Delivery-date: Wed, 30 Oct 2002 14:58:40 +0000
From: "GOBEILLE,BOB (HP-FtCollins,ex1)" <bob.gobeille@xxxxxx>
To: "WILCOX,MATTHEW (HP-FtCollins,unix1)" <matthew.wilcox@xxxxxx>
Subject: FW: unlink deadlock
Date: Tue, 29 Oct 2002 14:09:48 -0500
X-Mailer: Internet Mail Service (5.5.2655.55)
Resent-From: willy@xxxxxxxxxxxxx
Resent-Date: Wed, 30 Oct 2002 07:58:32 -0700
Resent-To: willy@xxxxxxxxxxxxxxxx
Resent-Message-Id: <20021030145832.CD0AE461A@xxxxxxxxxxxxx>
Willy,
This message I sent to linux-xfs never got there. I also sent it to hch and
never received a reply. Either I'm being ignored as too ignorant or
my email isn't reaching its destination.
Would you do me a favor, and take my original email below, edit it however
you see fit and send it to linux-xfs? I just feel like we have some useful
data here if it could get to the right eyes.
Thanks,
Bob
-----Original Message-----
From: GOBEILLE,BOB (HP-FtCollins,ex1)
Sent: Saturday, October 26, 2002 5:46 PM
To: 'linux-xfs@xxxxxxxxxxx'
Subject: Re: PATCH: sleeping while holding a lock in _pagebuf_free_bh()::page_buf.c
This problem sounds somewhat similar to an intermittent problem we are
seeing on a 2.4.18 cluster of dual-CPU IA-64 nodes. Two processes on each
node are writing (different) checkpoint files to the same directory on an
XFS RAID 0 array and sometimes end up in a D state, locking the entire
filesystem (can't even do an ls). Each process writes a 155 MB
checkpoint file A, then a 155 MB checkpoint file B, then unlinks A and
writes a new A, then unlinks B, and so on. We're trying to reproduce the
problem with a single process writing two large files (1 MB to 3,000 MB) and
unlinking them over and over again, but haven't been successful yet.
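The write/unlink pattern described above could be sketched roughly as follows.
This is only an illustration in Python, not the code we are actually running:
the file names (ckpt.<tag>.A / ckpt.<tag>.B), the 1 MiB write size and the
helper names are all made up for the example, and whether the real
application syncs its writes is not known.

```python
import os
from multiprocessing import Process

CHUNK = 1 << 20  # assumption: write in 1 MiB pieces


def write_file(path, size_bytes):
    """Create (or truncate) path and fill it with size_bytes of zeros."""
    block = b"\0" * CHUNK
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, CHUNK)
            f.write(block[:n])
            remaining -= n


def checkpoint_cycle(directory, tag, size_bytes, iterations):
    """Write A, write B, then repeatedly unlink/rewrite A and unlink/rewrite B."""
    # Hypothetical file names; each writer gets its own pair via `tag`.
    a = os.path.join(directory, f"ckpt.{tag}.A")
    b = os.path.join(directory, f"ckpt.{tag}.B")
    write_file(a, size_bytes)
    write_file(b, size_bytes)
    for _ in range(iterations):
        os.unlink(a)
        write_file(a, size_bytes)
        os.unlink(b)
        write_file(b, size_bytes)


def run_writers(directory, size_bytes, iterations, nprocs=2):
    """Run nprocs writers against the same directory; return their exit codes."""
    procs = [
        Process(target=checkpoint_cycle, args=(directory, i, size_bytes, iterations))
        for i in range(nprocs)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

Calling run_writers() with a directory on the XFS array, a size of
155 * 1024 * 1024 and nprocs=2 would approximate the two-process case;
nprocs=1 corresponds to the single-process attempt. There is no claim that
this triggers the hang.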
I haven't tried your patches because I don't understand them. Do you think
your patches might address this problem? We are also trying to reproduce
the problem on the 32 node cluster with a kdb instrumented 2.4.18.
Here is some data (truncated to the D state processes) on the state of one
node where two processes (nwchem.0) are doing an unlink (of different files)
and getting in a D state locking the filesystem. The other commands (touch,
ls, ...) were then done to show that the fs is really locked up.
>ps -eo pid,user,fname,tmout,f,wchan
PID USER COMMAND TMOUT F WCHAN
1744 edo nwchem.o - 000 down
6623 hpsupt bash - 000 down
7308 root updatedb - 100 down
7700 root touch - 000 down
7757 root ls - 000 down
8254 hpsupt bash - 000 down
1745 edo nwchem.o - 000 down
8429 hpsupt bash - 000 down
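The listing above was cut down to the D state processes by hand. A small
filter along the following lines could do the same thing automatically. It is
a sketch that assumes a different column set from the one used above, namely
the output of `ps -eo pid,stat,wchan,comm`, so that the state is available
and the command name comes last; the function name is made up.

```python
def d_state_rows(ps_text):
    """Return (pid, wchan, command) for every row whose STAT starts with 'D'.

    Assumes ps_text is the output of `ps -eo pid,stat,wchan,comm`,
    including the header line.
    """
    rows = []
    for line in ps_text.splitlines()[1:]:  # skip the header
        fields = line.split(None, 3)       # command is last and may contain spaces
        if len(fields) == 4 and fields[1].startswith("D"):
            rows.append((int(fields[0]), fields[2], fields[3]))
    return rows
```

The text could come from something like
subprocess.run(["ps", "-eo", "pid,stat,wchan,comm"], capture_output=True,
text=True).stdout on a live node.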
>top -b -n 1
1:02pm up 1 day, 57 min, 4 users, load average: 8.99, 8.97, 8.91
96 processes: 90 sleeping, 1 running, 5 zombie, 0 stopped
CPU0 states: 1.1% user, 3.7% system, 0.0% nice, 95.1% idle
CPU1 states: 0.0% user, 0.2% system, 0.0% nice, 99.8% idle
Mem: 12365920K av, 1739696K used, 10626224K free, 0K shrd, 59056K buff
Swap: 12287920K av, 3792K used, 12284128K free 969888K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
1744 edo 12 0 1983M 689M 559M D 0.0 1.4 80:01 nwchem.ok
6623 hpsupt 12 0 5376 5376 3888 D 0.0 0.0 0:00 bash
7308 root 20 19 2352 2352 1808 D N 0.0 0.0 0:00 updatedb
7700 root 12 0 2144 2144 1792 D 0.0 0.0 0:00 touch
7757 root 12 0 2560 2560 2080 D 0.0 0.0 0:00 ls
7890 hpsupt 12 0 5360 5360 3872 D 0.0 0.0 0:00 bash
8254 hpsupt 12 0 5360 5360 3872 D 0.0 0.0 0:00 bash
1745 edo 12 0 1982M 687M 558M D 0.0 1.4 80:07 nwchem.ok
8429 hpsupt 12 0 5360 5360 3872 D 0.0 0.0 0:00 bash
99 active task structs found
>>
#
## 0xe00000028b6e0000 19911 1744 1 2 0 - nwchem.ok
#
>> deftask 0xe00000028b6e0000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe00000028b6e0000 (nwchem.ok)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 sys_unlink+332 [0xe0000000044ee28c]
================================================================
>>
#
## 0xe0000040f1188000 19911 1745 1 2 0x8000 - nwchem.ok
#
>> deftask 0xe0000040f1188000
Default task is 0xe0000040f1188000
>> trace -f | more
================================================================
STACK TRACE FOR TASK: 0xe0000040f1188000 (nwchem.ok)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 pagebuf_lock+140 [0xe0000000046de4ec]
3 _pagebuf_find_lockable_buffer+540 [0xe0000000046ddcfc]
4 _pagebuf_get_lockable_buffer+140 [0xe0000000046de12c]
5 pagebuf_get+60 [0xe0000000046d50bc]
6 xfs_trans_get_buf+508 [0xe0000000046bac1c]
7 xfs_btree_get_bufs+156 [0xe000000004657f1c]
8 xfs_alloc_newroot+204 [0xe00000000462bdcc]
9 xfs_alloc_insrec+124 [0xe00000000462a23c]
10 xfs_alloc_insert+252 [0xe00000000462e2fc]
11 xfs_free_ag_extent+2316 [0xe000000004625b4c]
12 xfs_alloc_fix_freelist+1740 [0xe00000000462668c]
13 xfs_free_extent+332 [0xe00000000462816c]
14 xfs_bmap_finish+508 [0xe000000004647abc]
15 xfs_itruncate_finish+700 [0xe0000000046937fc]
16 xfs_inactive+1468 [0xe0000000046c6fdc]
17 vn_put+380 [0xe0000000046f0dbc]
18 linvfs_put_inode+44 [0xe0000000046edf2c]
19 iput+268 [0xe00000000450658c]
20 d_delete+368 [0xe000000004500d10]
21 vfs_unlink+1132 [0xe0000000044ee0ac]
22 sys_unlink+524 [0xe0000000044ee34c]
================================================================
>>
#
## 0xe00000409f888000 500 6623 1 2 0 - bash
#
>> deftask 0xe00000409f888000
Default task is 0xe00000409f888000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe00000409f888000 (bash)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 real_lookup+76 [0xe0000000044e7f8c]
3 link_path_walk+4572 [0xe0000000044e981c]
4 path_walk+44 [0xe0000000044e9f6c]
5 __user_walk+124 [0xe0000000044eb01c]
6 sys_newlstat+60 [0xe0000000044e10dc]
================================================================
>>
#
## 0xe000000270578000 0 7308 7305 2 0x100 - updatedb
#
>> deftask 0xe000000270578000
Default task is 0xe000000270578000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe000000270578000 (updatedb)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 vfs_readdir+172 [0xe0000000044f4aec]
3 sys_getdents64+156 [0xe0000000044f5a3c]
================================================================
>>
#
## 0xe0000001ac3e0000 0 7700 1 2 0 - touch
#
>> deftask 0xe0000001ac3e0000
Default task is 0xe0000001ac3e0000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe0000001ac3e0000 (touch)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 open_namei+572 [0xe0000000044eb5bc]
3 filp_open+92 [0xe0000000044c9dfc]
4 sys_open+156 [0xe0000000044ca81c]
================================================================
>>
#
## 0xe0000002ab978000 0 7757 1 2 0 - ls
#
>> deftask 0xe0000001ac3e0000
Default task is 0xe0000001ac3e0000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe0000001ac3e0000 (touch)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 open_namei+572 [0xe0000000044eb5bc]
3 filp_open+92 [0xe0000000044c9dfc]
4 sys_open+156 [0xe0000000044ca81c]
================================================================
>>
#
## 0xe000000196798000 500 7890 1 2 0 - bash
#
>> deftask 0xe000000196798000
Default task is 0xe000000196798000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe000000196798000 (bash)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 real_lookup+76 [0xe0000000044e7f8c]
3 link_path_walk+4572 [0xe0000000044e981c]
4 path_walk+44 [0xe0000000044e9f6c]
5 __user_walk+124 [0xe0000000044eb01c]
6 sys_newlstat+60 [0xe0000000044e10dc]
================================================================
>>
#
## 0xe0000001940d8000 500 8254 1 2 0 - bash
#
>> deftask 0xe0000001940d8000
Default task is 0xe0000001940d8000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe0000001940d8000 (bash)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 real_lookup+76 [0xe0000000044e7f8c]
3 link_path_walk+4572 [0xe0000000044e981c]
4 path_walk+44 [0xe0000000044e9f6c]
5 __user_walk+124 [0xe0000000044eb01c]
6 sys_newlstat+60 [0xe0000000044e10dc]
================================================================
>>
#
## 0xe000000191e38000 500 8429 1 2 0 - bash
#
>> deftask 0xe000000191e38000
Default task is 0xe000000191e38000
>> trace -f
================================================================
STACK TRACE FOR TASK: 0xe000000191e38000 (bash)
0 schedule+3724 [0xe00000000446d06c]
1 __down+524 [0xe0000000044246ec]
2 real_lookup+76 [0xe0000000044e7f8c]
3 link_path_walk+4572 [0xe0000000044e981c]
4 path_walk+44 [0xe0000000044e9f6c]
5 __user_walk+124 [0xe0000000044eb01c]
6 sys_newlstat+60 [0xe0000000044e10dc]
================================================================
>>
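Every trace above has schedule and __down as its top two frames, so the
interesting part is the frame that called __down (sys_unlink, pagebuf_lock,
real_lookup, vfs_readdir, open_namei). The following Python sketch pulls that
frame out per task. It assumes the trace text has exactly the layout shown
above ("STACK TRACE FOR TASK: <addr> (<name>)" followed by numbered
"func+offset [address]" lines); the function name is made up, and the output
is a summary only, not a diagnosis.

```python
import re

TASK_RE = re.compile(r"STACK TRACE FOR TASK: (0x[0-9a-f]+) \((.+)\)")
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\w+)\+\d+\s+\[0x[0-9a-f]+\]")


def down_callers(trace_text):
    """Map each task address to the function in the frame just below __down."""
    callers = {}
    task = None
    prev_func = None
    for line in trace_text.splitlines():
        header = TASK_RE.search(line)
        if header:
            # A new trace begins; forget the previous task's frames.
            task = header.group(1)
            prev_func = None
            continue
        frame = FRAME_RE.match(line)
        if frame and task is not None:
            func = frame.group(2)
            if prev_func == "__down" and task not in callers:
                callers[task] = func
            prev_func = func
    return callers
```

Grouping the result by function name gives a quick count of how many tasks are
waiting at each call site.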
----- End forwarded message -----
--
Revolutions do not require corporate support.