Stalled xfs_repair on 100TB filesystem
Jason.Vagalatos at citrixonline.com
Tue Mar 2 11:22:34 CST 2010
On Friday 2/26 I started an xfs_repair on a 100TB filesystem:
#> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &
I've been monitoring the process with 'top' and tailing the output file from the redirect above, and I believe the repair has stalled. While the repair was progressing, 'top' showed almost all physical memory in use and xfs_repair holding 12.6G of virtual memory. It made it all the way to Phase 6 and has been sitting at agno = 14 for almost 48 hours. xfs_repair's memory use has since dropped off, but the process is still in the running state and consuming 100% of a CPU:
top - 10:10:37 up 3 days, 21:06, 1 user, load average: 1.20, 1.13, 1.09
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.5%us, 0.0%sy, 0.0%ni, 87.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8177380k total, 896668k used, 7280712k free, 247100k buffers
Swap: 56525356k total, 173852k used, 56351504k free, 304588k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32705 root 25 0 160m 95m 704 R 100 1.2 2629:53 xfs_repair
#> tail -f -n1000 xfs_repair.out.logfs1.sjc.02262010
- agno = 98
- agno = 99
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
<stopped here; the fs has 99 AGs>
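Since 'top' only shows CPU and memory, one further check is whether the process is still issuing disk I/O. The following is a rough sketch, not a definitive diagnostic: it assumes a Linux kernel that exposes per-task I/O accounting in /proc/<pid>/io (older kernels may not), and the function names io_field, io_delta and sample_io are hypothetical helpers made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes /proc/<pid>/io exists (per-task I/O accounting).
# All function names here are illustrative, not part of any XFS tool.

# io_field TEXT FIELD: print the value of FIELD (e.g. read_bytes)
# from text in the "name: value" layout used by /proc/<pid>/io.
io_field() {
    printf '%s\n' "$1" | awk -v key="$2:" '$1 == key { print $2 }'
}

# io_delta BEFORE AFTER: print how many bytes were read and written
# between two snapshots of /proc/<pid>/io.
io_delta() {
    r0=$(io_field "$1" read_bytes);  r1=$(io_field "$2" read_bytes)
    w0=$(io_field "$1" write_bytes); w1=$(io_field "$2" write_bytes)
    echo "read_delta=$((r1 - r0)) write_delta=$((w1 - w0))"
}

# sample_io PID SECONDS: take two snapshots SECONDS apart and report
# the difference. Typically needs to run as root for another user's PID.
sample_io() {
    before=$(cat "/proc/$1/io") || return 1
    sleep "$2"
    after=$(cat "/proc/$1/io") || return 1
    io_delta "$before" "$after"
}
```

For the process above this would be run as `sample_io 32705 60`. Deltas of zero alongside 100% CPU would suggest the process is spinning in memory rather than waiting on the disks, though that alone does not prove the repair is stuck.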
Is there anything I can do at this point to salvage the repair? Given how long the repair takes to run, I would rather not kill the process. If I do kill it, is there any risk of damaging the filesystem?
Any help would be greatly appreciated.