
Stalled xfs_repair on 100TB filesystem

To: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Subject: Stalled xfs_repair on 100TB filesystem
From: Jason Vagalatos <Jason.Vagalatos@xxxxxxxxxxxxxxxx>
Date: Tue, 2 Mar 2010 09:22:34 -0800
Cc: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Hello,
On Friday 2/26 I started an xfs_repair on a 100TB filesystem:

#> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &

I've been monitoring the process with 'top' and tailing the output file from 
the redirect above, and I believe the repair has stalled.  While the process 
was actively working, 'top' showed almost all physical memory in use and 12.6G 
of virtual memory consumed by xfs_repair.  It made it all the way to Phase 6 
and has been sitting at agno = 14 for almost 48 hours.  Its memory usage has 
since dropped, but the process is still "running" and consuming 100% CPU:

top - 10:10:37 up 3 days, 21:06,  1 user,  load average: 1.20, 1.13, 1.09
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.5%us,  0.0%sy,  0.0%ni, 87.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8177380k total,   896668k used,  7280712k free,   247100k buffers
Swap: 56525356k total,   173852k used, 56351504k free,   304588k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
32705 root      25   0  160m  95m  704 R  100  1.2   2629:53 xfs_repair

#> tail -f -n1000 xfs_repair.out.logfs1.sjc.02262010
........
        - agno = 98
        - agno = 99
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
<stopped here, fs has 99 ag's>
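
One way to put a number on "still running" might be to sample the CPU time 
recorded in /proc/<pid>/stat twice and compare.  This is only a minimal sketch, 
assuming a Linux /proc filesystem; the function names cpu_ticks_from_stat and 
sample_pid are illustrative and not part of xfs_repair or any xfsprogs tool:

```shell
#!/bin/sh
# Sum utime + stime (in clock ticks) from one /proc/<pid>/stat line.
# Fields are counted after the closing ')' of the command name, so a
# command name that contains spaces does not shift them.
cpu_ticks_from_stat() {
    rest=${1##*') '}        # drop the leading "pid (comm) "
    set -- $rest            # $1=state ... $12=utime $13=stime
    echo $(( ${12} + ${13} ))
}

# Hypothetical helper: sample a PID twice and report the CPU ticks used
# in between.  The interval defaults to 10 seconds.
sample_pid() {
    pid=$1
    interval=${2:-10}
    before=$(cpu_ticks_from_stat "$(cat "/proc/$pid/stat")")
    sleep "$interval"
    after=$(cpu_ticks_from_stat "$(cat "/proc/$pid/stat")")
    echo "pid $pid used $((after - before)) CPU ticks in ${interval}s"
}
```

Running "sample_pid 32705" against the PID shown by 'top' would report how 
many ticks the process burned over the interval.  A steadily growing count with 
no new lines in the output file would suggest a CPU-bound loop rather than slow 
I/O, but it would not by itself say whether the repair is making progress.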

Is there anything I can do at this point to salvage the repair?  Given how 
long the repair takes to run, I would rather not kill the process.  If I do 
kill it, is there any risk of damaging the filesystem?

Any help would be greatly appreciated.

Thank you
