Date: Wed, 3 Mar 2010 11:25:00 +1100
From: Dave Chinner
To: Jason Vagalatos
Cc: "xfs@oss.sgi.com"
Subject: Re: Stalled xfs_repair on 100TB filesystem
Message-ID: <20100303002500.GH18369@discord.disaster>

On Tue, Mar 02, 2010 at 09:22:34AM -0800, Jason Vagalatos wrote:
> Hello, On Friday 2/26 I started an xfs_repair on a 100TB
> filesystem:
>
> #> nohup xfs_repair -v -l /dev/logfs-sessions/logdev /dev/logfs-sessions/sessions > /root/xfs_repair.out.logfs1.sjc.02262010 &
>
> I've been monitoring the process with 'top' and tailing the output
> file from the redirect above.  I believe the repair has
> "stalled".  When the process was running 'top' showed almost all
> physical memory consumed and 12.6G of virt memory consumed by
> xfs_repair.  It made it all the way to Phase 6 and has been
> sitting at agno = 14 for almost 48 hours.  The memory consumption
> of xfs_repair has ceased but the process is still "running" and
> consuming 100% CPU:

I wish we could reproduce hangs like this easily. I'd kill the
repair and run with the -P option. From the xfs_repair man page:

       -P     Disable prefetching of inode and directory blocks. Use
              this option if you find xfs_repair gets stuck and stops
              proceeding. Interrupting a stuck xfs_repair is safe.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
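
A rough sketch of the suggestion above, for illustration only. It assumes the same log and data device paths as the quoted command; the output file name, the RUN switch and the repair_cmd helper are hypothetical and not part of the original thread. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: device paths are taken from the quoted command above.
LOGDEV=/dev/logfs-sessions/logdev
DATADEV=/dev/logfs-sessions/sessions
OUT=/root/xfs_repair.noprefetch.out   # hypothetical output file name

# Print the rerun command with prefetching disabled (-P).
repair_cmd() {
    printf 'xfs_repair -P -v -l %s %s\n' "$LOGDEV" "$DATADEV"
}

# Dry run by default. Set RUN=1 to actually stop the stalled repair
# and start a new one.
if [ "${RUN:-0}" = "1" ]; then
    # Interrupting a stuck xfs_repair is safe per the man page.
    pkill -x xfs_repair
    # Wait for the old process to exit before starting the new run.
    while pgrep -x xfs_repair >/dev/null; do sleep 1; done
    nohup $(repair_cmd) > "$OUT" 2>&1 &
else
    echo "would run: pkill -x xfs_repair"
    echo "would run: nohup $(repair_cmd) > $OUT 2>&1 &"
fi
```

Running the script without RUN=1 is a way to check the command line before touching the stalled process.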