
Re: hdd strange badblocks problem

To: Linux-XFS Mailing List <linux-xfs@xxxxxxxxxxx>
Subject: Re: hdd strange badblocks problem
From: Federico Sevilla III <jijo@xxxxxxxxxxx>
Date: Tue, 6 Jul 2004 17:04:35 +0800
In-reply-to: <20040705213659.GA29703@pooh>
Mail-followup-to: Linux-XFS Mailing List <linux-xfs@xxxxxxxxxxx>
References: <20040705213659.GA29703@pooh>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.5.6+20040523i
On Mon, Jul 05, 2004 at 11:36:59PM +0200, Laszlo 'GCS' Boszormenyi wrote:
> So actually my problem is that we have a small server in our office,
> which has served us well for the last one and a half years. However,
> when I came in this morning, I realised it was running slow. At first
> I thought it was a network problem, but I quickly recognised that the
> second and 'big' -- read 120 Gb Maxtor IDE -- hdd has read errors on
> it. As it's quite new, less than six months old, for a few minutes I
> hoped it wasn't really true.

I have a fairly new (also less than six months old) 120GB Seagate IDE
hard drive that uses XFS, and I recently ran into hardware read problems
with it, too. I didn't need the data on it (it's used for backups via
rsnapshot, and both machines it was backing up were fine, so I could do
without the backups for a short while), so I used DBAN
[http://dban.sourceforge.net/] and ran multiple rounds of PRNG writes
with verification at every step. What this basically did was write
random data to every sector of the drive and then read it back to make
sure the data was actually written, about 45 times in all (I chose how
many rounds to run; each round takes about 1.5 hours).

(Note: in case it's not yet obvious, I wiped out the entire drive doing
this, which was okay in my case since I didn't need the data and really
just wanted to make sure the drive was actually okay.)
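For what it's worth, the write-then-verify pass is simple enough to
sketch. Below is an illustrative Python sketch run against an ordinary
file standing in for the device (the function name and parameters are my
own invention, not DBAN's); on a real drive you would point DBAN or a
destructive tool like badblocks -w at the block device instead:

```python
import os

def wipe_and_verify(path, sector_size=512, passes=2):
    """One DBAN-style round per pass: fill every sector with random
    data, flush to disk, then read each sector back and compare."""
    nsectors = os.path.getsize(path) // sector_size
    for _ in range(passes):
        # Keep the pattern we wrote so we can verify it afterwards.
        pattern = [os.urandom(sector_size) for _ in range(nsectors)]
        with open(path, "r+b") as f:
            for i, block in enumerate(pattern):
                f.seek(i * sector_size)
                f.write(block)  # on a real drive, a write here is what
                                # triggers the firmware's sector remap
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            for i, block in enumerate(pattern):
                f.seek(i * sector_size)
                if f.read(sector_size) != block:
                    return False  # a sector silently failed to hold data
    return True
```

A mismatch on the read-back is the signal that a sector silently failed
to hold what was written, which is exactly what the verification step of
each round is there to catch.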

My rationale for doing this is that modern drives automatically remap
bad sectors using a set of reserved sectors. This is done on-the-fly,
but only happens when you write to a bad sector. Reading from a bad
sector will just give you an error. This seems to have done the trick:
I've been using the drive for five days now, reading from and writing
to it intensively every night during the automatic backup, and haven't
seen any errors so far.
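To make that remap-on-write behaviour concrete, here is a toy model of
the firmware's logic (the class and its names are entirely hypothetical,
purely for illustration): reads from a pending-bad sector fail, while
the first write to it consumes a reserved spare and clears the defect.

```python
class SimulatedDrive:
    """Toy model of drive-firmware sector remapping. Not a real
    driver; the behaviour is illustrative only."""
    def __init__(self, bad_sectors, spares=8):
        self.data = {}                 # lba -> last data written
        self.bad = set(bad_sectors)    # pending (unreadable) sectors
        self.spares = spares           # reserved replacement sectors

    def read(self, lba):
        if lba in self.bad:
            # This is the point where Linux logs an IDE read error.
            raise IOError(f"read error at LBA {lba}")
        return self.data.get(lba, b"\x00" * 512)

    def write(self, lba, block):
        if lba in self.bad and self.spares > 0:
            # Firmware transparently swaps in a spare on write;
            # the host never sees the remap happen.
            self.bad.discard(lba)
            self.spares -= 1
        self.data[lba] = block
```

The point of the model: no amount of reading fixes anything, which is
why a read-only surface scan can keep reporting the same bad sectors
while a full overwrite makes them disappear.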

It may be worth mentioning that the drive consistently passed the full
media scans I ran using Seagate's SeaTools utility, both before and
after the IDE read errors showed up under Linux.

> So I began experimenting; someone told me there's a good program
> called HDD Regenerator, which can cure bad blocks. OK, I used it, and
> although I did not wait until it finished, the first 60 Gb was
> checked, about 500 bad sectors were found, and 'fixed'. Wow, I checked
> the partition table, and that's readable and correct! I begin to have
> faith again.

Maybe you want to run HDD Regenerator to completion, so that it fixes
the entire drive, before running xfs_repair?

> How should I proceed? I have tried to do 'xfs_repair -n -v /dev/hde1':

As recommended by the experts on the list, run xfs_repair on /dev/hde1
without "-n". If this is able to fix everything, and you can mount
/dev/hde1, view all the files, and xfs_check gives your filesystem a
clean bill of health, then you're good.

There's an off-chance that it won't be able to fix things completely,
though. In my particular case I ran into corruption on a filesystem on a
perfectly good internal drive, because the entire IO system locked up
when operations on a separate usb-storage drive also running XFS froze.

xfs_repair was able to fix things partially, but I ran into errors
similar to those detailed in the mailing list archives in
<http://marc.free.net.ph/thread/20030223.163330.ad33fb2e.html>.
Fortunately only three files were corrupted and I didn't need them, so I
used the tips detailed in the previously-mentioned thread to remove the
files using xfs_db.

 --> Jijo

-- 
Federico Sevilla III : jijo.free.net.ph : When we speak of free software
GNU/Linux Specialist : GnuPG 0x93B746BE : we refer to freedom, not price.


From: Chris Wedgwood <cw@xxxxxxxx>
To: Andi Kleen <ak@xxxxxxx>
Cc: linux-xfs@xxxxxxxxxxx
Subject: Re: [PATCH] deadlocks on ENOSPC
Date: Tue, 6 Jul 2004 04:53:33 -0700
Message-ID: <20040706115333.GA2098@xxxxxxxxxxxxxxxxxxxxx>
In-Reply-To: <20040612040838.020a2efb.ak@xxxxxxx>
References: <20040612040838.020a2efb.ak@xxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx

On Sat, Jun 12, 2004 at 04:08:38AM +0200, Andi Kleen wrote:

> I've been tracking a deadlock on out of space conditions.

I can get these too...  but not just with processes stuck in D state;
sometimes it will wedge the entire machine up solid (no network
activity, sysrq will not bring the screen out of sleep, etc).  This is
a little surprising on an SMP machine.

Have you seen anything similar?

It's fairly repeatable.  The NMI watchdog is busted and I've not had a
chance to look into this, but the fact that everything dies seems odd.
(My best guess right now is that it's wedging up the CPU which happens
to have the interrupt routing for eth0 and the 8042 on it, but it seems
a little odd that it would do that every time.)


  --cw


From: Andi Kleen <ak@xxxxxxx>
To: Chris Wedgwood <cw@xxxxxxxx>
Cc: linux-xfs@xxxxxxxxxxx
Subject: Re: [PATCH] deadlocks on ENOSPC
Date: Tue, 6 Jul 2004 14:08:23 +0200
Message-Id: <20040706140823.5fea0584.ak@xxxxxxx>
In-Reply-To: <20040706115333.GA2098@xxxxxxxxxxxxxxxxxxxxx>
References: <20040612040838.020a2efb.ak@xxxxxxx>
        <20040706115333.GA2098@xxxxxxxxxxxxxxxxxxxxx>
X-Mailer: Sylpheed version 0.9.11 (GTK+ 1.2.10; i686-pc-linux-gnu)
Sender: linux-xfs-bounce@xxxxxxxxxxx

On Tue, 6 Jul 2004 04:53:33 -0700
Chris Wedgwood <cw@xxxxxxxx> wrote:

> On Sat, Jun 12, 2004 at 04:08:38AM +0200, Andi Kleen wrote:
> 
> > I've been tracking a deadlock on out of space conditions.
> 
> I can get these too...  but not just with processes stuck in D,
> sometimes it will wedge the entire machine up solid (no network
> activity, sysrq will not bring the screen out of sleep, etc).  This is
> a little surprising on an SMP machine.
> 
> Have you seen anything similar?

Nope, I only see file system deadlocks, but the machine is still quite
usable.

-Andi

