
Re: 2.4.18-14SGI_XFS_1.2a1 oops && raid5 troubles

To: Daryl Herzmann <akrherz@xxxxxxxxxxx>
Subject: Re: 2.4.18-14SGI_XFS_1.2a1 oops && raid5 troubles
From: Simon Matter <simon.matter@xxxxxxxxxxxxxxxx>
Date: Mon, 07 Oct 2002 17:13:06 +0200
Cc: linux-xfs@xxxxxxxxxxx
Organization: Sauter AG, Basel
References: <Pine.LNX.4.44.0210070843550.4175-300000@xxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
Daryl Herzmann wrote:
> 
> Hi!
>     You all have been great help in the past.  Hopefully you can help me
> save my 800+ GB data partition!
> 
> It all started after a successful upgrade to RH 8.0.  I swapped video
> cards to play with a Matrox G450 and everything locked hard after starting
> X once, screen went dark, no ethernet response.  So I hard reset <sigh>
> 
> So once the machine rebooted, my raid5 array (8x120, no spares) started a
> reconstruction.  After about 20 minutes, I got this error
> 
> Oct  4 21:05:28 pircsds0 kernel: hdf: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Oct  4 21:05:28 pircsds0 kernel: hdf: dma_intr: error=0x01 {
> AddrMarkNotFound }, LBAsect=43686424, high=2, low=10131992,
> sector=43686315
> Oct  4 21:05:30 pircsds0 kernel: hdf: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Oct  4 21:05:30 pircsds0 kernel: hdf: dma_intr: error=0x01 {
> AddrMarkNotFound }, LBAsect=43686424, high=2, low=10131992,
> sector=43686315
> Oct  4 21:05:31 pircsds0 kernel: hdf: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Oct  4 21:05:31 pircsds0 kernel: hdf: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=43686424, high=2, low=10131992,
> sector=43686315
> Oct  4 21:05:31 pircsds0 kernel: end_request: I/O error, dev 21:41 (hdf),
> sector 43686315
> 
> From the syslogs, the raid array went into degraded mode and then the machine
> locked up.  Again no eth0 or video.  So I hard reset again <sigh>
> 
> Seeing those DMA errors, I decided to disable DMA for the drives and then
> let it reconstruct that way.  Well, the estimations were about 15 days
> for raid reconstruction, so I just disabled DMA for hdf.  After about 90
> minutes, hdl produced the same DMA errors. Sooo, I stopped everything,
> marked hdf1 as being a failed disk and started raid5 in degraded mode.

I don't think what you are seeing is a DMA problem; it looks like plain bad
sectors on the disks (that is my own experience with IBM deathstar disks).
The problem is that after a crash the RAID resync reads the whole disk and
so hits all those errors, which you don't notice otherwise because nothing
accesses the exact locations where the disk has failed. That's why hardware
RAID controllers like 3ware can do a background surface check (or whatever
they call it; I don't have 3ware hardware).
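
As a rough sketch of what such a surface check amounts to (this is not
3ware's implementation, only an illustration): read the whole device from
start to end and note where the reads fail. The function name and the chunk
size below are my own choices, and the device path in the usage note is
only an example.

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` from start to end, read-only.

    Returns the byte offsets of the chunks that could not be read.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # The kernel reported an I/O error for this chunk:
                # remember where it happened and skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # end of the device or file
            offset += len(data)
    finally:
        os.close(fd)
    return bad_offsets
```

Run as root against a single member disk, for example
scan_for_read_errors("/dev/hdf"), it only reads and never writes. Blocks
that are already in the page cache may be served from memory, so it is best
run on a disk that is not otherwise in use, and a smaller chunk size
narrows down the failing area at the cost of a slower scan.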

I'm using 4 disks on a Promise PDC20268 TX2 Ultra-100 controller and
had a problem recently. I had put two disks on each channel, and when
one disk had a problem it blocked I/O on the whole channel, so the other
disk on that channel was blocked too and I ended up with a corrupt RAID5.
I rebooted, mounted, and it crashed again. I then physically removed
the bad drive and marked it as a failed disk in /etc/raidtab. Then I
recreated the RAID5 volume with mkraid -f /dev/mdx, which in fact
brought the RAID5 back, and I was able to mount the XFS volume and it was
all fine.
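
For illustration, a raidtab with one member marked as failed could look
roughly like the fragment below. The device names, chunk size and parity
algorithm are made up for the example and are not the values from my
machine. Everything except the failed-disk line has to match the settings
the array was originally created with (same disk order, same chunk size),
otherwise recreating the array will not give you your data back.

```
# /etc/raidtab (hypothetical example values)
raiddev /dev/md0
    raid-level            5
    nr-raid-disks         4
    nr-spare-disks        0
    persistent-superblock 1
    parity-algorithm      left-symmetric
    chunk-size            64

    device                /dev/hde1
    raid-disk             0
    # the bad or removed drive: failed-disk instead of raid-disk
    device                /dev/hdf1
    failed-disk           1
    device                /dev/hdg1
    raid-disk             2
    device                /dev/hdh1
    raid-disk             3
```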

HTH
Simon

> 
> At this point, I am getting desperate :)  So I ran xfs_repair on /dev/md0
> and did not get any errors, so I mounted the device, again no errors.  I
> then tried doing a simple 'ls -l' in the top level directory and
> immediately got this (attached as error.txt).  So I ran ksymoops on it and
> that is attached as well (ksymoops.txt).
> 
> Does anybody have any ideas about how to proceed?  Some other bits of
> information are
> 
>   1.  I am using 36 inch 80 pin cables
>   2.  The eight drives are on two Promise PDC20269 TX2  Ultra-133
>       controllers
>   3.  This array has been functional for over 10 months, but it has never
>       experienced a crash/hard reset.
>   4.  This is not the same system that I have reported raid5/XFS troubles
>       before.
> 
> Thanks,
>   Daryl
> 
> --
> /**
>  * Daryl Herzmann (akrherz@xxxxxxxxxxx)
>  * Program Assistant -- Iowa Environmental Mesonet
>  * http://mesonet.agron.iastate.edu
>  */
> 
>   ------------------------------------------------------------------------
>                 Name: error.txt
>    error.txt    Type: Plain Text (TEXT/plain)
>             Encoding: BASE64
> 
>                    Name: ksymoops.txt
>    ksymoops.txt    Type: Plain Text (TEXT/plain)
>                Encoding: BASE64

