On Fri, Mar 01, 2013 at 01:24:36PM +0100, Ole Tange wrote:
> On Fri, Mar 1, 2013 at 12:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Thu, Feb 28, 2013 at 04:22:08PM +0100, Ole Tange wrote:
> >> I forced a RAID online. I have done that before and xfs_repair
> >> normally removes the last hour of data or so, but saves everything
> >> else.
> > Why did you need to force it online?
> More than 2 harddisks went offline. We have seen that before and it is
> not due to bad harddisks. It may be due to driver/timings/controller.
I thought that might be the case. What filesystem errors occurred
when the drives went offline?
> >> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> >> Phase 1 - find and verify superblock...
> >> Phase 2 - using internal log
> >> - scan filesystem freespace and inode maps...
> >> flfirst 232 in agf 91 too large (max = 128)
> > Can you run:
> > # xfs_db -c "agf 91" -c p /dev/md5p1
> > And post the output?
> # xfs_db -c "agf 91" -c p /dev/md5p1
> xfs_db: cannot init perag data (117)
Interesting. It's detecting corrupt AG headers.
> magicnum = 0x58414746
> versionnum = 1
> seqno = 91
> length = 268435200
> bnoroot = 295199
> cntroot = 13451007
> bnolevel = 2
> cntlevel = 2
> flfirst = 232
> fllast = 32
> flcount = 191
That implies that the free list is actually 232+191-32 = 391
entries long. That doesn't add up any way I look at it. Both the
flfirst and flcount fields look wrong here, which rules out a simple
bit error as the problem. I can't see how these values would have
been written by XFS as they are out of range for 512 byte sector
if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp))
agf->agf_flfirst = 0;
So I suspect that something more than just disks going offline
went wrong here, as I've never seen this sort of corruption before...
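
To make the range argument concrete, here is a rough userspace sketch of
the kind of sanity check I have in mind. It is not the kernel code: the
function names are made up for illustration, and it assumes the old
(non-CRC) AGFL layout where the free list fills one sector with 32-bit
block numbers, so a 512 byte sector gives 128 slots, matching the
"max = 128" that xfs_repair printed above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of AGFL slots, assuming one 32-bit block
 * number per slot and a free list that occupies exactly one sector.
 */
static uint32_t agfl_size(uint32_t sector_bytes)
{
    return sector_bytes / (uint32_t)sizeof(uint32_t);
}

/*
 * Hypothetical check: are flfirst/fllast/flcount mutually consistent for
 * a circular list with agfl_size(sector_bytes) slots?
 */
static bool agf_freelist_ok(uint32_t flfirst, uint32_t fllast,
                            uint32_t flcount, uint32_t sector_bytes)
{
    uint32_t size = agfl_size(sector_bytes);
    uint32_t active;

    /* Indices must fall inside the list; count cannot exceed its size. */
    if (flfirst >= size || fllast >= size || flcount > size)
        return false;

    if (flcount == 0)
        active = 0;                            /* empty list */
    else if (fllast >= flfirst)
        active = fllast - flfirst + 1;         /* no wrap-around */
    else
        active = size - flfirst + fllast + 1;  /* wrapped past the end */

    return active == flcount;
}
```

With the values from AG 91, agf_freelist_ok(232, 32, 191, 512) fails at
the very first test, because flfirst = 232 is already past the last
valid slot (127), and flcount = 191 is larger than the whole list.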