
Re: help with xfs_repair on 10TB fs

To: Eric Sandeen <sandeen@xxxxxxxxxxx>
Subject: Re: help with xfs_repair on 10TB fs
From: Alberto Accomazzi <aaccomazzi@xxxxxxxxx>
Date: Sat, 17 Jan 2009 13:42:51 -0500
Cc: xfs@xxxxxxxxxxx
In-reply-to: <4972166D.5000006@xxxxxxxxxxx>
References: <adcf4ef70901170913l693376d7s6fd0395e2c88e10@xxxxxxxxxxxxxx> <4972166D.5000006@xxxxxxxxxxx>
On Sat, Jan 17, 2009 at 12:33 PM, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:

> Alberto Accomazzi wrote:
> > I need some help with figuring out how to repair a large XFS
> > filesystem (10TB of data, 100+ million files).  xfs_repair seems to
> > have crapped out before finishing the job and now I'm not sure how to
> > proceed.
> How did it "crap out"?

Well, in the way I described below: it ran for several hours and then died
without completing.  As you can see from the log (which captured both
stdout and stderr), there's nothing indicating what terminated the
program.  And it's definitely not running now.
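One thing I can still check is whether the kernel's OOM killer terminated it; those kills are logged by the kernel rather than on the victim's own stderr, so they wouldn't appear in my capture. Roughly (the sample log line below is fabricated for illustration; the real check is just grepping dmesg or syslog):

```shell
# An OOM kill is reported by the kernel, not by the dying process, so an
# xfs_repair that vanished silently may still have left a trace in:
#
#   dmesg | grep -i -E 'out of memory|killed process'
#   grep -i -E 'out of memory|killed process' /var/log/messages
#
# Demonstrated here against a sample line (fabricated, since the real
# log contents are machine-specific):
sample="Out of memory: kill process 9876 (xfs_repair) score 123 or a child"
if echo "$sample" | grep -i -E -q 'out of memory|killed process'; then
  echo "possible OOM kill"
fi
```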

> the src.rpm from
> http://kojipkgs.fedoraproject.org/packages/xfsprogs/2.10.2/3.fc11/src/

Ok, I guess it's worth giving it a shot.  I assume I don't need to worry
about kernel modules, since xfsprogs doesn't depend on them, right?
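For the record, here's my understanding of the rebuild steps. The exact src.rpm filename and the output RPM path are guesses from the URL you gave and the usual rpmbuild layout, so they may differ on this box; shown as a transcript, not run here:

```shell
# xfsprogs is purely userspace, so no kmod/kernel-module rebuild is
# involved -- just rebuild and upgrade the package.
#
#   wget http://kojipkgs.fedoraproject.org/packages/xfsprogs/2.10.2/3.fc11/src/xfsprogs-2.10.2-3.fc11.src.rpm
#   rpmbuild --rebuild xfsprogs-2.10.2-3.fc11.src.rpm
#   # output path varies by distro; often under /usr/src/redhat/RPMS/<arch>/
#   rpm -Uvh /usr/src/redhat/RPMS/x86_64/xfsprogs-2.10.2-*.x86_64.rpm
#   xfs_repair -V   # confirm the new version is the one on $PATH
```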

> > After bringing the system back, a mount of the fs reported problems:
> >
> > Starting XFS recovery on filesystem: sdb1 (logdev: internal)
> > Filesystem "sdb1": XFS internal error xfs_btree_check_sblock at line 334
> > of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_btree.c.
> > Caller 0xffffffff882fa8d2
> so log replay is failing now; but that indicates an unclean shutdown.
> Something else must have happened between the xfs_repair and this mount
> instance?

Sorry, I wasn't clear: there was indeed an unclean shutdown (actually a
couple), after which the mount would not succeed, presumably because of the
dirty log.  I was able to mount the filesystem read-only and take enough of
a look to see that there was significant corruption of the data.  Running
xfs_repair -L at that point seemed the only option available.  But do let me
know if this line of thinking is incorrect.
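Concretely, this is the sequence I went through (device name from the log excerpts, mount point is mine; shown as a transcript, not run here, since these commands touch a real filesystem and -L is destructive):

```shell
# 1. Inspect without replaying the dirty log (read-only, norecovery):
#
#   mount -o ro,norecovery /dev/sdb1 /mnt
#   ...look around, confirm corruption...
#   umount /mnt
#
# 2. Zero the log and repair; -L discards whatever transactions the log
#    held, so it's a last resort when replay itself keeps failing.
#    Capturing stdout and stderr to a file for later reference:
#
#   xfs_repair -L /dev/sdb1 2>&1 | tee /root/xfs_repair.log
```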

> > So I'm led to believe that xfs_repair died before completing the job.
> >  Should I try again?  Does anyone have an idea why this might have
> > happened?  Is it possible that we still don't have enough memory in
> > the system for xfs_repair to do the job?  Also, it's not clear to me
> > how xfs_repair works.  Assuming we won't be able to get it to complete
> > all of its steps, has it in fact repaired the filesystem somewhat or
> > are all the changes mentioned while it runs not committed to the
> > filesystem until the end of the run?
> I don't see any evidence of it dying in the logs; either it looks like
> it's still progressing, or it's stuck.

It's definitely not running now, so it has died at some point.

> > For lack of better ideas I'm running an xfs_check at the moment.  It's
> > been running for close to an hour and has used almost 29GB of memory
> > so far.  No errors reported.
> xfs_check doesn't actually repair anything, just FWIW.

Right, but I'm hoping to get some clue as to the status of the filesystem at
this point.

> I'd rebuild the srpm I mentioned above and give xfs_repair another shot
> with that newer version, at this point.

Ok, will work on that and report back.  Thank you much for the suggestion.

-- Alberto

