help with xfs_repair on 10TB fs
Eric Sandeen
sandeen at sandeen.net
Sat Jan 17 11:33:33 CST 2009
Alberto Accomazzi wrote:
> I need some help with figuring out how to repair a large XFS
> filesystem (10TB of data, 100+ million files). xfs_repair seems to
> have crapped out before finishing the job and now I'm not sure how to
> proceed.
>
> The system is a CentOS 5.2 storage server with a 3ware controller and
> 16 x 1TB drives, 32GB RAM and 64GB swap. After clearing the issues
> with bad blocks on the disks, yesterday we set out to fix the
> filesystem. This is the list of relevant packages that yum reports
> installed:
>
> kmod-xfs.x86_64         0.4-1.2.6.18_53.1.14.e   installed
> kmod-xfs.x86_64         0.4-2                    installed
> kmod-xfs.x86_64         0.4-1.2.6.18_92.1.10.e   installed
> xfsdump.x86_64          2.2.46-1.el5.centos      installed
> xfsprogs.x86_64         2.9.4-1.el5.centos       installed
> xfsprogs-devel.x86_64   2.9.4-1.el5.centos       installed
> kernel.x86_64           2.6.18-92.1.13.el5.cen   installed
How did it "crap out?"
You could pretty easily run the very latest xfsprogs here by rebuilding
the src.rpm from
http://kojipkgs.fedoraproject.org/packages/xfsprogs/2.10.2/3.fc11/src/
> After bringing the system back, a mount of the fs reported problems:
>
> Starting XFS recovery on filesystem: sdb1 (logdev: internal)
> Filesystem "sdb1": XFS internal error xfs_btree_check_sblock at line 334 of file
> /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_btree.c. Caller 0xffffffff882fa8d2
So log replay is failing now, but a log that needs replay indicates an
unclean shutdown. Did something else happen between the xfs_repair run
and this mount instance?
> Call Trace:
> [<ffffffff882eacc9>] :xfs:xfs_btree_check_sblock+0xbc/0xcb
> .....
>
> An xfs_check on the device suggests how to solve the problem:
>
> alberto at adsduo-54: sudo xfs_check /dev/sdb1
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed. Mount the filesystem to replay the log, and unmount it before
> re-running xfs_check. If you are unable to mount the filesystem, then use
> the xfs_repair -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
That just means that you have a dirty log.
> xfs_info reports the following for the filesystem:
>
> meta-data=/dev/sdb1              isize=256    agcount=32, agsize=98361855 blks
>          =                       sectsz=512   attr=0
> data     =                       bsize=4096   blocks=3147579360, imaxpct=25
>          =                       sunit=0      swidth=0 blks, unwritten=1
> naming   =version 2              bsize=4096
> log      =internal               bsize=4096   blocks=32768, version=1
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
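As a quick sanity check on the geometry quoted above, the numbers can be multiplied out. This is only an arithmetic sketch using the values copied from the xfs_info output; the constant and function names are made up for illustration. It suggests the data section is roughly 12.9 TB (about 11.7 TiB), that agcount times agsize accounts for every data block, and that the internal log is 128 MiB.

```python
# Values copied from the xfs_info output quoted above.
BSIZE = 4096              # filesystem block size, in bytes
AGCOUNT = 32              # number of allocation groups
AGSIZE_BLKS = 98361855    # blocks per allocation group
DATA_BLOCKS = 3147579360  # total data blocks
LOG_BLOCKS = 32768        # internal log size, in blocks


def summarize():
    """Return derived sizes for the filesystem described above."""
    total_bytes = DATA_BLOCKS * BSIZE
    return {
        # True if the allocation groups exactly cover the data section.
        "ag_blocks_match": AGCOUNT * AGSIZE_BLKS == DATA_BLOCKS,
        "total_bytes": total_bytes,
        "total_tib": total_bytes / 2**40,
        "ag_gib": AGSIZE_BLKS * BSIZE / 2**30,
        "log_mib": LOG_BLOCKS * BSIZE / 2**20,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(name, value)
```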
> So last night I started an "xfs_repair -L" on the device, which
> proceeded through step 6 before quitting at some point in the middle
> of the night without giving me many clues as to what went wrong. I
> know that this process uses a ton of memory so we loaded the server
> with 32GB of RAM (the swap file is 64GB) and before going to sleep I
> noticed that the xfs_repair was using about 24GB of RAM. I put the
> complete log of xfs_repair online at:
> http://www.cfa.harvard.edu/~alberto/ads/xfs_repair.log
wow, that's messy
> bad hash table for directory inode 58134992 (no data entry): rebuilding
> rebuilding directory inode 58134992
> rebuilding directory inode 58345355
> rebuilding directory inode 60221905
>
> So I'm led to believe that xfs_repair died before completing the job.
> Should I try again? Does anyone have an idea why this might have
> happened? Is it possible that we still don't have enough memory in
> the system for xfs_repair to do the job? Also, it's not clear to me
> how xfs_repair works. Assuming we won't be able to get it to complete
> all of its steps, has it in fact repaired the filesystem somewhat or
> are all the changes mentioned while it runs not committed to the
> filesystem until the end of the run?
I don't see any evidence of it dying in the log; it looks like it's
either still progressing or stuck.
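One rough way to tell a slow process from a stuck one is to sample its CPU time and resident memory a few times and see whether either moves. The sketch below assumes a Linux /proc filesystem and a Python 3 interpreter on the machine; the helper names (cpu_ticks, rss_kb, watch) are hypothetical and not part of xfsprogs. Note that a process waiting on disk I/O can show flat CPU time for a while without being hung, so treat the result as a hint only.

```python
import time


def cpu_ticks(stat_text):
    """Return utime + stime (clock ticks) from the text of /proc/<pid>/stat."""
    # The command name is in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def sample(pid):
    """Read one (cpu_ticks, rss_kb) sample for the given pid."""
    with open("/proc/%d/stat" % pid) as f:
        ticks = cpu_ticks(f.read())
    with open("/proc/%d/status" % pid) as f:
        rss = rss_kb(f.read())
    return ticks, rss


def watch(pid, interval=60, count=5):
    """Print CPU and memory deltas between successive samples."""
    prev = sample(pid)
    for _ in range(count):
        time.sleep(interval)
        cur = sample(pid)
        print("cpu ticks +%d, rss %s kB" % (cur[0] - prev[0], cur[1]))
        prev = cur
```

Calling watch() with the pid of the running xfs_repair (found with ps, for example) would print one line per interval; a run of zero CPU deltas with unchanged memory is at least consistent with a stuck process.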
> For lack of better ideas I'm running an xfs_check at the moment. It's
> been running for close to an hour and has used almost 29GB of memory
> so far. No errors reported.
xfs_check doesn't actually repair anything, just FWIW.
I'd rebuild the srpm I mentioned above and give xfs_repair another shot
with that newer version, at this point.
-Eric
> TIA,
>
> -- Alberto