
help with xfs_repair on 10TB fs

To: xfs@xxxxxxxxxxx
Subject: help with xfs_repair on 10TB fs
From: Alberto Accomazzi <aaccomazzi@xxxxxxxxx>
Date: Sat, 17 Jan 2009 12:13:26 -0500
I need some help figuring out how to repair a large XFS
filesystem (10TB of data, 100+ million files).  xfs_repair seems to
have died before finishing the job, and now I'm not sure how to
proceed.

The system is a CentOS 5.2 storage server with a 3ware controller and
16 x 1TB drives, 32GB RAM and 64GB swap.  After clearing the issues
with bad blocks on the disks, yesterday we set out to fix the
filesystem.  This is the list of relevant packages that yum reports
installed:

kmod-xfs.x86_64                          0.4-1.2.6.18_53.1.14.e installed
kmod-xfs.x86_64                          0.4-2                  installed
kmod-xfs.x86_64                          0.4-1.2.6.18_92.1.10.e installed
xfsdump.x86_64                           2.2.46-1.el5.centos    installed
xfsprogs.x86_64                          2.9.4-1.el5.centos     installed
xfsprogs-devel.x86_64                    2.9.4-1.el5.centos     installed
kernel.x86_64                            2.6.18-92.1.13.el5.cen installed

After bringing the system back up, an attempt to mount the filesystem
reported problems:

Starting XFS recovery on filesystem: sdb1 (logdev: internal)
Filesystem "sdb1": XFS internal error xfs_btree_check_sblock at line 334 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_btree.c.  Caller 0xffffffff882fa8d2

Call Trace:
 [<ffffffff882eacc9>] :xfs:xfs_btree_check_sblock+0xbc/0xcb
 .....

An xfs_check on the device suggests how to solve the problem:

alberto@adsduo-54: sudo xfs_check /dev/sdb1
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
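
For reference, the order of operations that error message recommends can be
sketched as a small script.  The device name is from this report; the mount
point /mnt/recovery is just a placeholder, and the -L branch is the
destructive last resort the message warns about:

```shell
#!/bin/sh
# Sketch of the recovery sequence xfs_check suggests.
# /dev/sdb1 is from this report; /mnt/recovery is a placeholder.
DEV=/dev/sdb1
MNT=/mnt/recovery

# Preferred path: mounting replays the log; unmount before re-checking.
if mount -t xfs "$DEV" "$MNT"; then
    umount "$MNT"
    xfs_check "$DEV"
else
    # Last resort: -L zeroes the log, discarding any unreplayed
    # transactions, which can itself introduce corruption.
    xfs_repair -L "$DEV"
fi
```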

xfs_info reports the following for the filesystem:

meta-data=/dev/sdb1              isize=256    agcount=32, agsize=98361855 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3147579360, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
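
As a sanity check on the sizes involved (my arithmetic, not from the
xfs_info output itself): the data section above, blocks=3147579360 at
bsize=4096, works out to roughly 12.9 TB, consistent with a 10TB-class
filesystem:

```shell
# Filesystem size implied by xfs_info: data blocks * block size.
blocks=3147579360
bsize=4096
bytes=$((blocks * bsize))
echo "$bytes bytes"                     # prints "12892485058560 bytes"
echo "$((bytes / 1000000000000)) TB"    # ~12 TB (decimal units)
echo "$((bytes / 1099511627776)) TiB"   # ~11 TiB (binary units)
```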

So last night I started an "xfs_repair -L" on the device, which
proceeded through phase 6 before quitting at some point in the middle
of the night without giving me many clues as to what went wrong.  I
know that this process uses a ton of memory, so we loaded the server
with 32GB of RAM (the swap file is 64GB), and before going to sleep I
noticed that xfs_repair was using about 24GB of RAM.  I put the
complete log of xfs_repair online at:
http://www.cfa.harvard.edu/~alberto/ads/xfs_repair.log

The log ends with:

bad hash table for directory inode 58134992 (no data entry): rebuilding
rebuilding directory inode 58134992
rebuilding directory inode 58345355
rebuilding directory inode 60221905

So I'm led to believe that xfs_repair died before completing the job.
Should I try again?  Does anyone have an idea why this might have
happened?  Is it possible that we still don't have enough memory in
the system for xfs_repair to do the job?  Also, it's not clear to me
how xfs_repair works.  Assuming we won't be able to get it to complete
all of its steps, has it in fact repaired the filesystem somewhat, or
are the changes it reports while running not committed to the
filesystem until the end of the run?

For lack of better ideas I'm running an xfs_check at the moment.  It's
been running for close to an hour and has used almost 29GB of memory
so far.  No errors reported.

TIA,

-- Alberto
