At work, we have a huge 5.6 TB SCSI RAID hooked up to a SuperMicro
6012P-6 1U server, which exports the RAID over NFS to more than 100
machines. The entire RAID is configured as a single XFS filesystem.
Because the RAID is over 2 TB, and the SCSI implementation in the
RAID unit doesn't support the newer READ16/WRITE16 SCSI commands, the
unit presents a sector size of 2048 bytes. (The older 10-byte
commands only have a 32-bit LBA, so with 512-byte sectors they top
out at 2 TB, while 2048-byte sectors allow up to 8 TB.) Also, because
the partition tables seem to assume a sector size of 512 bytes, we
are running XFS directly on the block device /dev/sdc, _NOT_
/dev/sdc1. We are running a SuSE 9.2 system with a 2.6.11.7 kernel.
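(For anyone reproducing the details, something like the following
should confirm the sector size and XFS geometry; /export is just a
placeholder for our actual mount point.)

    # report the logical sector size the array presents (should be 2048)
    blockdev --getss /dev/sdc

    # dump the XFS geometry (block size, AG count, etc.);
    # /export stands in for our actual mount point
    xfs_info /export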
Anyhow, we have been having some hang (not panic) issues with the
SuperMicro, and just to make sure it wasn't an XFS problem, we
decided to check the XFS filesystem for errors. We first ran
xfs_check, but it immediately fails with an "out of memory" error,
even though we added oodles of swap space and made sure ulimit
allowed unlimited data and stack space. And since xfs_check is just a
shell script wrapper around xfs_db, running xfs_db directly hits the
same problem. :-( What are my options for fixing this?
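For reference, the extra swap and limits were set up roughly like
this (the swap file path and size are only illustrative):

    # raise the per-process data and stack limits in the shell that
    # will run xfs_check
    ulimit -d unlimited
    ulimit -s unlimited

    # add a temporary 8 GB swap file (path and size are placeholders)
    dd if=/dev/zero of=/var/tmp/extra_swap bs=1M count=8192
    mkswap /var/tmp/extra_swap
    swapon /var/tmp/extra_swap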
So, we next ran xfs_repair -n, and it reported the following errors:
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
LEAFN node level is 1 inode 1074431615 bno = 8388608
LEAFN node level is 1 inode 1074431632 bno = 8388608
- agno = 2
LEAFN node level is 1 inode 2148178682 bno = 8388608
LEAFN node level is 1 inode 2148183066 bno = 8388608
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- process newly discovered inodes...
and it repeated the LEAFN warnings in Phase 4. A quick check of the
xfs_repair source shows that xfs_repair only reports this situation
and does nothing about it. We unmounted the filesystem and ran
xfs_repair in full repair mode anyway, hoping that Phase 5 (which is
skipped in -n mode) would fix the problem, but it doesn't.
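The full repair pass was basically just the following, with /export
again standing in for our actual NFS-exported mount point:

    # take the filesystem offline, repair it for real, then remount
    umount /export
    xfs_repair /dev/sdc
    mount /export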
So, I checked the xfs mailing list archives, and this has been
mentioned a few times before, but no definite answers were given.
However, I did notice that in EVERY ONE of the instances where this
message was printed, the bno was equal to 8388608. Here is a link to
the xfs mailing list archive search that shows the 6 messages
containing the LEAFN error message:
<http://oss.sgi.com/cgi-bin/namazu.cgi?query=LEAFN&submit=Search%21&idxname=linux-xfs&max=20&result=normal&sort=score>
Decimal 8388608 equals hex 0x800000 (i.e. 2^23). This seems very
suspicious. Is this a real problem, or just a mild annoyance?
BTW, we were able to get rid of the error messages by using
"find . -inum ..." to locate the offending inodes. They were all
directories. We moved each one aside, created a new directory in its
place, moved the contents into the new directory, and deleted the old
one.
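For one of the inodes, the workaround looked roughly like this (the
inode number is taken from the report above; the "baddir" names are
just placeholders):

    # locate the directory that owns the reported inode
    find . -inum 1074431615

    # rebuild it: move it aside, recreate it, move the contents over
    # (hidden files, if any, need moving too), then delete the old one
    mv baddir baddir.old
    mkdir baddir
    mv baddir.old/* baddir/
    rmdir baddir.old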
Oh, just in case anyone was wondering, xfs_repair took about 45
minutes on our 5.6 TB RAID, which is 81% full, and running
"find . -inum ..." took about 50 minutes.
Many thanks in advance for any help or insight.
Ken Sumrall
ksumrall@xxxxxxxxxxx