
Re: defrag xfs

To: Sonny Rao <sonny@xxxxxxxxxxx>
Subject: Re: defrag xfs
From: Greg Freemyer <greg.freemyer@xxxxxxxxx>
Date: Fri, 21 Jan 2005 14:50:34 -0500
Cc: Steve Lord <lord@xxxxxxx>, linux-xfs@xxxxxxxxxxx
In-reply-to: <20050121190521.GA15073@xxxxxxxxxxxxxxxxxx>
References: <F62740B0EFCFC74AA6DCF52CD746242D010337FA@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <41F07494.1060501@xxxxxxx> <20050121043237.GA28699@xxxxxxxxxxxxxxxxxx> <1106286413.8580.66.camel@kennedy> <20050121054830.GA29637@xxxxxxxxxxxxxxxxxx> <m18y6n1ede.fsf@xxxxxx> <20050121070221.GA30287@xxxxxxxxxxxxxxxxxx> <41F11609.4020907@xxxxxxx> <20050121190521.GA15073@xxxxxxxxxxxxxxxxxx>
Reply-to: Greg Freemyer <greg.freemyer@xxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Fri, 21 Jan 2005 14:05:21 -0500, Sonny Rao <sonny@xxxxxxxxxxx> wrote:
> On Fri, Jan 21, 2005 at 08:47:37AM -0600, Steve Lord wrote:
> > Sonny Rao wrote:
> <snip>
> > I did describe how to do this once, but I no longer have that email, so
> > I have to recreate it.
> >
> > 1. Add to the kernel the ability to turn off new allocations to an
> >    allocation group. Some special under-the-covers allocations into
> >    the group still need to work, though: freeing up the space in the
> >    allocation group may require btree splits for the free space, and
> >    these must keep working in the interim.
> >
> > 2. Find all the directories with inodes or blocks in the allocation
> >    group - this requires walking all the extents of all the directory
> >    inodes..... so not fast. Note that just because an inode is not in
> >    the last allocation group does not mean it has no disk blocks there.
> >
> > 3. Recreate these directories from user space with a temp name, link all
> >    their contents over to the new directory, switch the names of the
> >    two inodes atomically inside the kernel, remove the old links and
> >    directory. There needs to be logic to detect new files appearing in
> >    the old directory, these need to be renamed to the new parent.
> >
> >    There are now only file blocks and inodes in the allocation group.
> >
> > 4. Repeat the above process with files whose inode or extents are in
> >    the allocation group. If just the inode is there (unlikely), then
> >    no need to move blocks. xfs_fsr contains most of the logic for this.
> >
> > 5. Fix up the superblock counters so that the allocation group count
> >    shrinks. Note this could be applied to several allocation
> >    groups at once.
> >
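
Steve's five steps can be illustrated with a toy in-memory model, sketched here in Python. Everything in it (the `ToyFS` and `Inode` classes, one AG number per extent, an allocator that simply picks the lowest open AG) is a hypothetical simplification, not XFS code; it ignores the btree-split allocations from step 1, on-disk formats and concurrency. What it does show is the ordering (close the AG, move directories, then files, then drop the AG count) and why inode numbers change: a directory, or a file whose inode lives in the AG, gets a new inode elsewhere, while a file with only extents there keeps its number, roughly as xfs_fsr's extent swap would.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Inode:
    ino: int
    ag: int                  # allocation group (AG) holding the inode itself
    extent_ags: list         # AG of each extent (data or directory blocks)
    is_dir: bool = False
    children: dict = field(default_factory=dict)  # name -> inode number (dirs only)


class ToyFS:
    """Hypothetical in-memory model for illustration; not real XFS code."""

    def __init__(self, ag_count):
        self.ag_count = ag_count
        self.no_alloc = set()          # AGs closed to new allocations (step 1)
        self.inodes = {}
        self._inos = count(1)
        self.root = self.create(ag=0, extent_ags=[0], is_dir=True)

    def create(self, ag, extent_ags, is_dir=False):
        ino = next(self._inos)
        self.inodes[ino] = Inode(ino, ag, list(extent_ags), is_dir)
        return ino

    def link(self, parent, name, ino):
        self.inodes[parent].children[name] = ino

    def pick_ag(self):
        # The allocator skips closed AGs; toy policy: lowest open AG.
        for ag in range(self.ag_count):
            if ag not in self.no_alloc:
                return ag
        raise RuntimeError("no allocation group open for allocation")

    def touches(self, ino, ag):
        node = self.inodes[ino]
        return node.ag == ag or ag in node.extent_ags

    def move_out(self, ino, ag):
        """Steps 3 and 4: get an inode and its extents out of `ag`."""
        old = self.inodes[ino]
        target = self.pick_ag()
        extents = [target if a == ag else a for a in old.extent_ags]
        if not old.is_dir and old.ag != ag:
            old.extent_ags = extents   # only blocks move; inode number is kept
            return ino
        # Directories, and files whose inode sits in `ag`, are recreated under
        # a new inode number; their entries are linked over, then every parent
        # entry is switched to the new inode.
        new = self.create(target if old.ag == ag else old.ag, extents, old.is_dir)
        self.inodes[new].children = dict(old.children)
        for node in self.inodes.values():
            for name, child in node.children.items():
                if child == ino:
                    node.children[name] = new
        if self.root == ino:
            self.root = new
        del self.inodes[ino]
        return new

    def shrink_last_ag(self):
        ag = self.ag_count - 1
        self.no_alloc.add(ag)                                        # step 1
        dirs = [i for i, n in self.inodes.items() if n.is_dir and self.touches(i, ag)]
        for ino in dirs:                                             # steps 2-3
            self.move_out(ino, ag)
        files = [i for i, n in self.inodes.items() if not n.is_dir and self.touches(i, ag)]
        for ino in files:                                            # step 4
            self.move_out(ino, ag)
        assert not any(self.touches(i, ag) for i in self.inodes)
        self.no_alloc.discard(ag)
        self.ag_count -= 1                                           # step 5


fs = ToyFS(ag_count=4)
d = fs.create(ag=3, extent_ags=[3], is_dir=True)
fs.link(fs.root, "data", d)
f1 = fs.create(ag=1, extent_ags=[1, 3])   # inode elsewhere, one extent in AG 3
f2 = fs.create(ag=3, extent_ags=[2])      # inode in AG 3, blocks elsewhere
fs.link(d, "a", f1)
fs.link(d, "b", f2)
fs.shrink_last_ag()
print(fs.ag_count)                        # 3
print(fs.inodes[fs.root].children)        # {'data': 5}
```

In the run above, "data" and "b" end up with new inode numbers while "a" keeps its own, which is exactly the NFS and backup problem Andi raised.
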
> > As Andi pointed out, this results in the inode numbers changing, so
> > there is no way to do this while the filesystem is exported. It also
> > probably messes with backups; they would need redoing afterwards.
> >
> > There are several months of effort in this to get it all right and
> > working robustly.
> >
> > Given the low price of storage nowadays, it is a lot cheaper to buy
> > another disk than to pay someone to do this. At current rates for
> > an experienced xfs developer, you are talking about 120 Gbytes/hour
> > at current prices ;-)
> 
> I guess I won't be paying for it anytime soon :)
> 
> > Now, what would be really neat is for a layer underneath the filesystem
> > to dynamically detect failing storage (smart?), take some storage from
> > a free pool of drives, and remap the filesystem blocks out to the new
> > space while it is live.
> 
> Hmm, I would think one might be able to do something like this by
> writing an EVMS/LVM2 plugin which communicated with smartd and could
> begin a migration to another device when an error is detected.  EVMS
> already supports dynamic bad-block-relocation.  In reality it's fairly
> useless since modern drives do this for you anyway and won't report
> bad writes until they have actually run out of extra space.  But what
> you're proposing makes much more sense.
> 
> Thanks for the explanation.
> 
> Sonny
> 
> 
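
The remapping layer Steve describes, and the EVMS/LVM2 plugin Sonny suggests, both come down to an indirection table between filesystem block numbers and physical (device, block) locations. Below is a minimal Python sketch, assuming a hypothetical `RemapLayer` with list-backed toy devices and a pool of spare devices: when some health monitor (smartd, say) flags a device, every block mapped to it is copied to a spare and only then is the table entry switched. A live implementation would also have to stall or mirror writes to a block while it is being copied, and check spare capacity; the sketch does neither. LVM2's pvmove performs a comparable online migration at the physical-extent level.

```python
class RemapLayer:
    """Hypothetical block-remapping layer under a filesystem (toy model)."""

    def __init__(self, devices, spare_pool):
        self.devices = devices              # device name -> list of blocks (toy storage)
        self.spare_pool = list(spare_pool)  # device names not yet in use
        self.table = {}                     # logical block -> (device name, physical block)

    def map_linear(self, dev, nblocks, start=0):
        # Initial layout: logical blocks start.. map straight onto `dev`.
        for i in range(nblocks):
            self.table[start + i] = (dev, i)

    def read(self, lblock):
        dev, pblock = self.table[lblock]
        return self.devices[dev][pblock]

    def write(self, lblock, data):
        dev, pblock = self.table[lblock]
        self.devices[dev][pblock] = data

    def evacuate(self, failing):
        """Called when a monitor reports `failing` as unhealthy."""
        spare = self.spare_pool.pop(0)
        next_free = 0
        for lblock, (dev, pblock) in sorted(self.table.items()):
            if dev != failing:
                continue
            # Copy first, then switch the mapping, so reads never see a hole.
            # (A live version must also stall or mirror writes to this block.)
            self.devices[spare][next_free] = self.devices[dev][pblock]
            self.table[lblock] = (spare, next_free)
            next_free += 1
        return spare


devices = {"sda": ["a0", "a1", "a2"], "sdb": [None, None, None]}
layer = RemapLayer(devices, spare_pool=["sdb"])
layer.map_linear("sda", 3)
layer.evacuate("sda")          # e.g. after smartd flags sda
print(layer.read(1))           # a1
print(layer.table[1])          # ('sdb', 1)
```
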
sg3_utils also has some disk re-mapping capabilities.

sg = SCSI Generic  (I think)

Below is a quote from another mailing list:

=====
Looking at the settings of the "read write error recovery"
mode page on /dev/sdl may be instructive. ['sginfo -e /dev/sdl'
from sg3_utils.] The PER bit seems to be set (otherwise a
recovered error should not have been reported) but the ARRE
and AWRE bits are probably clear. Those bits control the
automatic reassignment of a block when a recovered error
occurs as reported in your case.

Assuming the problem occurred on a read and that the ARRE
bit is clear, then you may want to reassign that block. To
check its current state you might try:
 sg_dd if=/dev/sdl skip=0x25e6e3 of=. bs=512 count=1 blk_sgio=1

If that recovered error persists (or worse) rather than formatting
the disk, reassigning that block is more surgical. sg_reassign has
been added to sg3_utils recently (v1.12 beta at www.torque.net/sg)
to do this. In your case:
 sg_reassign -a 0x25e6e3 /dev/sdl

If successful the replaced sector should go into the
"grown" defect list ('sginfo -G /dev/sdl'). This utility
may be worth trying before and after the sg_reassign.

Another way to accomplish the same thing is to set
the ARRE bit (and the AWRE while you are at it) and do
another read of that block. The reported additional
sense message should change to something like "Recovered
data: data auto-reallocated". Reading the whole disk
might be wise (to see if that lba was a lone case).

More generally this is not a good sign concerning the
health of that disk. No data has been lost _yet_ but it
had to work hard to recover it. Any entries in the "grown"
defect list are not a good sign. Also with smartmontools
you might like to try 'smartctl -a /dev/sdl' and examine
the "Error counter log" and compare that with some of your
other drives that are not reporting problems. A long
self test may also be appropriate: 'smartctl -t long /dev/sdl'.

Doug Gilbert
=====
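
The PER, ARRE and AWRE bits Doug mentions live in byte 2 of the Read-Write Error Recovery mode page (page code 0x01). A small Python sketch that decodes that byte and computes its value with ARRE and AWRE set; the example page bytes are made up, and in practice 'sginfo -e' already shows these fields. Writing the modified page back to the drive is not covered here.

```python
# Byte 2 of the Read-Write Error Recovery mode page (page code 0x01), per SBC.
RW_RECOVERY_BITS = {
    "AWRE": 0x80,  # automatic write reallocation enabled
    "ARRE": 0x40,  # automatic read reallocation enabled
    "TB": 0x20,    # transfer block
    "RC": 0x10,    # read continuous
    "EER": 0x08,   # enable early recovery
    "PER": 0x04,   # post error: report recovered errors
    "DTE": 0x02,   # disable transfer on error
    "DCR": 0x01,   # disable correction
}


def decode(byte2):
    """Return {bit name: bool} for the flags byte."""
    return {name: bool(byte2 & mask) for name, mask in RW_RECOVERY_BITS.items()}


def rw_recovery_flags(page):
    """`page` holds the raw mode page, starting at its page-code byte."""
    if (page[0] & 0x3F) != 0x01:   # low 6 bits are the page code
        raise ValueError("not a Read-Write Error Recovery mode page")
    return decode(page[2])


def enable_auto_reallocation(byte2):
    """Set ARRE and AWRE, leaving the other bits untouched."""
    return byte2 | RW_RECOVERY_BITS["ARRE"] | RW_RECOVERY_BITS["AWRE"]


# Made-up page matching Doug's guess: PER set, ARRE and AWRE clear.
page = bytes([0x01, 0x0A, 0x04, 0x0B, 0, 0, 0, 0, 0x0B, 0, 0, 0])
flags = rw_recovery_flags(page)
print(flags["PER"], flags["ARRE"], flags["AWRE"])   # True False False
print(hex(enable_auto_reallocation(page[2])))      # 0xc4
```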

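In the sg_dd command above, skip= counts blocks of size bs, as with dd, so skip=0x25e6e3 with bs=512 addresses LBA 0x25e6e3 only if the disk's logical block size really is 512 bytes (an assumption here). The arithmetic, as a short Python sketch:

```python
LBA = 0x25E6E3       # the block from Doug's sg_dd example
BLOCK_SIZE = 512     # bs=512; assumed to equal the disk's logical block size


def byte_offset(lba, block_size=BLOCK_SIZE):
    """skip= counts blocks of size bs, so the byte offset is lba * bs."""
    return lba * block_size


print(LBA)                    # 2483939
print(byte_offset(LBA))       # 1271776768
print(hex(byte_offset(LBA)))  # 0x4bcdc600
```
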
Greg
-- 
Greg Freemyer

