
Re: 10minutes for rm -rf on 400MB

To: Dirk Wetter <dirkw@xxxxxxxxxx>
Subject: Re: 10minutes for rm -rf on 400MB
From: Steve Lord <lord@xxxxxxx>
Date: Mon, 10 Sep 2001 10:13:57 -0500
Cc: "Bernhard R. Erdmann" <be@xxxxxxxxxxx>, Linux XFS Mailing List <linux-xfs@xxxxxxxxxxx>
In-reply-to: Message from Dirk Wetter <dirkw@rentec.com> of "Sun, 09 Sep 2001 20:42:51 EDT." <3B9C0C8B.2070201@rentec.com>
Sender: owner-linux-xfs@xxxxxxxxxxx
> hi,
> Bernhard R. Erdmann wrote:
> >>we've been running XFS on the data disks of our HPC Linux cluster for
> >>a while. we are quite happy with xfs, thx guys for your work!
> >>the setup is:
> >>
> >>- dual >=1GHz box, 4GB mem
> >>- lvm 0.9beta7, phys. volume size ~140 GB, logical vol for xfs: 100GB
> >>- no additional mount options or options for mkfs.xfs were given
> >>- kernel 2.4.8pre4-xfs, highly patched SuSE 7.0 (not that it should matter)
> >>
> >>a user complained that "rm -rf of 400MB" takes ~10 minutes (!) until the
> >>command returns, whereas on our systems with reiserfs it takes
> >>seconds.
> >>
> >
> >Some very important data is missing:
> >- what's the I/O performance of the disk subsystem?
> >
> why is that relevant? with reiserfs it takes seconds, so the
> disks/controller cannot be the bottleneck.

This is because of a fundamental difference between reiserfs and XFS
operation. reiserfs flushes its log to disk only periodically, so the log
could in theory grow very large in memory; a crash will basically undo
everything that is still in the in-memory log. In fact, I have seen reiserfs
create and remove 30000 files without doing any disk I/O at all.

On the other hand, XFS has a small in-memory log (64K by default); when this
is full it must go to disk. The in-memory log is built of fixed-size 32K
buffers, and you can add more with the mount option logbufs=x, where x is
the number of buffers allowed (8 is the maximum). This translates into the
number of log writes which can be in transit at any one time. This small log
means that the amount of metadata update which can be lost in a crash is
fairly small.

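As a rough illustration of these numbers, here is a minimal Python sketch. It assumes, as stated above, 32K in-memory log buffers with a maximum of 8, and reads the 64K default as 2 buffers. The function names and the idea of counting "rounds" of buffer writes are hypothetical simplifications, not a description of how XFS actually schedules its log I/O.

```python
import math

LOG_BUFFER_BYTES = 32 * 1024   # fixed in-memory log buffer size, per the post
MAX_LOGBUFS = 8                # maximum value of the logbufs=x mount option

def in_memory_log_bytes(logbufs: int) -> int:
    """Total in-memory log space for a given logbufs setting."""
    # Upper bound from the post; the lower bound is only a sanity check.
    if not 1 <= logbufs <= MAX_LOGBUFS:
        raise ValueError(f"logbufs must be between 1 and {MAX_LOGBUFS}")
    return logbufs * LOG_BUFFER_BYTES

def buffer_writes(log_bytes: int) -> int:
    """Number of full or partial 32K buffers that must be written to disk."""
    return math.ceil(log_bytes / LOG_BUFFER_BYTES)

def write_rounds(log_bytes: int, logbufs: int) -> int:
    """Simplified model: at most `logbufs` buffer writes are in transit at
    once, so the writes proceed in this many batches."""
    in_memory_log_bytes(logbufs)  # validate the setting
    return math.ceil(buffer_writes(log_bytes) / logbufs)

# Hypothetical workload: 1,000,000 bytes of log records from a large rm -rf.
for bufs in (2, 8):
    print(bufs, in_memory_log_bytes(bufs), write_rounds(1_000_000, bufs))
```

In this toy model, going from the default 2 buffers to 8 cuts the number of write batches for the same log traffic by roughly a factor of four.
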
To compound this issue, some transactions in XFS are always synchronous,
freeing space being one of them. The reason for this is complex: say you
free some metadata, it then gets reallocated as data, and the data gets
flushed out to disk. If you crash at this point and the removal of the
space was in a transaction which did not make it out to disk, then you
end up with a filesystem which has data in an active metadata block.
In order to avoid this situation, XFS flushes the free space operation to
disk immediately, an expensive operation which is being done to deal with
a very rare set of events. The 'obvious' fix is to not reuse this freed
space until its transaction is on disk, or to not flush the reallocation
to disk until the transaction is on disk. This fix has been on the TODO
list around here for quite a long time - but there is always higher priority
work to do.
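
The first variant of the 'obvious' fix can be modeled in a few lines. This is a hypothetical Python sketch, not XFS code: a freed block is remembered as "busy" together with the log sequence number (LSN) of the transaction that freed it, and the allocator skips busy blocks until the log has been written up to that LSN. The class and method names are invented for illustration, and the second variant (holding back the reallocated data) is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass
class BusyExtentTracker:
    """Hypothetical model: freed blocks cannot be reused until the
    transaction that freed them is known to be on disk."""
    on_disk_lsn: int = 0                                 # highest LSN written to the on-disk log
    next_lsn: int = 1                                    # LSN for the next transaction
    busy: Dict[int, int] = field(default_factory=dict)   # block -> LSN of freeing transaction
    free_blocks: Set[int] = field(default_factory=set)   # blocks marked free in memory

    def free(self, block: int) -> int:
        """Free a block in a new transaction without forcing the log to disk."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.free_blocks.add(block)
        self.busy[block] = lsn
        return lsn

    def log_written(self, lsn: int) -> None:
        """Record that the log is on disk up to `lsn`; those frees become safe."""
        self.on_disk_lsn = max(self.on_disk_lsn, lsn)
        self.busy = {b: l for b, l in self.busy.items() if l > self.on_disk_lsn}

    def allocate(self) -> Optional[int]:
        """Return the lowest free block whose freeing transaction is on disk."""
        for block in sorted(self.free_blocks):
            if block not in self.busy:
                self.free_blocks.remove(block)
                return block
        return None  # only busy blocks left: caller must wait or force the log

t = BusyExtentTracker()
lsn = t.free(42)          # the free is logged asynchronously
print(t.allocate())       # block 42 is still busy -> None
t.log_written(lsn)        # the freeing transaction reaches disk
print(t.allocate())       # now block 42 can be handed out -> 42
```

In this model the expensive log flush is no longer paid on every free; it is only needed in the rare case where the allocator finds nothing but busy blocks.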

Having said that, all is not right with your system. I have a dual-CPU 450 MHz
PIII using a 7200rpm SCSI drive on an Adaptec 7896 controller. On a single
partition on this machine I just created 3 complete copies of the XFS tree
(kernel and commands); this consisted of 34549 files occupying 500 Mbytes of
disk space. To clear the cache and force reads from disk during the remove,
I unmounted and remounted the filesystem before the removal. It took 65.373
seconds to remove the whole directory tree with a single rm -r -f.
Note, however, that this filesystem was mounted with 8 log buffers and has
a 16384-block log.
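
For comparison with the 10-minute report, the figures above work out as follows. This small Python calculation assumes a 4096-byte filesystem block size when converting the 16384-block log to bytes; the block size is not stated above.

```python
FILES_REMOVED = 34549      # files in the three copies of the XFS tree
ELAPSED_S = 65.373         # wall-clock time of rm -r -f, in seconds
LOG_BLOCKS = 16384         # on-disk log size given to mkfs (-l size=16384b)
BLOCK_SIZE = 4096          # assumption: 4 KiB filesystem blocks

removal_rate = FILES_REMOVED / ELAPSED_S          # files per second
log_bytes = LOG_BLOCKS * BLOCK_SIZE               # on-disk log size in bytes

print(f"{removal_rate:.1f} files/s")              # about 528.5 files/s
print(f"log size: {log_bytes // 2**20} MiB")      # 64 MiB with 4 KiB blocks
```

The on-disk log set by mkfs is a separate quantity from the 32K in-memory log buffers controlled by logbufs.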

The mkfs option for creating a bigger log is:

        mkfs -t xfs -f -l size=16384b /dev/xxx
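
The mkfs option sets the on-disk log size, while the number of in-memory log buffers is the separate logbufs mount option. The Python sketch below only assembles the two command lines without running them. /dev/xxx is kept as the placeholder device, the mount point /mnt/test is a hypothetical example, and the helper names are invented; you would pass each list to subprocess.run() yourself (as root) to actually execute them.

```python
import shlex
from typing import List

def mkfs_cmd(device: str, log_blocks: int = 16384) -> List[str]:
    """mkfs command line creating XFS with a larger on-disk log (size in blocks)."""
    return ["mkfs", "-t", "xfs", "-f", "-l", f"size={log_blocks}b", device]

def mount_cmd(device: str, mountpoint: str, logbufs: int = 8) -> List[str]:
    """mount command line requesting `logbufs` in-memory log buffers."""
    # 8 is the maximum; the lower bound is only a sanity check.
    if not 1 <= logbufs <= 8:
        raise ValueError("logbufs must be between 1 and 8")
    return ["mount", "-t", "xfs", "-o", f"logbufs={logbufs}", device, mountpoint]

# Print the commands instead of running them; /dev/xxx is a placeholder.
print(shlex.join(mkfs_cmd("/dev/xxx")))
print(shlex.join(mount_cmd("/dev/xxx", "/mnt/test")))
```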

