To: linux-xfs@xxxxxxxxxxx
Subject: 2.4.18 XFS 1.1 : Gave up on XFS - too many Oops
From: Poul Petersen <petersp@xxxxxxxxxxxxx>
Date: Fri, 19 Jul 2002 12:07:55 -0700
Sender: owner-linux-xfs@xxxxxxxxxxx

        Since we started using XFS over a year ago, we have had periodic
problems. At the worst, our server would die perhaps twice a week with a
kswapd oops or other null-pointer errors. These problems persisted, though
they changed slightly (I don't remember the whole history now, but I've
posted here several times), as we went from XFS 1.0.1 to 1.0.2 to 1.1, and
from kernel 2.4.5 to 2.4.14 to 2.4.18 (I think that is right; the only
combinations I remember for sure are 2.4.14+XFS-1.0.2 and 2.4.18+XFS-1.1).
The problems definitely seemed to intensify over time, as if the
file-systems had become hopelessly corrupt. Indeed, running xfs_repair
after each crash would extend the uptime between crashes to a week,
perhaps 10 days.
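
        (For anyone curious, the post-crash recovery was just the standard
unmount/repair/remount cycle - roughly the following, where the device and
mount point are placeholders, not our actual layout:)

    # Placeholder device and mount point - not our real SAN layout.
    umount /mnt/data
    xfs_repair -n /dev/sda5     # dry run first: report damage only
    xfs_repair /dev/sda5        # then actually repair the filesystem
    mount -t xfs /dev/sda5 /mnt/data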

        We also tried different hardware - a total of three different
machines. We also tried to duplicate the problem: we kept the original
file-server OS disk with 2.4.5+XFS-1.0.1, which had been horribly unstable,
and installed it in identical hardware: CPU, SCSI controllers, SAN
controller, SAN disks, network cards, etc. I then wrote a script which
tried to duplicate our typical usage (a continuous build) and stressed the
machine from 12 nodes for well over a week (pushing about 30 MB/s - yup,
gigabit Ethernet). I also added a continuous backup, and I even threw in
some memory stress tests as well as Bonnie. This configuration *never*
crashed. I can only conclude that there is something about the real usage
of our file systems that exposes a flaw in XFS, but I have no idea what it
could be.
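
        (For reference, the per-node load script was roughly along the
lines below. The paths are placeholders, and the real script also rotated
in the backup and memory tests, so treat this as a sketch rather than the
exact script we ran:)

    #!/bin/sh
    # Sketch of the per-node stress loop: repeated clean builds over
    # the NFS mount plus a Bonnie pass. Paths are illustrative only.
    while true; do
        cd /mnt/xfstest/build || exit 1
        make clean && make                    # continuous build load
        bonnie -d /mnt/xfstest/tmp -s 1000    # Bonnie I/O pass, 1000 MB
    done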

        I do feel that it is *not* a hardware problem. I state this because
we have since migrated all of our file-systems from XFS to ext3 on the same
hardware (about 1TB of data). While I was going through the hassle of
moving data, I decided to add LVM as well (highly recommended to anyone
considering it - way cool). This configuration, though more complicated,
has been up solidly for about a month now - a period in which our XFS-based
file server would have crashed perhaps 3 or 4 times.
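
        (For anyone planning a similar move, the LVM side of the migration
amounted to the usual create-and-mkfs sequence. The device, volume names,
and sizes below are made up for illustration:)

    # Illustrative device, names, and sizes - adjust for your own layout.
    pvcreate /dev/sdb1                    # prepare the SAN disk for LVM
    vgcreate vg_data /dev/sdb1            # build a volume group on it
    lvcreate -L 200G -n lv_home vg_data   # carve out a logical volume
    mke2fs -j /dev/vg_data/lv_home        # -j gives ext3 (journalled)
    mount /dev/vg_data/lv_home /home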

        I debated not sending this message, since it might sound like I am
complaining. I'm not. We've given up on XFS simply because we can't
duplicate the problem in a controlled environment, so we don't feel we're
ever going to be able to fix it. As this is a production machine, we simply
can't have it crash every week. I felt it was important to let everyone
know that there may be a significant bug here, though obviously an obscure
one.

-poul

Last XFS config:

Dell 2550 Dual P-III (Coppermine) 933 MHz
1 GB Ram
1 x 9 GB internal disk, aic7xxx driver (kernel)
Red Hat 7.2
kernel 2.4.18 + XFS 1.1 (and XFS tools)
nfs-utils-0.3.3
QLogic 2200 SAN adapter, firmware 2.02.01, running QLogic driver 6.0b20
Intel PRO/1000 Gig-Fiber running Intel e1000 driver 4.0.7
Zzyzx RocketStor 2000 Raid Head
QLogic SAN box (isolated hard segment between the file-server and the Zzyzx)

