
Re: debugging xfs issue (maybe)

To: Nathan Scott <nathans@xxxxxxx>
Subject: Re: debugging xfs issue (maybe)
From: Steve Wray <stevew@xxxxxxxxxxxxxxx>
Date: Thu, 8 Jul 2004 13:21:23 +1200
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <20040708111416.E1946083@xxxxxxxxxxxxxxxxxxxxxxxx>
References: <200407081154.25556.stevew@xxxxxxxxxxxxxxx> <200407081247.11376.stevew@xxxxxxxxxxxxxxx> <20040708111416.E1946083@xxxxxxxxxxxxxxxxxxxxxxxx>
Reply-to: stevew@xxxxxxxxxxxxxxx
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: KMail/1.6.1

On Thursday 08 July 2004 13:14, Nathan Scott wrote:
> On Thu, Jul 08, 2004 at 12:47:01PM +1200, Steve Wray wrote:
> > Well... the first thing we notice is that a nightly cron job
> > returns things like;
> >
> > /etc/cron.daily/standard:
> > find: /var/lib/postgres/data/base/141159285/141159709: Unknown error 990
> > find: /var/lib/postgres/data/base/141159285/141159709: Unknown error 990
> > find: /var/lib/postgres/data/base/141159285/141159709: Unknown error 990
> >
> > and we look in dmesg and see things like;
> >
> > Filesystem "dm-0": corrupt dinode 67279401, extent total = 1025,
> > nblocks = 1.  Unmount and run xfs_repair.
>
> Hmm, that line is quite interesting - we shut down the filesystem
> because that number of blocks is impossible with that number of
> extents.
>
> What I suspect you have there is a very fragmented file (with 1025
> extents) - this puts a lot of strain on the memory allocator, since
> we need to allocate contiguous pages to hold that information.  In
> most kernels until very recently, these allocations could fail in
> very low-memory situations and cause incore memory corruption - I
> wonder if that's what you have been seeing here.

This machine has 1G of RAM and, looking at its memory profile over the
last few months, it's not been pushing it...
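
Just so I'm sure I follow the check that is tripping: the message reports
1025 extents but only 1 block, and since every extent has to map at least
one block, that combination can never be valid, so XFS shuts the
filesystem down. Something along the lines of the sketch below, I assume
(my own illustration of the idea; the struct and function names are made
up and are not the real XFS identifiers):

    /* Sketch of the dinode sanity check described above: an inode
     * cannot reference more extents than it owns blocks, so a dinode
     * claiming 1025 extents with nblocks = 1 must be corrupt.
     */
    struct dinode_summary {
        unsigned long long ino;      /* inode number, e.g. 67279401   */
        unsigned int       nextents; /* extent total from dmesg       */
        unsigned long long nblocks;  /* blocks allocated to the inode */
    };

    static int dinode_extents_look_sane(const struct dinode_summary *dip)
    {
        /* each extent maps at least one filesystem block */
        if (dip->nextents > dip->nblocks)
            return 0;   /* corrupt: complain and ask for xfs_repair */
        return 1;
    }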


> FWIW, the current -bk tree has a memory allocator which will not
> fail the above memory-holding-inode-extents allocation.

OK, well, it's worth a shot; we'll see about (yet another) new kernel
:)
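
If I'm reading that right, the idea is that instead of the multi-page
allocation for the in-core extent list failing under memory pressure, it
now just keeps retrying until the VM can satisfy it. Roughly the shape of
the sketch below, I imagine (my guess only, written kernel-style; it is
not the actual -bk code):

    /* Illustrative retry-until-it-succeeds allocation: never return
     * NULL for the buffer holding the in-core extent list, even when
     * it needs several contiguous pages.
     */
    #include <linux/slab.h>
    #include <linux/delay.h>

    static void *alloc_incore_extent_list(size_t nbytes)
    {
        void *buf;

        for (;;) {
            buf = kmalloc(nbytes, GFP_NOFS);
            if (buf)
                return buf;
            msleep(100);    /* back off and let the VM reclaim memory */
        }
    }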


> > So far we haven't been able to reliably reproduce the problem on
> > demand. It's a very active box. We reboot it, xfs_repair everything,
> > bring it up multiuser, and sometimes the same day, sometimes weeks
> > later, we find corruption. One time we went over a month without any
> > problems and thought the latest kernel upgrade had fixed it; then
> > it came back.
>
> That kind of randomness could be consistent with very low-memory
> situations.  It would be a good idea to keep an eye on the memory
> stats (lots of tools to export this info) to see if there is any
> correlation to VM strain and the failures you've been seeing.

We've been monitoring it with webminstats for some time now and have a 
good idea of its memory profile. It rarely uses swap.
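
If it would help pin down a correlation, I can also run a throwaway
sampler next to webminstats and line its timestamps up against the dmesg
shutdown messages. Something like the program below (just one way to
collect the numbers; the 60-second interval is arbitrary):

    /* Periodically record MemFree and SwapFree from /proc/meminfo,
     * prefixed with a timestamp, so memory pressure at the moment of
     * an XFS shutdown can be checked after the fact.
     */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        char line[256];

        for (;;) {
            FILE *fp = fopen("/proc/meminfo", "r");
            if (fp) {
                long now = (long)time(NULL);
                while (fgets(line, sizeof(line), fp)) {
                    if (strncmp(line, "MemFree:", 8) == 0 ||
                        strncmp(line, "SwapFree:", 9) == 0)
                        printf("%ld %s", now, line);
                }
                fclose(fp);
                fflush(stdout);
            }
            sleep(60);
        }
        return 0;
    }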

