
Re: debugging xfs issue (maybe)

To: Steve Wray <stevew@xxxxxxxxxxxxxxx>
Subject: Re: debugging xfs issue (maybe)
From: Nathan Scott <nathans@xxxxxxx>
Date: Thu, 8 Jul 2004 11:14:16 +1000
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <200407081247.11376.stevew@catalyst.net.nz>; from stevew@catalyst.net.nz on Thu, Jul 08, 2004 at 12:47:01PM +1200
References: <200407081154.25556.stevew@catalyst.net.nz> <20040708103041.D1946083@wobbly.melbourne.sgi.com> <200407081247.11376.stevew@catalyst.net.nz>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.2.5i
On Thu, Jul 08, 2004 at 12:47:01PM +1200, Steve Wray wrote:
> Well... the first thing we notice is that a nightly cron job returns things 
> like;
> 
> /etc/cron.daily/standard:
> find: /var/lib/postgres/data/base/141159285/141159709: Unknown error 990
> find: /var/lib/postgres/data/base/141159285/141159709: Unknown error 990
> find: /var/lib/postgres/data/base/141159285/141159709: Unknown error 990
> 
> and we look in dmesg and see things like;
> 
> Filesystem "dm-0": corrupt dinode 67279401, extent total = 1025, nblocks = 1. 
>  Unmount and run xfs_repair.

Hmm, that line is quite interesting - we shut down the filesystem
because that number of blocks is impossible with that number of
extents (each extent has to map at least one block).
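
To make the reasoning concrete, here is a tiny userland sketch of the
invariant that check enforces (not the actual kernel code - names and
details are made up for illustration):

  /* Every extent maps at least one filesystem block, so the extent
   * count can never legitimately exceed nblocks; the kernel does an
   * equivalent check when reading the inode and shuts down if it
   * fails. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t nextents = 1025;  /* "extent total" from the dmesg line */
          uint64_t nblocks  = 1;     /* "nblocks" from the dmesg line */

          if (nextents > nblocks)
                  printf("corrupt: %llu extents cannot fit in %llu blocks\n",
                         (unsigned long long)nextents,
                         (unsigned long long)nblocks);
          return 0;
  }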

What I suspect you have there is a very fragmented file (with 1025
extents) - this puts a lot of strain on the memory allocator, since
we need to allocate contiguous pages to hold the in-core extent
list.  In most kernels until very recently, these allocations could
fail in very low-memory situations and cause in-core memory
corruption - I wonder if that's what you have been seeing here.
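
Rough numbers, to show why contiguous pages come into it: the in-core
extent list is a single array, roughly 16 bytes per extent record
(take that record size as an assumption), so ~1025 extents means a
multi-page physically contiguous allocation:

  #include <stdio.h>

  int main(void)
  {
          const unsigned long rec_size  = 16;    /* bytes/extent (assumed) */
          const unsigned long nextents  = 1025;  /* from the dmesg line */
          const unsigned long page_size = 4096;

          unsigned long bytes = rec_size * nextents;
          unsigned long pages = (bytes + page_size - 1) / page_size;

          /* prints: 1025 extents -> 16400 bytes -> 5 contiguous pages */
          printf("%lu extents -> %lu bytes -> %lu contiguous pages\n",
                 nextents, bytes, pages);
          return 0;
  }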

FWIW, the current -bk tree has a memory allocator which will not
fail the inode-extent allocation described above.
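
The behaviour is essentially "retry until it succeeds" rather than
handing NULL back to the caller; a minimal userland sketch of the
idea (the helper name is made up, and the real kernel version backs
off and sleeps rather more carefully):

  #include <stdlib.h>
  #include <unistd.h>

  /* Keep retrying instead of failing the allocation under memory
   * pressure; callers never see NULL. */
  void *alloc_nofail(size_t size)
  {
          void *p;

          while ((p = malloc(size)) == NULL)
                  usleep(10000);  /* back off briefly, then retry */
          return p;
  }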

> > If you can find test cases from the problems you experienced,
> > that show going from a filesystem in known state A, applied a
> > sequence of operations X, Y and Z, and ended up with filesystem
> > or file data in bad state B - that is the sort of thing I can
> > best help out with, because I can then reproduce and analyse it
> > locally and figure out if XFS is at fault, etc.
> 
> So far we haven't been able to reliably reproduce the problem on demand.
> It's a very active box. We reboot it, xfs_repair everything,
> bring it up multiuser and sometimes the same day, sometimes weeks 
> later we find corruption. One time we went over a month without any
> problems and thought the latest kernel upgrade had fixed it, then
> it came back.

That kind of randomness could be consistent with very low-memory
situations.  It would be a good idea to keep an eye on the memory
stats (there are lots of tools to export this info) to see if there
is any correlation between VM strain and the failures you've been
seeing.
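
Even something as simple as vmstat/sar, or a trivial logger that
dumps the top of /proc/meminfo with a timestamp every minute, would
let you line the corruption times up against low-memory periods
afterwards.  A rough sketch:

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          char line[256];

          for (;;) {
                  FILE *f = fopen("/proc/meminfo", "r");
                  time_t now = time(NULL);
                  int i;

                  printf("--- %s", ctime(&now));
                  for (i = 0; i < 4 && f && fgets(line, sizeof(line), f); i++)
                          fputs(line, stdout);
                  if (f)
                          fclose(f);
                  fflush(stdout);
                  sleep(60);
          }
          return 0;
  }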

cheers.

-- 
Nathan

