
Re: problems showing up as XFS problems on kernels after 2.6.28-git2

To: Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
Subject: Re: problems showing up as XFS problems on kernels after 2.6.28-git2
From: Danny ter Haar <dth@xxxxxxx>
Date: Fri, 9 Jan 2009 02:26:10 +0100
In-reply-to: <20090109004609.GM9448@disturbed>
References: <20090107165218.GA11132@xxxxxxx> <20090107180246.GA15218@xxxxxxxxxxxxx> <20090107182415.GA12039@xxxxxxx> <20090107183115.GA6261@xxxxxxxxxxxxx> <20090107184420.GA15653@xxxxxxx> <20090107185628.GA19255@xxxxxxxxxxxxx> <20090108215602.GA24479@xxxxxxx> <20090109004609.GM9448@disturbed>
User-agent: Mutt/1.5.18 (2008-05-17)
Quoting Dave Chinner (david@xxxxxxxxxxxxx):
> Looking at this, I think there are two possibilities in terms of the
> problem being detected. We are modifying the inode BMBT here,
> so that means we have XFS_BTREE_ROOT_IN_INODE set. The corruption
> trigger has occurred because a xfs_btree_increment() call has
> returned a zero status. This means we failed here:
> 
> 1324         /* Fail if we just went off the right edge of the tree. */
> 1325         xfs_btree_get_sibling(cur, block, &ptr, XFS_BB_RIGHTSIB);
> 1326         if (xfs_btree_ptr_is_null(cur, &ptr))
> 1327                 goto out0;
> 
> or here:
> 
> 1351         /*
> 1352          * If we went off the root then we are either seriously
> 1353          * confused or have the tree root in an inode.
> 1354          */
> 1355         if (lev == cur->bc_nlevels) {
> 1356                 if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
> 1357                         goto out0;
> 1358                 ASSERT(0);
> 
> i.e. we either fell off the right edge of the tree or went over the top
> of it.

> I can't really see how we've done either of those things unless the
> tree has been corrupted by a prior operation.
Sounds logical.

The first time it happened, I moved the primary HD to the secondary IDE connector,
connected a separate hard drive as the new master, installed a fresh Debian lenny
on that hard drive, and ran xfs_repair on all XFS filesystems: no errors.

> Given that each time it is aptitude that is causing the problem, can you
> prevent aptitude from running automatically on boot and run it manually?
> If you can reproduce the problem manually then we can move on to the
> next step....

I obviously wasn't clear.
Besides being my NAS, this machine is also the apt-cacher-ng server for all my
other machines here at home. The easiest way to trigger the error is often a
simple "aptitude update; aptitude -d dist-upgrade".
So when it barfed, I ran aptitude by hand, and that checks everything in the
cache at /var/cache/apt-cacher-ng, which is on sda6 (the root filesystem, on XFS).

So it doesn't barf right at boot; it takes a few minutes or even hours:

filer1:~# last -20 reboot
reboot   system boot  2.6.28-git2-d    Thu Jan  8 12:00 - current(05:18)    
reboot   system boot  2.6.28-git3-d    Thu Jan  8 11:31 - 11:59  (00:27)    
reboot   system boot  2.6.28-git3-d    Thu Jan  8 10:56 - 11:59  (01:02)    
reboot   system boot  2.6.28-git3-d    Thu Jan  8 10:44 - 10:54  (00:10)    
reboot   system boot  2.6.28-git3-d    Thu Jan  8 10:30 - 10:43  (00:12)    
reboot   system boot  2.6.28-git2      Wed Jan  7 15:08 - 10:28  (19:19)    
reboot   system boot  2.6.28-git9-d    Wed Jan  7 12:29 - 14:58  (02:29)    
reboot   system boot  2.6.28-git2      Wed Jan  7 10:08 - 12:27  (02:19)    
reboot   system boot  2.6.28-git9      Wed Jan  7 09:21 - 10:06  (00:45)    
reboot   system boot  2.6.28-git9      Wed Jan  7 08:42 - 10:06  (01:24)    
reboot   system boot  2.6.28-git2      Tue Jan  6 21:45 - 08:40  (10:55)    
reboot   system boot  2.6.28-git4      Tue Jan  6 21:27 - 08:40  (11:13)    
reboot   system boot  2.6.28-git4      Tue Jan  6 21:22 - 08:40  (11:18)

Sometimes the kernel barfs while accessing /dev/sdb1 or /dev/sdc1,
which are only accessed via Samba.

I can once more install the "other" Debian lenny hard drive, boot from it,
and then manually run xfs_repair on the XFS filesystems.
I can then boot a kernel that is known to barf and try to make it barf.

> > So (in my case) something while going from git2 -> git3 didn't go positive.
> That would have been when Linus did the XFS pull...

Do you want me to find out which patch between git2 and git3 is the culprit?
That will take a while of compiling and rebooting.

Tell me what else I can do to help resolve this.

Danny
-- 
