
Re: Kernel bug when running xfs_fsr

To: karn@xxxxxxxx
Subject: Re: Kernel bug when running xfs_fsr
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Fri, 20 May 2011 11:05:38 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <BANLkTi=YSBY5Zq5ePCLZ2mLY70YEw=Yv7w@xxxxxxxxxxxxxx>
References: <BANLkTi=YSBY5Zq5ePCLZ2mLY70YEw=Yv7w@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)

On Thu, May 19, 2011 at 03:35:04PM -0700, Phil Karn wrote:
> I just got the following on my console each time I invoked xfs_fsr on a XFS
> file system. The file system resides on a OCZ SSD that I've been having
> problems with. This morning my system deadlocked while running a program
> that created and deleted many small files on the SSD (a Perl script feeding
> a large number of email messages one at a time to procmail). I suspect bad
> garbage collection algorithms in the SSD; I recovered by booting into single
> user and running wiper.sh on the file system to replenish the drive's pool
> of erased pages. Since then I've been running wiper.sh regularly to ensure a
> sufficient erased page pool in the SSD. I had just run it when I ran
> xfs_fsr.
> 
> So it's possible that my file system data structures are messed up. However,
> the system otherwise seems normal, and I've been routinely tagging my files
> with extended attributes containing their SHA-1 hashes so I can check their
> integrity. So far my checks haven't found any corrupted files.
> 
> Here is the relevant output from my kernel log. Is this a XFS bug, or does
> it simply indicate a corrupted file system due to my earlier crash?
> 
> [29847.045684] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000018

0x18 is 24 decimal, so this is a dereference of a field 24 bytes from
the start of a structure, through a NULL pointer to that structure.

> [29847.045690] IP: [<ffffffffa033c11b>] xfs_trans_log_inode+0xb/0x30 [xfs]

Three structures possible: xfs_inode, xfs_trans, xfs_inode_log_item:

xfs_trans_log_inode(
        xfs_trans_t     *tp,
        xfs_inode_t     *ip,
        uint            flags)
{
        ASSERT(ip->i_transp == tp);
        ASSERT(ip->i_itemp != NULL);
        ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));

        tp->t_flags |= XFS_TRANS_DIRTY;
        ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY;

And the situation is that ip->i_itemp->ili_item.li_desc == NULL:

typedef struct xfs_log_item {
        struct list_head                li_ail;         /* AIL pointers */
        xfs_lsn_t                       li_lsn;         /* last on-disk lsn */
        struct xfs_log_item_desc        *li_desc;       /* ptr to current desc*/
.....

That should not happen - the inode should be linked into the
transaction (tp), and li_desc should never be NULL here.
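
For anyone wanting to check the offset arithmetic, here is a minimal
userspace sketch. It is not kernel code: the demo_* names are made up
for illustration, and it assumes a list_head is two pointers and
xfs_lsn_t is a signed 64-bit integer. Under those assumptions, on a
64-bit build li_desc sits 24 bytes (0x18) into the structure quoted
above, and a field read through a NULL base pointer touches an address
equal to the field's offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel types; assumed layouts, not copied from the tree. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};
typedef int64_t demo_lsn_t;
struct demo_log_item_desc;	/* opaque here, only a pointer to it is needed */

/* Only the leading fields quoted above; the remainder stays elided. */
struct demo_log_item {
	struct demo_list_head		li_ail;		/* AIL pointers */
	demo_lsn_t			li_lsn;		/* last on-disk lsn */
	struct demo_log_item_desc	*li_desc;	/* ptr to current desc */
};

/* Byte offset of li_desc from the start of the structure. */
static size_t demo_li_desc_offset(void)
{
	return offsetof(struct demo_log_item, li_desc);
}

/* Address that is touched when a field at 'offset' is accessed via 'base'. */
static uintptr_t demo_field_address(uintptr_t base, size_t offset)
{
	return base + offset;
}

/* Print the offset and the address a NULL-based access would fault at. */
static void demo_report(void)
{
	size_t off = demo_li_desc_offset();

	assert(off < sizeof(struct demo_log_item));
	printf("li_desc offset: %zu (0x%zx)\n", off, off);
	printf("fault address with NULL base: 0x%016lx\n",
	       (unsigned long)demo_field_address(0, off));
}
```

Calling demo_report() from a small main() prints the offset for
whatever ABI the sketch is compiled for. The real structures should be
checked against the running kernel's source, since layouts change
between versions.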

Are you running with CONFIG_XFS_DEBUG=y? If not, it is probably
worthwhile as it should catch the problems more precisely before
a NULL pointer dereference occurs.
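
As a rough illustration of what checking first buys you, here is a
userspace sketch of validating each link in the pointer chain before
touching it. Everything in it (the demo_* types, DEMO_DIRTY, the status
codes) is hypothetical and much simpler than the real XFS structures,
and it returns a status code where the kernel's ASSERT would stop the
machine. It is not how CONFIG_XFS_DEBUG is implemented, only the idea
of failing at the first broken precondition.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the transaction, inode and log item. */
struct demo_desc  { unsigned char flags; };
struct demo_item  { struct demo_desc *desc; };
struct demo_trans { unsigned int flags; };
struct demo_inode {
	struct demo_trans	*transp;	/* transaction the inode is joined to */
	struct demo_item	*itemp;		/* inode log item */
};

#define DEMO_DIRTY	0x1

enum demo_status {
	DEMO_OK = 0,
	DEMO_WRONG_TRANS,	/* inode is not joined to this transaction */
	DEMO_NO_ITEM,		/* inode has no log item */
	DEMO_NO_DESC,		/* log item has no descriptor */
};

/* Check each precondition in order and name the first one that fails. */
static enum demo_status demo_check_log_inode(const struct demo_trans *tp,
					     const struct demo_inode *ip)
{
	if (ip->transp != tp)
		return DEMO_WRONG_TRANS;
	if (ip->itemp == NULL)
		return DEMO_NO_ITEM;
	if (ip->itemp->desc == NULL)
		return DEMO_NO_DESC;
	return DEMO_OK;
}

/* Mark the transaction and descriptor dirty only if every check passed. */
static enum demo_status demo_log_inode(struct demo_trans *tp,
				       struct demo_inode *ip)
{
	enum demo_status st = demo_check_log_inode(tp, ip);

	if (st != DEMO_OK)
		return st;	/* report the failure instead of dereferencing NULL */

	tp->flags |= DEMO_DIRTY;
	ip->itemp->desc->flags |= DEMO_DIRTY;
	return DEMO_OK;
}
```

With a NULL descriptor the sketch returns DEMO_NO_DESC and leaves the
flags alone, which is the kind of early, specific failure a debug build
is meant to give.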

> and so on...it repeats a few times because I issued the xfs_fsr command a
> few times.

So it is reproducible? Can you turn on the xfs_swapext tracepoints
and gather the output over a failure, as well as using xfs_fsr -v -d
and capturing that output? That might indicate that there is a
specific inode extent swap configuration that triggers this problem
that I haven't realised exists.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
