What to do when... xfs_repair hangs?

Sean Caron scaron at umich.edu
Sun Jun 1 11:21:55 CDT 2014


Sorry, all, I was a little out of it on Friday afternoon. Of course I had
actually kicked off xfs_repair in the background with all output sent to a
file, and I was just doing 'tail -f' on that file.

So when I killed the 'tail -f' and jumped back to the command line, it
turned out that xfs_repair had segfaulted and died.
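
With a backgrounded job whose output only goes to a file, a crash looks the
same as a hang from the 'tail -f' side. A minimal sketch of a wrapper that
also writes how the command ended into the log is below. The name run_logged,
the log path and the device path are hypothetical placeholders, not anything
from xfsprogs.

```shell
# run_logged LOGFILE CMD [ARGS...]
# Hypothetical helper: runs CMD with stdout and stderr appended to LOGFILE,
# then prints (and logs) a one-line summary of how CMD ended.
run_logged() {
    log=$1
    shift
    "$@" >>"$log" 2>&1
    rc=$?
    if [ "$rc" -gt 128 ]; then
        # The shell reports death by signal as 128 + signal number,
        # so a SIGSEGV (signal 11) shows up as 139.
        echo "terminated by signal $((rc - 128))" | tee -a "$log"
    else
        echo "exited with status $rc" | tee -a "$log"
    fi
    return "$rc"
}

# Example invocation (paths are placeholders):
#   run_logged /root/xfs_repair.log xfs_repair /dev/md0 &
#   tail -f /root/xfs_repair.log
```

With this in place, the last line of the log says whether the process is
still running (no summary yet), exited normally, or was killed by a signal.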

That line of text:

disconnected inode 1109099673,

was indeed the last thing that it printed before it crashed.

If I look in dmesg, I just see -

xfs_repair[6770]: segfault at 28 ip 000000000042307b sp 00007fffef61bad0 error 4 in xfs_repair[400000+72000]

and that's it.
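
For reference, that dmesg line carries a bit more than it seems. On x86,
"segfault at 28" is the faulting address (0x28), "ip" is the instruction
pointer, "error 4" is the page-fault error code, and "[400000+72000]" is the
base and size of the mapping the ip fell in. The sketch below decodes those
fields; the function names are made up for illustration, and the bit meanings
assumed are the standard x86 ones (bit 0: protection violation, bit 1: write,
bit 2: user mode).

```shell
# fault_offset IP BASE
# Prints the offset of the faulting instruction within the mapping,
# taking the hex fields as they appear in dmesg (no 0x prefix).
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# decode_pf_error CODE
# Decodes the low three bits of an x86 page-fault error code.
decode_pf_error() {
    code=$1
    if [ $((code & 4)) -ne 0 ]; then mode=user; else mode=kernel; fi
    if [ $((code & 2)) -ne 0 ]; then access=write; else access=read; fi
    if [ $((code & 1)) -ne 0 ]; then
        cause="protection violation"
    else
        cause="page not present"
    fi
    echo "$mode-mode $access, $cause"
}

fault_offset 000000000042307b 400000   # prints 0x2307b
decode_pf_error 4                      # prints user-mode read, page not present

# With debug symbols installed, the ip can be mapped to a source line.
# The binary path is an assumption:
#   addr2line -f -e /sbin/xfs_repair 0x42307b
```

A user-mode read of an unmapped page at an address as small as 0x28 often
points to a NULL pointer dereference (reading a structure member at offset
0x28), which would suggest a bug in xfs_repair rather than a resource
problem, though that is only an inference from the numbers.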

I checked with 'df' and there's plenty of space everywhere; I don't see why
it would have faulted out trying to connect something to lost+found.

Underlying storage should be good; this is basically a RAID 60 built on top
of a bunch of JBODs with LSI SAS9200 cards. MD sees all strings as started
and running OK; no problems getting the array assembled at all.
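
One way to double-check the MD side is to scan /proc/mdstat for arrays with
a missing member. In that file each array's status line ends in a map such
as [UUUU], where "_" marks a failed or absent device. The helper below is a
rough sketch with a made-up name, not an mdadm feature.

```shell
# md_degraded [FILE]
# Prints the name of every MD array whose member map contains "_".
# Reads /proc/mdstat unless another file in the same format is given.
md_degraded() {
    awk '
        /^md/                { name = $1 }     # e.g. "md0 : active raid6 ..."
        /\[[U_]*_[U_]*\]/    { print name }    # e.g. "[4/3] [UUU_]"
    ' "${1:-/proc/mdstat}"
}

# Example: no output means no array reports a missing member.
#   md_degraded
```

This only reflects what MD itself reports. It says nothing about latent media
errors on the member disks, for which the kernel log and SMART data would be
the places to look.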

Since Dave says it's OK to re-run xfs_repair, and that the next run will
repair anything the last one didn't get to, let me give it another pass and
see whether it manages to complete or segfaults again. I guess if it poops
out a second time, maybe we'll just want to consider rebuilding the
filesystem and restoring from our copies?

Thanks for the feedback,

Sean



On Fri, May 30, 2014 at 8:01 PM, Dave Chinner <david at fromorbit.com> wrote:

> On Fri, May 30, 2014 at 03:49:13PM -0400, Sean Caron wrote:
> > Hi all,
> >
> > Long story short, we have a big array formatted as XFS, we had a machine go
> > down hard maybe a month, month and a half ago... when it came back up, XFS
> > faulted out when we attempted to mount the filesystem; it complained the
> > log was bad or something... I did a dry run of xfs_repair (-L) and it
> > looked pretty bad, so we mounted up the filesystem read-only, ran a
> > backup... I think we got pretty much everything out OK except maybe files
> > that were open at the time of the crash.
> >
> > Now with a backup in hand, we kicked off xfs_repair "for real"... it ran
> > for a while and did its thing, but now it appears to be stuck at the stage -
> >
> > - agno = 436
> > rebuilding directory inode ...
> > rebuilding directory inode ...
> > rebuilding directory inode ...
> > ...
> > - traversal finished ...
> > - moving disconnected inodes to lost+found ...
> > disconnected inode 1109099673,
> >
> > and then it just stops. I don't know how long it's been sitting like that,
> > but it hasn't moved in the last hour or two. I assume that's not good...
>
> Is that the total of the last line of output? If so, it's likely
> stuck creating the lost+found directory. It's possible there's a
> corruption in the inode AVL tree (e.g. endless loop) that is causing
> it to spin doing an inode record lookup, but otherwise I can't see
> any reason for it getting stuck here.
>
> The information that Brian asked for will be a good start in
> tracking this down, as will the complete output of xfs_repair...
>
> > Interestingly when we ran a dry run of xfs_repair (-L) it got all the way
> > through; it never hung up at any point. Not sure why it would start to hang
> > up, once it gets run "for real".
>
> That's because a dry-run skips the "move to lost+found" phase.
>
> > This machine is in single-user-mode, I have exactly 24 lines of console
> > with no scrollback buffer, no other tty available besides that which I'm
> > running xfs_repair on, the system console.
>
> $ man script
>
> or
>
> $ man tee
>
> > Running Linux kernel 3.4.61, Ubuntu 12.04 LTS 64-bit with whatever their
> > current xfsprogs is.
>
> Upgrading xfsprogs to 3.2.0 would be a good idea.
>
> > This is a bit of an exceptional situation for me; I've never seen
> > xfs_repair just hang outright. I hoped I could maybe get some feedback from
> > the experts here... what should I do?
> >
> > Try to Control-C out of the xfs_repair and ... re-run it?
>
> That's fine - the next time repair runs it will start again and
> repair anything that wasn't repaired in the last run.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david at fromorbit.com
>