<div dir="ltr"><div>Hi Brian,<br></div><div>Yes, this directory (./crashtest/.glusterfs/indices) is glusterfs index-base directory for glusterfs features/index xlator.The ./crashtest/.glusterfs/indices/xattrop directory is XATTROP_SUBDIR directory for glusterfs. This path can be set in glusterfs server-side volume file so before there're any progresses I'll try to isolate this issue by try to set it to a ext4 path or somewhere else.<br>
</div><div>Thank you.<br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/4/20 Brian Foster <span dir="ltr"><<a href="mailto:bfoster@redhat.com" target="_blank">bfoster@redhat.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 04/20/2013 07:52 AM, ·ûÓÀÌÎ wrote:<br>
> Hi Brian,<br>
> Here is the file:<br>
> find . -inum 539274<br>
> ./crashtest/.glusterfs/indices/xattrop<br>
</div>> [<a href="mailto:root@10.23.72.94">root@10.23.72.94</a> <mailto:<a href="mailto:root@10.23.72.94">root@10.23.72.94</a>> xfsd]# ls -lih<br>
<div class="im">> ./crashtest/.glusterfs/indices/xattrop<br>
> total 0<br>
> 539275 ---------- 2 root root 0 Apr 20 17:17 132ef294-71d1-4435-8daa-aa002e67cb6e<br>
> 539275 ---------- 2 root root 0 Apr 20 17:17 xattrop-f3ad589a-b8dc-4416-ab84-fc9ad4033540<br>
> find . -inum 539275<br>
> ./crashtest/.glusterfs/indices/xattrop/xattrop-f3ad589a-b8dc-4416-ab84-fc9ad4033540<br>
> ./crashtest/.glusterfs/indices/xattrop/132ef294-71d1-4435-8daa-aa002e67cb6e<br>
> I'm not sure whether this is normal or glusterfs has fallen into an infinite loop. Is<br>
> there a chance that the kernel falls into a dead loop?<br>
> I'll study it.<br>
><br>
<br>
</div>Very interesting, thank you. I don't have the full context here yet, but<br>
the short of it is that this particular indices/xattrop directory is<br>
managed by a backend translator and driven by replication. It appears to<br>
be doing some kind of transaction level tracking based on links. E.g.,<br>
some quick behavioral observations:<br>
<br>
- This directory is created and an xattrop-#### file created (no size).<br>
- On starting a large sequential write from a client, I start observing<br>
a continuous sequence of xfs_link/xfs_remove operations via tracing.<br>
- The backend appears to create a link to the xattrop-#### file for every<br>
replication transaction (handwave) with the name of the link referring<br>
to the gfid name of the file under write (but note again the link is not<br>
to the inode under write, but to this special xattrop-#### file). It then<br>
apparently removes this link on transaction completion and the process<br>
repeats. I suspect this has to do with identifying what files were under<br>
modification in the event of a crash before transaction completion, but<br>
the underlying pattern/workload is what's more important to us here for<br>
the time being...<br>
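<br>
As a concrete illustration, each transaction appears to boil down to something like the following user-space sketch (my approximation from the tracing, not the actual index translator code; the paths are the ones from the listing earlier in the thread):<br>
<br>
#include <unistd.h><br>
<br>
int main(void)<br>
{<br>
        const char *base = "./crashtest/.glusterfs/indices/xattrop/"<br>
                           "xattrop-f3ad589a-b8dc-4416-ab84-fc9ad4033540";<br>
        const char *gfid = "./crashtest/.glusterfs/indices/xattrop/"<br>
                           "132ef294-71d1-4435-8daa-aa002e67cb6e";<br>
<br>
        for (int i = 0; i < 100000; i++) {<br>
                link(base, gfid);       /* transaction start: xfs_link on the dir */<br>
                /* ... replication transaction on the file under write ... */<br>
                unlink(gfid);           /* transaction done: xfs_remove of the link */<br>
        }<br>
        return 0;<br>
}<br>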
<br>
So we're seeing a heavy link related workload on a directory inode<br>
(0x83a8a) where the entries are all links to the same inode (0x83a8b).<br>
These inodes happen to be adjacent, which may or may not be a<br>
factor. The translator that generates these link/unlink ops sits right<br>
above a generic thread pool translator, so this is multi-threaded.<br>
<br>
What isn't clear to me yet is where the xfs_iunlink() for this heavily<br>
linked inode is induced. The primary dentry remains after my file copy<br>
test, but then taking another look after a few minutes I see it removed.<br>
The same thing occurs if I gracefully restart the volume. I'm going to<br>
have to dig into that some more and also see if we can use this to<br>
narrow in on a reproducer. I'm thinking something along the lines of:<br>
<br>
- Create a directory/file. Ideally the directory and file inodes are in<br>
the same cluster.<br>
- Start a highly-threaded link-unlink workload against that file in the<br>
same directory.<br>
- Somewhere in the background unlink the main file.<br>
- Check for multiple xfs_iunlink() ops, repeat.<br>
<br>
... the assumption being that the xfs_iunlink() race could have led to<br>
corruption of the associated unlinked list, such that a<br>
later inactivation/xfs_iunlink_remove of some other inode in that bucket<br>
could fail.<br>
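<br>
As a first pass, a sketch of such a reproducer might look like this (a guess at a trigger, not a confirmed test case; names, counts and timings are arbitrary; build with cc -pthread):<br>
<br>
#include <fcntl.h><br>
#include <pthread.h><br>
#include <stdio.h><br>
#include <sys/stat.h><br>
#include <unistd.h><br>
<br>
#define NTHREADS 8<br>
<br>
static void *hammer(void *arg)<br>
{<br>
        char name[64];<br>
<br>
        snprintf(name, sizeof(name), "testdir/link-%ld", (long)arg);<br>
        for (int i = 0; i < 100000; i++) {<br>
                /* mimics the per-transaction index link; fails harmlessly<br>
                 * with ENOENT once the primary name is gone */<br>
                link("testdir/victim", name);<br>
                unlink(name);<br>
        }<br>
        return NULL;<br>
}<br>
<br>
int main(void)<br>
{<br>
        pthread_t tid[NTHREADS];<br>
<br>
        mkdir("testdir", 0755);<br>
        close(open("testdir/victim", O_CREAT | O_RDWR, 0644));<br>
<br>
        for (long i = 0; i < NTHREADS; i++)<br>
                pthread_create(tid + i, NULL, hammer, (void *)i);<br>
<br>
        usleep(1000);<br>
        unlink("testdir/victim");       /* drop the primary name mid-storm */<br>
<br>
        for (int i = 0; i < NTHREADS; i++)<br>
                pthread_join(tid[i], NULL);<br>
<br>
        /* then check tracing for multiple xfs_iunlink hits on the victim<br>
         * inode, and repeat */<br>
        return 0;<br>
}<br>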
<br>
Brian<br>
<br>
><br>
> 2013/4/20 Brian Foster <<a href="mailto:bfoster@redhat.com">bfoster@redhat.com</a> <mailto:<a href="mailto:bfoster@redhat.com">bfoster@redhat.com</a>>><br>
<div><div class="h5">><br>
> On 04/20/2013 06:10 AM, 符永涛 wrote:<br>
> > Dear Eric,<br>
> > I have applied your latest patch and collected the following log:<br>
> ><br>
> > /var/log/message<br>
> > Apr 20 17:28:23 10 kernel: XFS (sdb): xfs_iunlink_remove: xfs_inotobp() returned error 22 for inode 0x1b20b ag 0 agino 1b20b<br>
> > Apr 20 17:28:23 10 kernel:<br>
> > Apr 20 17:28:23 10 kernel: XFS (sdb): xfs_inactive: xfs_ifree returned error 22<br>
> > Apr 20 17:28:23 10 kernel: XFS (sdb): xfs_do_force_shutdown(0x1) called from line 1184 of file fs/xfs/xfs_vnodeops.c. Return address = 0xffffffffa02d4d0a<br>
> > Apr 20 17:28:23 10 kernel: XFS (sdb): I/O Error Detected. Shutting down filesystem<br>
> > Apr 20 17:28:23 10 kernel: XFS (sdb): Please umount the filesystem and rectify the problem(s)<br>
> > Apr 20 17:28:37 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.<br>
> > Apr 20 17:29:07 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.<br>
> > Apr 20 17:29:37 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.<br>
> > Apr 20 17:30:07 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.<br>
> ><br>
> > debugfs trace:<br>
> ><br>
> <a href="https://docs.google.com/file/d/0B7n2C4T5tfNCTlZGUVpnZENrZ3M/edit?usp=sharing" target="_blank">https://docs.google.com/file/d/0B7n2C4T5tfNCTlZGUVpnZENrZ3M/edit?usp=sharing</a><br>
> ><br>
><br>
> FWIW...<br>
><br>
> <...>-6908 [001] 8739.967623: xfs_iunlink: dev 8:16 ino 0x83a8b mode 0100000, flags 0x0<br>
> <...>-6909 [001] 8739.970252: xfs_iunlink: dev 8:16 ino 0x83a8b mode 0100000, flags 0x0<br>
><br>
> 0x83a8b and 0x1b20b both hash to unlinked list bucket 11.<br>
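><br>
> (As a sanity check on the bucket math: xfs_iunlink() picks the AGI bucket as<br>
> agino % XFS_AGI_UNLINKED_BUCKETS, and there are 64 buckets. A standalone<br>
> sketch:)<br>
><br>
> #include <stdio.h><br>
><br>
> #define XFS_AGI_UNLINKED_BUCKETS 64<br>
><br>
> int main(void)<br>
> {<br>
>         unsigned long aginos[] = { 0x83a8b, 0x1b20b };<br>
><br>
>         for (int i = 0; i < 2; i++)<br>
>                 printf("agino 0x%lx -> bucket %lu\n", aginos[i],<br>
>                        aginos[i] % XFS_AGI_UNLINKED_BUCKETS);  /* both print 11 */<br>
>         return 0;<br>
> }<br>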
><br>
> As to the rest of the trace, there appears to be a significant amount of<br>
> link activity on (directory) inode 0x83a8a (the immediately prior inode<br>
> to the inode involved in the race). The name data in the trace suggests<br>
> activity somewhere under .glusterfs. A couple questions:<br>
><br>
> 1.) Any idea what entries point to this inode right now (e.g., how many<br>
> links on this inode) and where it resides in the fs (path)?<br>
><br>
> 2.) Can you associate this kind of heavy remove/link pattern on a single<br>
> inode to a higher level activity? For example, if you were to watch the<br>
> trace data live, is this a normal pattern you observe? Does it only<br>
> occur when a rebalance is in progress? Or when a rebalance finishes? Any<br>
> detailed observations you can make in that regard could be helpful.<br>
><br>
> Brian<br>
><br>
> > Thank you.<br>
> ><br>
> ><br>
> > 2013/4/20 符永涛 <<a href="mailto:yongtaofu@gmail.com">yongtaofu@gmail.com</a> <mailto:<a href="mailto:yongtaofu@gmail.com">yongtaofu@gmail.com</a>><br>
</div></div>> <mailto:<a href="mailto:yongtaofu@gmail.com">yongtaofu@gmail.com</a> <mailto:<a href="mailto:yongtaofu@gmail.com">yongtaofu@gmail.com</a>>>><br>
<div class="im">> ><br>
> > Hi Eric,<br>
> > The xfs module is loaded from the system kernel, and the problem happens on our<br>
> > production server too (I have not touched that one yet). If the xfs<br>
> > module were messed up, systemtap would probably not work either, but it<br>
> > does work. As you mentioned, the strange thing is that the xfs shutdown always<br>
> > happens when the glusterfs rebalance completes.<br>
> ><br>
> ><br>
> > 2013/4/20 Eric Sandeen <<a href="mailto:sandeen@sandeen.net">sandeen@sandeen.net</a><br>
> <mailto:<a href="mailto:sandeen@sandeen.net">sandeen@sandeen.net</a>><br>
</div>> > <mailto:<a href="mailto:sandeen@sandeen.net">sandeen@sandeen.net</a> <mailto:<a href="mailto:sandeen@sandeen.net">sandeen@sandeen.net</a>>>><br>
<div class="HOEnZb"><div class="h5">> ><br>
> > On 4/19/13 9:03 PM, 符永涛 wrote:<br>
> > > Hi Eric,<br>
> > > I will enable them and run the test again. I can only reproduce it<br>
> > > with glusterfs rebalance. Glusterfs uses a mechanism it calls syncop<br>
> > > to unlink files. For rebalance it uses<br>
> > > syncop_unlink (glusterfs/libglusterfs/src/syncop.c). In the glusterfs<br>
> > > sync_task framework (glusterfs/libglusterfs/src/syncop.c) it uses<br>
> > > "makecontext/swapcontext"<br>
> > > <<a href="http://www.opengroup.org/onlinepubs/009695399/functions/makecontext.html" target="_blank">http://www.opengroup.org/onlinepubs/009695399/functions/makecontext.html</a>>.<br>
> > > Does it lead to racing unlinks from different CPU cores?<br>
> ><br>
> > Yep, I understand that it's rebalance. It dies when rebalance<br>
> > finishes because an<br>
> > open but unlinked file trips over the corrupted list from<br>
> > earlier, it seems.<br>
> ><br>
> > I don't know why makecontext would matter...<br>
> ><br>
> > Just to be sure, you are definitely loading the xfs module from<br>
> > the kernel you built, right, and you don't have a "priority"<br>
> > module getting loaded from elsewhere? Seems unlikely, but just<br>
> > to be sure.<br>
> ><br>
> > > Thank you.<br>
> ><br>
> > You could also add this patch to the xfs tracepoints to print<br>
> > more information about the inodes - the mode & flags.<br>
> ><br>
> > -Eric<br>
> ><br>
> ><br>
> > diff --git a/fs/xfs/linux-2.6/xfs_trace.h<br>
> > b/fs/xfs/linux-2.6/xfs_trace.h<br>
> > index e8ce644..c314b87 100644<br>
> > --- a/fs/xfs/linux-2.6/xfs_trace.h<br>
> > +++ b/fs/xfs/linux-2.6/xfs_trace.h<br>
> > @@ -544,14 +544,18 @@ DECLARE_EVENT_CLASS(xfs_inode_class,<br>
> > TP_STRUCT__entry(<br>
> > __field(dev_t, dev)<br>
> > __field(xfs_ino_t, ino)<br>
> > + __field(__u16, mode)<br>
> > + __field(unsigned long, flags)<br>
> > ),<br>
> > TP_fast_assign(<br>
> > __entry->dev = VFS_I(ip)->i_sb->s_dev;<br>
> > __entry->ino = ip->i_ino;<br>
> > + __entry->mode = VFS_I(ip)->i_mode;<br>
> > + __entry->flags = ip->i_flags;<br>
> > ),<br>
> > - TP_printk("dev %d:%d ino 0x%llx",<br>
> > + TP_printk("dev %d:%d ino 0x%llx mode 0%o, flags<br>
> 0x%lx",<br>
> > MAJOR(__entry->dev), MINOR(__entry->dev),<br>
> > - __entry->ino)<br>
> > + __entry->ino, __entry->mode, __entry->flags)<br>
> > )<br>
> ><br>
> > #define DEFINE_INODE_EVENT(name) \<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > 符永涛<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > 符永涛<br>
><br>
><br>
><br>
><br>
> --<br>
> 符永涛<br>
<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>符永涛
</div>