XFS corruption with failover

John Quigley jquigley at jquigley.com
Thu Aug 13 20:06:18 CDT 2009


Eric Sandeen wrote:
> Are you sure?
> 
>                 if (ohead->oh_clientid != XFS_TRANSACTION &&
>                     ohead->oh_clientid != XFS_LOG) {
>                         xlog_warn(
>                 "XFS: xlog_recover_process_data: bad clientid");
>                         ASSERT(0);
>                         return (XFS_ERROR(EIO));
>                 }
> 
> so it does say EIO but that seems to me to be the wrong error; looks more
> like a bad log to me.

Hey Eric:

That would certainly be consistent with our experience, as the only way we're able to bring the file system back online is by zeroing the log.

> It does make me wonder if there's any sort of per-initiator caching on
> the iscsi target or something.  </handwave>

There isn't, as mentioned above, though we have several intermediate layers between the file system and the iSCSI initiator, including multipath and LVM, both of which I was initially suspicious of.  In testing a similar scenario in a more isolated fashion, without those two intermediate layers, the behavior was still present.  Also, just to clarify the topology:

                  /-----[Failover Secondary]------\
                 /                                 \
 NFS Client ----/                                   \-----[iSCSI Target]----[Distributed Storage]
                \                                   /
                 \                                 /
                  \-----[Failover Primary]--------/

Those two failover machines, Primary and Secondary, each act as the NFS server, the XFS mount point, and the iSCSI initiator.  Only one failover machine at a time is logged into the iSCSI target and has XFS mounted.

Thanks very much for your cycles on this, guys.

- John Quigley



