xfs
[Top] [All Lists]

Re: Please (help) improve support for linux-vserver

To: Andras Korn <korn-xfs@xxxxxxxxxxxxxx>
Subject: Re: Please (help) improve support for linux-vserver
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon, 17 Dec 2012 11:01:24 +1100
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20121216223953.GL18563@hellgate>
References: <20121216223953.GL18563@hellgate>
User-agent: Mutt/1.5.21 (2010-09-15)
On Sun, Dec 16, 2012 at 11:39:53PM +0100, Andras Korn wrote:
> Hi,
> 
> I'm a user of xfs and linux-vserver (http://linux-vserver.org/).
> 
> I'd like vserver to work better with xfs (or vice versa) and am trying to
> proxy between the two development communities (which in the case of vserver
> is not very large).
> 
> vservers are basically chroots on steroids: the host runs a single kernel,
> but it isolates processes running in the guests from each other. There is a
> "copy on write hardlink breaking" feature that allows you to hardlink files
> (such as libc6) of different guests together (so that they only get mapped
> into memory once), and have the kernel break the link if the inode is opened
> for writing (by creating a copy and returning a FD to the copy).
> 
> This feature relies on inode flags (like the 'immutable' bit). vserver adds
> two fields to the inode (the other is used to tag inodes with a context ID).
> 
> The kernel parts work, but xfs_repair breaks such filesystems because it
> thinks the flags are invalid.
> 
> I approached David Chinner and Eric Sandeen about this on IRC, and they said
> that the first step towards any improvement would be to share with this list
> the parts of the vserver patch that affect xfs, so that's what I'm doing
> now.
> 
> Please find attached the output of
> 
> filterdiff -i '*xfs*' patch-3.7-vs2.3.5.1.diff

Best to include patches in line so they are easy to quote and
comment on.

The main issue here is that it includes lots of bits that aren't in
the mainline kernel, so there are bits that we cannot push into the
mainline kernel.

What we really need to do is get the bits that change the on-disk
format formalised, and then we can ensure that the userspace tools
know about these bits and can query/validate them properly.

So, what htat means is that we need to sort out bits like:

> --- linux-3.7/fs/xfs/xfs_dinode.h     2012-10-04 13:27:44.000000000 +0000
> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_dinode.h   2012-12-11 15:56:32.000000000 
> +0000
> @@ -51,7 +51,9 @@ typedef struct xfs_dinode {
>       __be32          di_nlink;       /* number of links to file */
>       __be16          di_projid_lo;   /* lower part of owner's project id */
>       __be16          di_projid_hi;   /* higher part owner's project id */
> -     __u8            di_pad[6];      /* unused, zeroed space */
> +     __u8            di_pad[2];      /* unused, zeroed space */
> +     __be16          di_tag;         /* context tagging */
> +     __be16          di_vflags;      /* vserver specific flags */
>       __be16          di_flushiter;   /* incremented on flush */
>       xfs_timestamp_t di_atime;       /* time last accessed */
>       xfs_timestamp_t di_mtime;       /* time last modified */
> @@ -184,6 +186,8 @@ static inline void xfs_dinode_put_rdev(s
>  #define XFS_DIFLAG_EXTSZINHERIT_BIT 12       /* inherit inode extent size */
>  #define XFS_DIFLAG_NODEFRAG_BIT     13       /* do not reorganize/defragment 
> */
>  #define XFS_DIFLAG_FILESTREAM_BIT   14  /* use filestream allocator */
> +#define XFS_DIFLAG_IXUNLINK_BIT     15       /* Immutable inver on unlink */
> +
>  #define XFS_DIFLAG_REALTIME      (1 << XFS_DIFLAG_REALTIME_BIT)
>  #define XFS_DIFLAG_PREALLOC      (1 << XFS_DIFLAG_PREALLOC_BIT)
>  #define XFS_DIFLAG_NEWRTBM       (1 << XFS_DIFLAG_NEWRTBM_BIT)
> @@ -199,6 +203,7 @@ static inline void xfs_dinode_put_rdev(s
>  #define XFS_DIFLAG_EXTSZINHERIT  (1 << XFS_DIFLAG_EXTSZINHERIT_BIT)
>  #define XFS_DIFLAG_NODEFRAG      (1 << XFS_DIFLAG_NODEFRAG_BIT)
>  #define XFS_DIFLAG_FILESTREAM    (1 << XFS_DIFLAG_FILESTREAM_BIT)
> +#define XFS_DIFLAG_IXUNLINK      (1 << XFS_DIFLAG_IXUNLINK_BIT)
>  
>  #ifdef CONFIG_XFS_RT
>  #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME)
> @@ -211,6 +216,10 @@ static inline void xfs_dinode_put_rdev(s
>        XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \
>        XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \
>        XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \
> -      XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM)
> +      XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM 
> | \
> +      XFS_DIFLAG_IXUNLINK)
> +
> +#define XFS_DIVFLAG_BARRIER  0x01
> +#define XFS_DIVFLAG_COW              0x02
>  

These definitions and on disk format changes, and wrap them in a
manner that allows us to support them without issue.

The main problem I see is that these bits are specific to vserver so
if they are set on a mainline kernel/filesytsem, they should never
be set. That means we really need a superblock feature bit to
indicate whether these changes are valid or not. A bit issue is this
bit:

#define XFS_DIFLAG_IXUNLINK_BIT     15  /* Immutable inver on unlink */

Which takes the last bit of the di_flags field. We've been reserving
that bit to be used as a "more bits bit" to indicate the presence of
an extra flags field in the inode code (similar to the superblock
XFS_SB_VERSION_MOREBITSBIT feature bit). That would allow us not to
have to use a superblock feature bit to indicate the precense of the
new field in the inode.

Unfortunately, this bit is taken, so AFAICT the only way we can
merge this into the upstream code is to add a superblock feature bit
at the same time. But that then makes the upstream code still
incompatible with the vserver code, because none of the vserver
filesystems will have the feature bit set. That can be worked around
(e.g. the vserver patchset includes a new piece of code that sets
the feature bit on mount), but it does make it somewhat difficult to
cleanly support the extensions that vserver has made...

To complicate things further, then new inode format for CRC changes
already has a new flags field added to it. Basically, I was not
intending to add a new flags field to the existing inode format
*ever* as new feature bits woul donly be supported on new format
filesystems. I'm not sure what the best way to reconcile this is - I
really don't want to have to support 3 separate flags fields, 2 of
which are optional...

> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_inode.c    2012-12-11 22:20:23.000000000 
> +0000
> @@ -16,6 +16,7 @@
>   * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>   */
>  #include <linux/log2.h>
> +#include <linux/vs_tag.h>
>  
>  #include "xfs.h"
>  #include "xfs_fs.h"
> @@ -563,15 +564,25 @@ xfs_iformat_btree(
>  STATIC void
>  xfs_dinode_from_disk(
>       xfs_icdinode_t          *to,
> -     xfs_dinode_t            *from)
> +     xfs_dinode_t            *from,
> +     int                     tagged)
>  {
> +     uint32_t uid, gid, tag;
> +
>       to->di_magic = be16_to_cpu(from->di_magic);
>       to->di_mode = be16_to_cpu(from->di_mode);
>       to->di_version = from ->di_version;
>       to->di_format = from->di_format;
>       to->di_onlink = be16_to_cpu(from->di_onlink);
> -     to->di_uid = be32_to_cpu(from->di_uid);
> -     to->di_gid = be32_to_cpu(from->di_gid);
> +
> +     uid = be32_to_cpu(from->di_uid);
> +     gid = be32_to_cpu(from->di_gid);
> +     tag = be16_to_cpu(from->di_tag);
> +
> +     to->di_uid = INOTAG_UID(tagged, uid, gid);
> +     to->di_gid = INOTAG_GID(tagged, uid, gid);
> +     to->di_tag = INOTAG_TAG(tagged, uid, gid, tag);

Changes like this will still have to exist solely in the vserver
tree as they rely on core changes in the vserver tree. What is
important is that on-disk formats are compatible between the two
trees....

> @@ -669,6 +689,10 @@ _xfs_dic2xflags(
>                       flags |= XFS_XFLAG_FILESTREAM;
>       }
>  
> +     if (di_vflags & XFS_DIVFLAG_BARRIER)
> +             flags |= FS_BARRIER_FL;
> +     if (di_vflags & XFS_DIVFLAG_COW)
> +             flags |= FS_COW_FL;
>       return flags;

Similarly for bits like this.

> --- linux-3.7/fs/xfs/xfs_ioctl.c      2012-12-11 15:47:37.000000000 +0000
> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.c    2012-12-11 15:56:32.000000000 
> +0000
> @@ -26,7 +26,7 @@
>  #include "xfs_bmap_btree.h"
>  #include "xfs_dinode.h"
>  #include "xfs_inode.h"
> -#include "xfs_ioctl.h"
> +// #include "xfs_ioctl.h"
>  #include "xfs_rtalloc.h"
>  #include "xfs_itable.h"
>  #include "xfs_error.h"

That doesn't look right....

> @@ -892,6 +900,10 @@ xfs_diflags_to_linux(
>               inode->i_flags |= S_IMMUTABLE;
>       else
>               inode->i_flags &= ~S_IMMUTABLE;
> +     if (xflags & XFS_XFLAG_IXUNLINK)
> +             inode->i_flags |= S_IXUNLINK;
> +     else
> +             inode->i_flags &= ~S_IXUNLINK;
>       if (xflags & XFS_XFLAG_APPEND)
>               inode->i_flags |= S_APPEND;
>       else
> @@ -1396,10 +1408,18 @@ xfs_file_ioctl(
>       case XFS_IOC_FSGETXATTRA:
>               return xfs_ioc_fsgetxattr(ip, 1, arg);
>       case XFS_IOC_FSSETXATTR:
> +             if (IS_BARRIER(inode)) {
> +                     vxwprintk_task(1, "messing with the barrier.");
> +                     return -XFS_ERROR(EACCES);
> +             }
>               return xfs_ioc_fssetxattr(ip, filp, arg);
>       case XFS_IOC_GETXFLAGS:
>               return xfs_ioc_getxflags(ip, arg);
>       case XFS_IOC_SETXFLAGS:
> +             if (IS_BARRIER(inode)) {
> +                     vxwprintk_task(1, "messing with the barrier.");
> +                     return -XFS_ERROR(EACCES);
> +             }
>               return xfs_ioc_setxflags(ip, filp, arg);

And these all rely on vserver infrastructure, so would have to
remain in the vserver tree....

>  
>       case XFS_IOC_FSSETDM: {
> --- linux-3.7/fs/xfs/xfs_ioctl.h      2011-10-24 16:45:31.000000000 +0000
> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.h    2012-12-11 15:56:32.000000000 
> +0000
> @@ -70,6 +70,12 @@ xfs_handle_to_dentry(
>       void __user             *uhandle,
>       u32                     hlen);
>  
> +extern int
> +xfs_sync_flags(
> +     struct inode            *inode,
> +     int                     flags,
> +     int                     vflags);
> +
>  extern long
>  xfs_file_ioctl(
>       struct file             *filp,
> --- linux-3.7/fs/xfs/xfs_iops.c       2012-10-04 13:27:44.000000000 +0000
> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_iops.c     2012-12-11 15:56:32.000000000 
> +0000
> @@ -28,6 +28,7 @@
>  #include "xfs_bmap_btree.h"
>  #include "xfs_dinode.h"
>  #include "xfs_inode.h"
> +#include "xfs_ioctl.h"
>  #include "xfs_bmap.h"
>  #include "xfs_rtalloc.h"
>  #include "xfs_error.h"
> @@ -46,6 +47,7 @@
>  #include <linux/security.h>
>  #include <linux/fiemap.h>
>  #include <linux/slab.h>
> +#include <linux/vs_tag.h>
>  
>  static int
>  xfs_initxattrs(
> @@ -421,6 +423,7 @@ xfs_vn_getattr(
>       stat->nlink = ip->i_d.di_nlink;
>       stat->uid = ip->i_d.di_uid;
>       stat->gid = ip->i_d.di_gid;
> +     stat->tag = ip->i_d.di_tag;
>       stat->ino = ip->i_ino;
>       stat->atime = inode->i_atime;
>       stat->mtime = inode->i_mtime;
> @@ -1033,6 +1036,7 @@ static const struct inode_operations xfs
>       .listxattr              = xfs_vn_listxattr,
>       .fiemap                 = xfs_vn_fiemap,
>       .update_time            = xfs_vn_update_time,
> +     .sync_flags             = xfs_sync_flags,
>  };
>  
>  static const struct inode_operations xfs_dir_inode_operations = {
> @@ -1059,6 +1063,7 @@ static const struct inode_operations xfs
>       .removexattr            = generic_removexattr,
>       .listxattr              = xfs_vn_listxattr,
>       .update_time            = xfs_vn_update_time,
> +     .sync_flags             = xfs_sync_flags,
>  };

as would all these "sync_flag" changes.

> --- linux-3.7/fs/xfs/xfs_super.c      2012-12-11 15:47:37.000000000 +0000
> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_super.c    2012-12-11 17:36:47.000000000 
> +0000
> @@ -114,6 +114,9 @@ mempool_t *xfs_ioend_pool;
>  #define MNTOPT_NODELAYLOG  "nodelaylog"      /* Delayed logging disabled */
>  #define MNTOPT_DISCARD          "discard"    /* Discard unused blocks */
>  #define MNTOPT_NODISCARD   "nodiscard"       /* Do not discard unused blocks 
> */
> +#define MNTOPT_TAGXID        "tagxid"        /* context tagging for inodes */
> +#define MNTOPT_TAGGED        "tag"           /* context tagging for inodes */
> +#define MNTOPT_NOTAGTAG      "notag"         /* do not use context tagging */
.....
> @@ -1149,6 +1170,16 @@ xfs_fs_remount(
>               case Opt_inode32:
>                       mp->m_maxagi = xfs_set_inode32(mp);
>                       break;
> +             case Opt_tag:
> +                     if (!(sb->s_flags & MS_TAGGED)) {
> +                             printk(KERN_INFO
> +                                     "XFS: %s: tagging not permitted on 
> remount.\n",
> +                                     sb->s_id);
> +                             return -EINVAL;
> +                     }
> +                     break;
> +             case Opt_notag:
> +                     break;
>               default:
>                       /*
>                        * Logically we would return an error here to prevent
> @@ -1368,6 +1399,9 @@ xfs_fs_fill_super(
>       if (error)
>               goto out_free_sb;
>  
> +     if (mp->m_flags & XFS_MOUNT_TAGGED)
> +             sb->s_flags |= MS_TAGGED;
> +

Why wouldn't you use vfs-based mount option parsing here and hence
not need XFS_MOUNT_TAGGED or all the duplicated parsing in each
filesystem? This seems like you all this could be removed from the
patch, and the XFS code just checks if (mp->m_super->s_flags &
MS_TAGGED) is true....

>       /*
>        * we must configure the block size in the superblock before we run the
>        * full mount process as the mount process can lookup and cache inodes.
> --- linux-3.7/fs/xfs/xfs_vnodeops.c   2012-10-04 13:27:44.000000000 +0000
> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_vnodeops.c 2012-12-11 15:56:32.000000000 
> +0000
> @@ -103,6 +103,77 @@ xfs_readlink_bmap(
>       return error;
>  }
>  
> +
> +STATIC void
> +xfs_get_inode_flags(
> +     xfs_inode_t     *ip)
> +{
> +     struct inode    *inode = VFS_I(ip);
> +     unsigned int    flags = inode->i_flags;
> +     unsigned int    vflags = inode->i_vflags;
> +
> +     if (flags & S_IMMUTABLE)
> +             ip->i_d.di_flags |= XFS_DIFLAG_IMMUTABLE;
> +     else
> +             ip->i_d.di_flags &= ~XFS_DIFLAG_IMMUTABLE;
> +     if (flags & S_IXUNLINK)
> +             ip->i_d.di_flags |= XFS_DIFLAG_IXUNLINK;
> +     else
> +             ip->i_d.di_flags &= ~XFS_DIFLAG_IXUNLINK;
> +
> +     if (vflags & V_BARRIER)
> +             ip->i_d.di_vflags |= XFS_DIVFLAG_BARRIER;
> +     else
> +             ip->i_d.di_vflags &= ~XFS_DIVFLAG_BARRIER;
> +     if (vflags & V_COW)
> +             ip->i_d.di_vflags |= XFS_DIVFLAG_COW;
> +     else
> +             ip->i_d.di_vflags &= ~XFS_DIVFLAG_COW;
> +}
> +
> +int
> +xfs_sync_flags(
> +     struct inode            *inode,
> +     int                     flags,
> +     int                     vflags)
> +{
> +     struct xfs_inode        *ip = XFS_I(inode);
> +     struct xfs_mount        *mp = ip->i_mount;
> +     struct xfs_trans        *tp;
> +     unsigned int            lock_flags = 0;
> +     int                     code;
> +
> +     tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
> +     code = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0);
> +     if (code)
> +             goto error_out;
> +
> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
> +     xfs_trans_ijoin(tp, ip, 0);
> +
> +     inode->i_flags = flags;
> +     inode->i_vflags = vflags;
> +     xfs_get_inode_flags(ip);
> +
> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +     xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
> +
> +     XFS_STATS_INC(xs_ig_attrchg);
> +
> +     if (mp->m_flags & XFS_MOUNT_WSYNC)
> +             xfs_trans_set_sync(tp);
> +     code = xfs_trans_commit(tp, 0);
> +     xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +     return code;
> +
> +error_out:
> +     xfs_trans_cancel(tp, 0);
> +     if (lock_flags)
> +             xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +     return code;
> +}

Seems strange to ad a "sync_flags" method like this setting/clearing
the flags already has a generic method via FS_IOC_SETFLAGS.
Regardless, this is something that would need to be kept in the
vserver tree...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>