xfs
[Top] [All Lists]

Re: Please (help) improve support for linux-vserver

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: Please (help) improve support for linux-vserver
From: Herbert Poetzl <herbert@xxxxxxxxxxxx>
Date: Mon, 17 Dec 2012 13:20:48 +0100
Cc: Andras Korn <korn-xfs@xxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20121217000124.GP9806@dastard>
References: <20121216223953.GL18563@hellgate> <20121217000124.GP9806@dastard>
User-agent: Mutt/1.5.11
On Mon, Dec 17, 2012 at 11:01:24AM +1100, Dave Chinner wrote:
> On Sun, Dec 16, 2012 at 11:39:53PM +0100, Andras Korn wrote:
>> Hi,

>> I'm a user of xfs and linux-vserver (http://linux-vserver.org/).

>> I'd like vserver to work better with xfs (or vice versa) and am trying to
>> proxy between the two development communities (which in the case of vserver
>> is not very large).

>> vservers are basically chroots on steroids: the host runs a single kernel,
>> but it isolates processes running in the guests from each other. There is a
>> "copy on write hardlink breaking" feature that allows you to hardlink files
>> (such as libc6) of different guests together (so that they only get mapped
>> into memory once), and have the kernel break the link if the inode is opened
>> for writing (by creating a copy and returning a FD to the copy).

>> This feature relies on inode flags (like the 'immutable' bit). vserver adds
>> two fields to the inode (the other is used to tag inodes with a context ID).

>> The kernel parts work, but xfs_repair breaks such filesystems because it
>> thinks the flags are invalid.

>> I approached David Chinner and Eric Sandeen about this on IRC, and they said
>> that the first step towards any improvement would be to share with this list
>> the parts of the vserver patch that affect xfs, so that's what I'm doing
>> now.

>> Please find attached the output of

>> filterdiff -i '*xfs*' patch-3.7-vs2.3.5.1.diff

> Best to include patches in line so they are easy to quote and
> comment on.

> The main issue here is that it includes lots of bits that aren't in
> the mainline kernel, so there are bits that we cannot push into the
> mainline kernel.

> What we really need to do is get the bits that change the on-disk
> format formalised, and then we can ensure that the userspace tools
> know about these bits and can query/validate them properly.

> So, what htat means is that we need to sort out bits like:

>> --- linux-3.7/fs/xfs/xfs_dinode.h    2012-10-04 13:27:44.000000000 +0000
>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_dinode.h  2012-12-11 15:56:32.000000000 
>> +0000
>> @@ -51,7 +51,9 @@ typedef struct xfs_dinode {
>>      __be32          di_nlink;       /* number of links to file */
>>      __be16          di_projid_lo;   /* lower part of owner's project id */
>>      __be16          di_projid_hi;   /* higher part owner's project id */
>> -    __u8            di_pad[6];      /* unused, zeroed space */
>> +    __u8            di_pad[2];      /* unused, zeroed space */
>> +    __be16          di_tag;         /* context tagging */
>> +    __be16          di_vflags;      /* vserver specific flags */
>>      __be16          di_flushiter;   /* incremented on flush */
>>      xfs_timestamp_t di_atime;       /* time last accessed */
>>      xfs_timestamp_t di_mtime;       /* time last modified */
>> @@ -184,6 +186,8 @@ static inline void xfs_dinode_put_rdev(s
>>  #define XFS_DIFLAG_EXTSZINHERIT_BIT 12      /* inherit inode extent size */
>>  #define XFS_DIFLAG_NODEFRAG_BIT     13      /* do not reorganize/defragment 
>> */
>>  #define XFS_DIFLAG_FILESTREAM_BIT   14  /* use filestream allocator */
>> +#define XFS_DIFLAG_IXUNLINK_BIT     15      /* Immutable inver on unlink */
>> +
>>  #define XFS_DIFLAG_REALTIME      (1 << XFS_DIFLAG_REALTIME_BIT)
>>  #define XFS_DIFLAG_PREALLOC      (1 << XFS_DIFLAG_PREALLOC_BIT)
>>  #define XFS_DIFLAG_NEWRTBM       (1 << XFS_DIFLAG_NEWRTBM_BIT)
>> @@ -199,6 +203,7 @@ static inline void xfs_dinode_put_rdev(s
>>  #define XFS_DIFLAG_EXTSZINHERIT  (1 << XFS_DIFLAG_EXTSZINHERIT_BIT)
>>  #define XFS_DIFLAG_NODEFRAG      (1 << XFS_DIFLAG_NODEFRAG_BIT)
>>  #define XFS_DIFLAG_FILESTREAM    (1 << XFS_DIFLAG_FILESTREAM_BIT)
>> +#define XFS_DIFLAG_IXUNLINK      (1 << XFS_DIFLAG_IXUNLINK_BIT)

>>  #ifdef CONFIG_XFS_RT
>>  #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME)
>> @@ -211,6 +216,10 @@ static inline void xfs_dinode_put_rdev(s
>>       XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \
>>       XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \
>>       XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \
>> -     XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM)
>> +     XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM 
>> | \
>> +     XFS_DIFLAG_IXUNLINK)
>> +
>> +#define XFS_DIVFLAG_BARRIER 0x01
>> +#define XFS_DIVFLAG_COW             0x02


> These definitions and on disk format changes, and wrap them in
> a manner that allows us to support them without issue.

> The main problem I see is that these bits are specific to
> vserver so if they are set on a mainline kernel/filesytsem,
> they should never be set. That means we really need a
> superblock feature bit to indicate whether these changes are
> valid or not. 

okay, I don't see a problem with that, any suggestions
how to implement this? any good examples?

> A bit issue is this bit:
> #define XFS_DIFLAG_IXUNLINK_BIT     15        /* Immutable inver on unlink */

> Which takes the last bit of the di_flags field. We've been
> reserving that bit to be used as a "more bits bit" to indicate
> the presence of an extra flags field in the inode code (similar
> to the superblock XFS_SB_VERSION_MOREBITSBIT feature bit). That
> would allow us not to have to use a superblock feature bit to
> indicate the precense of the new field in the inode.

we already introduced the DIVFLAGs, so I don't see a
problem moving the IXUNLINK into that field as well,
10 years ago, we assumed that this bit will become
part of mainline, but we were wrong on that :)

> Unfortunately, this bit is taken, so AFAICT the only way we
> can merge this into the upstream code is to add a superblock
> feature bit at the same time. But that then makes the upstream
> code still incompatible with the vserver code, because none of
> the vserver filesystems will have the feature bit set. That
> can be worked around (e.g. the vserver patchset includes a new
> piece of code that sets the feature bit on mount), but it does
> make it somewhat difficult to cleanly support the extensions
> that vserver has made...

I think we can either set it on mount or have it set
by a special tool (or by an xfs specific tool) manually

> To complicate things further, then new inode format for CRC
> changes already has a new flags field added to it. Basically,
> I was not intending to add a new flags field to the existing
> inode format *ever* as new feature bits woul donly be supported
> on new format filesystems. I'm not sure what the best way to
> reconcile this is - I really don't want to have to support 3
> separate flags fields, 2 of which are optional...

as it seems, we need 3 bits at most (for the flag field)
and I don't think that anybody cares which bits those
are, as long as they do not change too often, so any 3
bits would do and we could easily adapt the code to use
them like the DIVFLAGs

>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_inode.c   2012-12-11 22:20:23.000000000 
>> +0000
>> @@ -16,6 +16,7 @@
>>   * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>>   */
>>  #include <linux/log2.h>
>> +#include <linux/vs_tag.h>

>>  #include "xfs.h"
>>  #include "xfs_fs.h"
>> @@ -563,15 +564,25 @@ xfs_iformat_btree(
>>  STATIC void
>>  xfs_dinode_from_disk(
>>      xfs_icdinode_t          *to,
>> -    xfs_dinode_t            *from)
>> +    xfs_dinode_t            *from,
>> +    int                     tagged)
>>  {
>> +    uint32_t uid, gid, tag;
>> +
>>      to->di_magic = be16_to_cpu(from->di_magic);
>>      to->di_mode = be16_to_cpu(from->di_mode);
>>      to->di_version = from ->di_version;
>>      to->di_format = from->di_format;
>>      to->di_onlink = be16_to_cpu(from->di_onlink);
>> -    to->di_uid = be32_to_cpu(from->di_uid);
>> -    to->di_gid = be32_to_cpu(from->di_gid);
>> +
>> +    uid = be32_to_cpu(from->di_uid);
>> +    gid = be32_to_cpu(from->di_gid);
>> +    tag = be16_to_cpu(from->di_tag);
>> +
>> +    to->di_uid = INOTAG_UID(tagged, uid, gid);
>> +    to->di_gid = INOTAG_GID(tagged, uid, gid);
>> +    to->di_tag = INOTAG_TAG(tagged, uid, gid, tag);

> Changes like this will still have to exist solely in the
> vserver tree as they rely on core changes in the vserver tree.

I completely agree, those changes do not make any sense
outside the Linux-VServer code

> What is important is that on-disk formats are compatible
> between the two trees....

the filesystem tagging needs 16 bits for the tags, but
might in the future be replaced by project ids, so I'd
suggest to use one of the pad fields for now and leave
that as a future exercise ...

>> @@ -669,6 +689,10 @@ _xfs_dic2xflags(
>>                      flags |= XFS_XFLAG_FILESTREAM;
>>      }

>> +    if (di_vflags & XFS_DIVFLAG_BARRIER)
>> +            flags |= FS_BARRIER_FL;
>> +    if (di_vflags & XFS_DIVFLAG_COW)
>> +            flags |= FS_COW_FL;
>>      return flags;

> Similarly for bits like this.

>> --- linux-3.7/fs/xfs/xfs_ioctl.c     2012-12-11 15:47:37.000000000 +0000
>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.c   2012-12-11 15:56:32.000000000 
>> +0000
>> @@ -26,7 +26,7 @@
>>  #include "xfs_bmap_btree.h"
>>  #include "xfs_dinode.h"
>>  #include "xfs_inode.h"
>> -#include "xfs_ioctl.h"
>> +// #include "xfs_ioctl.h"
>>  #include "xfs_rtalloc.h"
>>  #include "xfs_itable.h"
>>  #include "xfs_error.h"

> That doesn't look right....

IIRC, it was done to avoid problems with recursive
inclusion, and it still seems to work, so that include
is not really required ...

>> @@ -892,6 +900,10 @@ xfs_diflags_to_linux(
>>              inode->i_flags |= S_IMMUTABLE;
>>      else
>>              inode->i_flags &= ~S_IMMUTABLE;
>> +    if (xflags & XFS_XFLAG_IXUNLINK)
>> +            inode->i_flags |= S_IXUNLINK;
>> +    else
>> +            inode->i_flags &= ~S_IXUNLINK;
>>      if (xflags & XFS_XFLAG_APPEND)
>>              inode->i_flags |= S_APPEND;
>>      else
>> @@ -1396,10 +1408,18 @@ xfs_file_ioctl(
>>      case XFS_IOC_FSGETXATTRA:
>>              return xfs_ioc_fsgetxattr(ip, 1, arg);
>>      case XFS_IOC_FSSETXATTR:
>> +            if (IS_BARRIER(inode)) {
>> +                    vxwprintk_task(1, "messing with the barrier.");
>> +                    return -XFS_ERROR(EACCES);
>> +            }
>>              return xfs_ioc_fssetxattr(ip, filp, arg);
>>      case XFS_IOC_GETXFLAGS:
>>              return xfs_ioc_getxflags(ip, arg);
>>      case XFS_IOC_SETXFLAGS:
>> +            if (IS_BARRIER(inode)) {
>> +                    vxwprintk_task(1, "messing with the barrier.");
>> +                    return -XFS_ERROR(EACCES);
>> +            }
>>              return xfs_ioc_setxflags(ip, filp, arg);

> And these all rely on vserver infrastructure, so would have to
> remain in the vserver tree....

agreed!

>>      case XFS_IOC_FSSETDM: {
>> --- linux-3.7/fs/xfs/xfs_ioctl.h     2011-10-24 16:45:31.000000000 +0000
>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.h   2012-12-11 15:56:32.000000000 
>> +0000
>> @@ -70,6 +70,12 @@ xfs_handle_to_dentry(
>>      void __user             *uhandle,
>>      u32                     hlen);

>> +extern int
>> +xfs_sync_flags(
>> +    struct inode            *inode,
>> +    int                     flags,
>> +    int                     vflags);
>> +
>>  extern long
>>  xfs_file_ioctl(
>>      struct file             *filp,
>> --- linux-3.7/fs/xfs/xfs_iops.c      2012-10-04 13:27:44.000000000 +0000
>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_iops.c    2012-12-11 15:56:32.000000000 
>> +0000
>> @@ -28,6 +28,7 @@
>>  #include "xfs_bmap_btree.h"
>>  #include "xfs_dinode.h"
>>  #include "xfs_inode.h"
>> +#include "xfs_ioctl.h"
>>  #include "xfs_bmap.h"
>>  #include "xfs_rtalloc.h"
>>  #include "xfs_error.h"
>> @@ -46,6 +47,7 @@
>>  #include <linux/security.h>
>>  #include <linux/fiemap.h>
>>  #include <linux/slab.h>
>> +#include <linux/vs_tag.h>

>>  static int
>>  xfs_initxattrs(
>> @@ -421,6 +423,7 @@ xfs_vn_getattr(
>>      stat->nlink = ip->i_d.di_nlink;
>>      stat->uid = ip->i_d.di_uid;
>>      stat->gid = ip->i_d.di_gid;
>> +    stat->tag = ip->i_d.di_tag;
>>      stat->ino = ip->i_ino;
>>      stat->atime = inode->i_atime;
>>      stat->mtime = inode->i_mtime;
>> @@ -1033,6 +1036,7 @@ static const struct inode_operations xfs
>>      .listxattr              = xfs_vn_listxattr,
>>      .fiemap                 = xfs_vn_fiemap,
>>      .update_time            = xfs_vn_update_time,
>> +    .sync_flags             = xfs_sync_flags,
>>  };

>>  static const struct inode_operations xfs_dir_inode_operations = {
>> @@ -1059,6 +1063,7 @@ static const struct inode_operations xfs
>>      .removexattr            = generic_removexattr,
>>      .listxattr              = xfs_vn_listxattr,
>>      .update_time            = xfs_vn_update_time,
>> +    .sync_flags             = xfs_sync_flags,
>>  };

> as would all these "sync_flag" changes.

>> --- linux-3.7/fs/xfs/xfs_super.c     2012-12-11 15:47:37.000000000 +0000
>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_super.c   2012-12-11 17:36:47.000000000 
>> +0000
>> @@ -114,6 +114,9 @@ mempool_t *xfs_ioend_pool;
>>  #define MNTOPT_NODELAYLOG  "nodelaylog"     /* Delayed logging disabled */
>>  #define MNTOPT_DISCARD         "discard"    /* Discard unused blocks */
>>  #define MNTOPT_NODISCARD   "nodiscard"      /* Do not discard unused blocks 
>> */
>> +#define MNTOPT_TAGXID       "tagxid"        /* context tagging for inodes */
>> +#define MNTOPT_TAGGED       "tag"           /* context tagging for inodes */
>> +#define MNTOPT_NOTAGTAG     "notag"         /* do not use context tagging */
> .....
>> @@ -1149,6 +1170,16 @@ xfs_fs_remount(
>>              case Opt_inode32:
>>                      mp->m_maxagi = xfs_set_inode32(mp);
>>                      break;
>> +            case Opt_tag:
>> +                    if (!(sb->s_flags & MS_TAGGED)) {
>> +                            printk(KERN_INFO
>> +                                    "XFS: %s: tagging not permitted on 
>> remount.\n",
>> +                                    sb->s_id);
>> +                            return -EINVAL;
>> +                    }
>> +                    break;
>> +            case Opt_notag:
>> +                    break;
>>              default:
>>                      /*
>>                       * Logically we would return an error here to prevent
>> @@ -1368,6 +1399,9 @@ xfs_fs_fill_super(
>>      if (error)
>>              goto out_free_sb;

>> +    if (mp->m_flags & XFS_MOUNT_TAGGED)
>> +            sb->s_flags |= MS_TAGGED;
>> +

> Why wouldn't you use vfs-based mount option parsing here
> and hence not need XFS_MOUNT_TAGGED or all the duplicated
> parsing in each filesystem? This seems like you all this could
> be removed from the patch, and the XFS code just checks if
> (mp->m_super->s_flags & MS_TAGGED) is true....

mainly because not all filesystems are able to support
tagging (there are certain requirements to make that work)
and handling exceptions on the vfs level is more complicated
than adding the options to the supported filesystems

>>      /*
>>       * we must configure the block size in the superblock before we run the
>>       * full mount process as the mount process can lookup and cache inodes.
>> --- linux-3.7/fs/xfs/xfs_vnodeops.c  2012-10-04 13:27:44.000000000 +0000
>> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_vnodeops.c        2012-12-11 
>> 15:56:32.000000000 +0000
>> @@ -103,6 +103,77 @@ xfs_readlink_bmap(
>>      return error;
>>  }

>> +
>> +STATIC void
>> +xfs_get_inode_flags(
>> +    xfs_inode_t     *ip)
>> +{
>> +    struct inode    *inode = VFS_I(ip);
>> +    unsigned int    flags = inode->i_flags;
>> +    unsigned int    vflags = inode->i_vflags;
>> +
>> +    if (flags & S_IMMUTABLE)
>> +            ip->i_d.di_flags |= XFS_DIFLAG_IMMUTABLE;
>> +    else
>> +            ip->i_d.di_flags &= ~XFS_DIFLAG_IMMUTABLE;
>> +    if (flags & S_IXUNLINK)
>> +            ip->i_d.di_flags |= XFS_DIFLAG_IXUNLINK;
>> +    else
>> +            ip->i_d.di_flags &= ~XFS_DIFLAG_IXUNLINK;
>> +
>> +    if (vflags & V_BARRIER)
>> +            ip->i_d.di_vflags |= XFS_DIVFLAG_BARRIER;
>> +    else
>> +            ip->i_d.di_vflags &= ~XFS_DIVFLAG_BARRIER;
>> +    if (vflags & V_COW)
>> +            ip->i_d.di_vflags |= XFS_DIVFLAG_COW;
>> +    else
>> +            ip->i_d.di_vflags &= ~XFS_DIVFLAG_COW;
>> +}
>> +
>> +int
>> +xfs_sync_flags(
>> +    struct inode            *inode,
>> +    int                     flags,
>> +    int                     vflags)
>> +{
>> +    struct xfs_inode        *ip = XFS_I(inode);
>> +    struct xfs_mount        *mp = ip->i_mount;
>> +    struct xfs_trans        *tp;
>> +    unsigned int            lock_flags = 0;
>> +    int                     code;
>> +
>> +    tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
>> +    code = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0);
>> +    if (code)
>> +            goto error_out;
>> +
>> +    xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +    xfs_trans_ijoin(tp, ip, 0);
>> +
>> +    inode->i_flags = flags;
>> +    inode->i_vflags = vflags;
>> +    xfs_get_inode_flags(ip);
>> +
>> +    xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>> +    xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
>> +
>> +    XFS_STATS_INC(xs_ig_attrchg);
>> +
>> +    if (mp->m_flags & XFS_MOUNT_WSYNC)
>> +            xfs_trans_set_sync(tp);
>> +    code = xfs_trans_commit(tp, 0);
>> +    xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +    return code;
>> +
>> +error_out:
>> +    xfs_trans_cancel(tp, 0);
>> +    if (lock_flags)
>> +            xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +    return code;
>> +}

> Seems strange to ad a "sync_flags" method like this
> setting/clearing the flags already has a generic method via
> FS_IOC_SETFLAGS. Regardless, this is something that would need
> to be kept in the vserver tree...

this was done because at least two of those flags in
question must not be modifiable from userspace, but I
have no problem to revisit this ... nevertheles, it
is Linux-VServer specific, so no real issue ...

thanks,
Herbert

> Cheers,

> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

<Prev in Thread] Current Thread [Next in Thread>