xfs
[Top] [All Lists]

Re: Please (help) improve support for linux-vserver

To: Herbert Poetzl <herbert@xxxxxxxxxxxx>
Subject: Re: Please (help) improve support for linux-vserver
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 18 Dec 2012 11:09:46 +1100
Cc: Andras Korn <korn-xfs@xxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20121217122048.GV17837@xxxxxxxxxxxxxxxxx>
References: <20121216223953.GL18563@hellgate> <20121217000124.GP9806@dastard> <20121217122048.GV17837@xxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Dec 17, 2012 at 01:20:48PM +0100, Herbert Poetzl wrote:
> On Mon, Dec 17, 2012 at 11:01:24AM +1100, Dave Chinner wrote:
> > On Sun, Dec 16, 2012 at 11:39:53PM +0100, Andras Korn wrote:
> >> Hi,
> 
> >> I'm a user of xfs and linux-vserver (http://linux-vserver.org/).
> 
> >> I'd like vserver to work better with xfs (or vice versa) and am trying to
> >> proxy between the two development communities (which in the case of vserver
> >> is not very large).
> 
> >> vservers are basically chroots on steroids: the host runs a single kernel,
> >> but it isolates processes running in the guests from each other. There is a
> >> "copy on write hardlink breaking" feature that allows you to hardlink files
> >> (such as libc6) of different guests together (so that they only get mapped
> >> into memory once), and have the kernel break the link if the inode is 
> >> opened
> >> for writing (by creating a copy and returning a FD to the copy).
> 
> >> This feature relies on inode flags (like the 'immutable' bit). vserver adds
> >> two fields to the inode (the other is used to tag inodes with a context 
> >> ID).
> 
> >> The kernel parts work, but xfs_repair breaks such filesystems because it
> >> thinks the flags are invalid.
> 
> >> I approached David Chinner and Eric Sandeen about this on IRC, and they 
> >> said
> >> that the first step towards any improvement would be to share with this 
> >> list
> >> the parts of the vserver patch that affect xfs, so that's what I'm doing
> >> now.
> 
> >> Please find attached the output of
> 
> >> filterdiff -i '*xfs*' patch-3.7-vs2.3.5.1.diff
> 
> > Best to include patches in line so they are easy to quote and
> > comment on.
> 
> > The main issue here is that it includes lots of bits that aren't in
> > the mainline kernel, so there are bits that we cannot push into the
> > mainline kernel.
> 
> > What we really need to do is get the bits that change the on-disk
> > format formalised, and then we can ensure that the userspace tools
> > know about these bits and can query/validate them properly.
> 
> > So, what htat means is that we need to sort out bits like:
> 
> >> --- linux-3.7/fs/xfs/xfs_dinode.h  2012-10-04 13:27:44.000000000 +0000
> >> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_dinode.h        2012-12-11 
> >> 15:56:32.000000000 +0000
> >> @@ -51,7 +51,9 @@ typedef struct xfs_dinode {
> >>    __be32          di_nlink;       /* number of links to file */
> >>    __be16          di_projid_lo;   /* lower part of owner's project id */
> >>    __be16          di_projid_hi;   /* higher part owner's project id */
> >> -  __u8            di_pad[6];      /* unused, zeroed space */
> >> +  __u8            di_pad[2];      /* unused, zeroed space */
> >> +  __be16          di_tag;         /* context tagging */
> >> +  __be16          di_vflags;      /* vserver specific flags */
> >>    __be16          di_flushiter;   /* incremented on flush */
> >>    xfs_timestamp_t di_atime;       /* time last accessed */
> >>    xfs_timestamp_t di_mtime;       /* time last modified */
> >> @@ -184,6 +186,8 @@ static inline void xfs_dinode_put_rdev(s
> >>  #define XFS_DIFLAG_EXTSZINHERIT_BIT 12    /* inherit inode extent size */
> >>  #define XFS_DIFLAG_NODEFRAG_BIT     13    /* do not reorganize/defragment 
> >> */
> >>  #define XFS_DIFLAG_FILESTREAM_BIT   14  /* use filestream allocator */
> >> +#define XFS_DIFLAG_IXUNLINK_BIT     15    /* Immutable inver on unlink */
> >> +
> >>  #define XFS_DIFLAG_REALTIME      (1 << XFS_DIFLAG_REALTIME_BIT)
> >>  #define XFS_DIFLAG_PREALLOC      (1 << XFS_DIFLAG_PREALLOC_BIT)
> >>  #define XFS_DIFLAG_NEWRTBM       (1 << XFS_DIFLAG_NEWRTBM_BIT)
> >> @@ -199,6 +203,7 @@ static inline void xfs_dinode_put_rdev(s
> >>  #define XFS_DIFLAG_EXTSZINHERIT  (1 << XFS_DIFLAG_EXTSZINHERIT_BIT)
> >>  #define XFS_DIFLAG_NODEFRAG      (1 << XFS_DIFLAG_NODEFRAG_BIT)
> >>  #define XFS_DIFLAG_FILESTREAM    (1 << XFS_DIFLAG_FILESTREAM_BIT)
> >> +#define XFS_DIFLAG_IXUNLINK      (1 << XFS_DIFLAG_IXUNLINK_BIT)
> 
> >>  #ifdef CONFIG_XFS_RT
> >>  #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & 
> >> XFS_DIFLAG_REALTIME)
> >> @@ -211,6 +216,10 @@ static inline void xfs_dinode_put_rdev(s
> >>     XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \
> >>     XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \
> >>     XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \
> >> -   XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM)
> >> +   XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM 
> >> | \
> >> +   XFS_DIFLAG_IXUNLINK)
> >> +
> >> +#define XFS_DIVFLAG_BARRIER       0x01
> >> +#define XFS_DIVFLAG_COW           0x02
> 
> 
> > These definitions and on disk format changes, and wrap them in
> > a manner that allows us to support them without issue.
> 
> > The main problem I see is that these bits are specific to
> > vserver so if they are set on a mainline kernel/filesytsem,
> > they should never be set. That means we really need a
> > superblock feature bit to indicate whether these changes are
> > valid or not. 
> 
> okay, I don't see a problem with that, any suggestions
> how to implement this? any good examples?

Have a look at the way the 32-bit project ID extension was
implemented.

$ $ git grep -i PROJID32BIT
fs/xfs/xfs_fsops.c:                     
(xfs_sb_version_hasprojid32bit(&mp->m_sb) 
fs/xfs/xfs_ioctl.c:      * Disallow 32bit project ids when projid32bit feature 
is 
fs/xfs/xfs_ioctl.c:                     
!xfs_sb_version_hasprojid32bit(&ip->i_moun
fs/xfs/xfs_sb.h:#define XFS_SB_VERSION2_PROJID32BIT     0x00000080      /* 32 
bit 
fs/xfs/xfs_sb.h:         XFS_SB_VERSION2_PROJID32BIT)
fs/xfs/xfs_sb.h:static inline int xfs_sb_version_hasprojid32bit(xfs_sb_t *sbp)
fs/xfs/xfs_sb.h:                (sbp->sb_features2 & 
XFS_SB_VERSION2_PROJID32BIT);

> > A bit issue is this bit:
> > #define XFS_DIFLAG_IXUNLINK_BIT     15      /* Immutable inver on unlink */
> 
> > Which takes the last bit of the di_flags field. We've been
> > reserving that bit to be used as a "more bits bit" to indicate
> > the presence of an extra flags field in the inode code (similar
> > to the superblock XFS_SB_VERSION_MOREBITSBIT feature bit). That
> > would allow us not to have to use a superblock feature bit to
> > indicate the precense of the new field in the inode.
> 
> we already introduced the DIVFLAGs, so I don't see a
> problem moving the IXUNLINK into that field as well,

The problem with doing that is that you break compatibility with all
existing filesystems that think the bit is where it currently is. We
should be trying to keep everything in the same location so there's
nothing that users see change...

> 10 years ago, we assumed that this bit will become
> part of mainline, but we were wrong on that :)

Well, that's all history, but for future reference you should engage
the relevant filesystem people before changing disk formats, not
after. ;)

> > Unfortunately, this bit is taken, so AFAICT the only way we
> > can merge this into the upstream code is to add a superblock
> > feature bit at the same time. But that then makes the upstream
> > code still incompatible with the vserver code, because none of
> > the vserver filesystems will have the feature bit set. That
> > can be worked around (e.g. the vserver patchset includes a new
> > piece of code that sets the feature bit on mount), but it does
> > make it somewhat difficult to cleanly support the extensions
> > that vserver has made...
> 
> I think we can either set it on mount or have it set
> by a special tool (or by an xfs specific tool) manually

I think that a special tool (xfs_admin) and setting it by default
for new fielsystems is probably the way we'll support it upstream,
but you can add a patch to set it on mount automatically for vserver
installations.

The reason I say this is that it seems simpler to me to add a
generic flag field to the inode and just reserve bits in the flag
field for vserver. That way we have extra flag space available for
upstream filesystems as well, and there's nothing really specific to
the vserver format at that point. We can then add a flag in the new
flag field to say whether the new tag field is valid or not.

I would have liked to make the new flags field 32 bits, but that's a
bit hard given the existing vserver format order of pad, tag, flags.
It could be done similar to the project ID with lo/hi fields, but I
think another 16 bits of flags is probably sufficient at this point
in time.

> > To complicate things further, then new inode format for CRC
> > changes already has a new flags field added to it. Basically,
> > I was not intending to add a new flags field to the existing
> > inode format *ever* as new feature bits woul donly be supported
> > on new format filesystems. I'm not sure what the best way to
> > reconcile this is - I really don't want to have to support 3
> > separate flags fields, 2 of which are optional...
> 
> as it seems, we need 3 bits at most (for the flag field)
> and I don't think that anybody cares which bits those
> are, as long as they do not change too often, so any 3
> bits would do and we could easily adapt the code to use
> them like the DIVFLAGs

Once they are set, they won't get moved at all.

> > What is important is that on-disk formats are compatible
> > between the two trees....
> 
> the filesystem tagging needs 16 bits for the tags, but
> might in the future be replaced by project ids, so I'd
> suggest to use one of the pad fields for now and leave
> that as a future exercise ...

Right, and potentially use a flag bit to indicate that the tag field
is in use? i.e. if the tag is non-zero, then the inode flag bit gets
set, and it it is returned to zero the bit if cleared (similar to
the di_extsize field and the XFS_DIFLAG_EXTSIZE_BIT handling)?

> >> --- linux-3.7/fs/xfs/xfs_ioctl.c   2012-12-11 15:47:37.000000000 +0000
> >> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.c 2012-12-11 15:56:32.000000000 
> >> +0000
> >> @@ -26,7 +26,7 @@
> >>  #include "xfs_bmap_btree.h"
> >>  #include "xfs_dinode.h"
> >>  #include "xfs_inode.h"
> >> -#include "xfs_ioctl.h"
> >> +// #include "xfs_ioctl.h"
> >>  #include "xfs_rtalloc.h"
> >>  #include "xfs_itable.h"
> >>  #include "xfs_error.h"
> 
> > That doesn't look right....
> 
> IIRC, it was done to avoid problems with recursive
> inclusion, and it still seems to work, so that include
> is not really required ...

The headers should protect against recursive inclusion already. If
they don't, then that's a bug we need to fix.

> >> --- linux-3.7/fs/xfs/xfs_super.c   2012-12-11 15:47:37.000000000 +0000
> >> +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_super.c 2012-12-11 17:36:47.000000000 
> >> +0000
> >> @@ -114,6 +114,9 @@ mempool_t *xfs_ioend_pool;
> >>  #define MNTOPT_NODELAYLOG  "nodelaylog"   /* Delayed logging disabled */
> >>  #define MNTOPT_DISCARD       "discard"    /* Discard unused blocks */
> >>  #define MNTOPT_NODISCARD   "nodiscard"    /* Do not discard unused blocks 
> >> */
> >> +#define MNTOPT_TAGXID     "tagxid"        /* context tagging for inodes */
> >> +#define MNTOPT_TAGGED     "tag"           /* context tagging for inodes */
> >> +#define MNTOPT_NOTAGTAG   "notag"         /* do not use context tagging */
> > .....
> >> @@ -1149,6 +1170,16 @@ xfs_fs_remount(
> >>            case Opt_inode32:
> >>                    mp->m_maxagi = xfs_set_inode32(mp);
> >>                    break;
> >> +          case Opt_tag:
> >> +                  if (!(sb->s_flags & MS_TAGGED)) {
> >> +                          printk(KERN_INFO
> >> +                                  "XFS: %s: tagging not permitted on 
> >> remount.\n",
> >> +                                  sb->s_id);
> >> +                          return -EINVAL;
> >> +                  }
> >> +                  break;
> >> +          case Opt_notag:
> >> +                  break;
> >>            default:
> >>                    /*
> >>                     * Logically we would return an error here to prevent
> >> @@ -1368,6 +1399,9 @@ xfs_fs_fill_super(
> >>    if (error)
> >>            goto out_free_sb;
> 
> >> +  if (mp->m_flags & XFS_MOUNT_TAGGED)
> >> +          sb->s_flags |= MS_TAGGED;
> >> +
> 
> > Why wouldn't you use vfs-based mount option parsing here
> > and hence not need XFS_MOUNT_TAGGED or all the duplicated
> > parsing in each filesystem? This seems like you all this could
> > be removed from the patch, and the XFS code just checks if
> > (mp->m_super->s_flags & MS_TAGGED) is true....
> 
> mainly because not all filesystems are able to support
> tagging (there are certain requirements to make that work)
> and handling exceptions on the vfs level is more complicated
> than adding the options to the supported filesystems

Agreed, but that's not precisely what I meant. What I meant was add
a vfs helper function that filesystems call in their fill_super
method that handles these mount options in one place and sets
MS_TAGGED appropriately.

The helper only goes into supported filesystems, and avoids the need
to patch the parsing routines of all the filesystems and having
multiple ways of tracking the tagged state. i.e. a single line patch
to the filesystem is much easier to maintain...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>