On Wed, 26 Jun 2013 12:09:24 +1000
Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Jun 24, 2013 at 09:10:35AM -0400, Dwight Engen wrote:
> > Hi Dave, all. Here is a v2 patch that I believe addresses the
> > previous comments, but I expect there to be more :) I think there
> > are a few more issues to sort out before this is ready, and I want
> > to add some tests to xfstests also.
> > I added permission checks for eofblocks in the ioctl code, but
> > I don't think they are enough. Just because an unprivileged
> > caller is in a group doesn't mean he can write to a file of that
> > group, and I don't know how we can check for that till we get the
> > inode in hand. Brian, if you or anyone else could comment on how
> > this should work for the regular user and write() ENOSPC cases
> > that'd be great.
> > The xfs code now uses inode->i_uid where possible instead of di_uid.
> > The remaining uses of di_uid are where the inode is being setup,
> > conversion to/from disk endianness, in dealing with quotas, and
> > bulkstat.
> Hi Dwight,
> I haven't looked at the code in detail; I'll just mention that I
> agree with Brian that we need to split this up into more patches.
> The conversions are subtle and easy to misunderstand, so small
> patches are good. I'd say at minimum separate patches are needed
> - permissions check on eof blocks
> - splitting eof blocks structure (I made comments on that to
> Brian's posted patch)
> - introduce XFS conversion helpers
> - implement VFS-XFS boundary conversion using helpers
> - XFS ioctl conversions (maybe multiple patches)
> - final patch to modify kconfig once everything is done
> > We do need to decide on the di_uid that comes back from bulkstat.
> > Right now it is returning on disk (== init_user_ns) uids. It looks
> > to me like xfsrestore is using the normal vfs routines (chown,
> > fchown, lchown) when restoring so that won't line up if the
> > xfsrestore is run in !init_user_ns.
> Right. This is a major problem because there's no guarantee that you
> are running restore in the same namespace as dump was run in. If you
> take the dump in the init_user_ns, then you can't restore it from
> inside the namespace container, and vice versa.
> > We could possibly convert to userns values
> > before returning them from the kernel, but I doubt that will work
> > well with the xfs quotas.
> Well, quotas are already converted to/from userns specific values
> before they get to XFS. Including project quotas, which I think at
> this point is wrong. We have no test cases for it, though, so I
> can't validate that it actually works as it is supposed to and that
> we don't break it inadvertently in the future.
> [ I'm still waiting on Eric to provide any information or scripts
> for how he tested this all worked when he did the conversions.... ]
> > Should we just require that callers of bulkstat
> > be in init_user_ns? Thoughts?
> This is one of the reasons why I want Eric to give us some idea of
> how this is supposed to work - exactly how is backup and restore
> supposed to be managed on a shared filesystem that is segmented up
> into multiple namespace containers? We can talk about the
> implementation all we like, but none of us have a clue to the policy
> decisions that users will make that we need to support. Until we
> have a clear idea on what policies we are supposed to be supporting,
> the implementation will be ambiguous and compromised.
> e.g. If users are responsible for it, then bulkstat needs to filter
> based on the current namespace. If management is responsible (i.e.
> init_user_ns does backup/restore of ns-specific subtrees), then
> bulkstat cannot filter and needs to reject calls from outside the
> init_user_ns.
Hi, in thinking about this a bit more, I'm not sure why bulkstat should
be any different from stat(2). If you stat a file not covered by your
current_user_ns(), it doesn't "filter" the file out and return ENOENT;
it just reports the id as overflowuid (65534). Filtering bulkstat also
has other (performance) problems, as you pointed out, and we cannot
easily filter the results after the fact either.
I don't think the user namespace is intended to subpartition the
set of inodes available from a filesystem, but only to remap the id
values. It then seems like the only policy that can reliably be
supported today is to back up/restore from init_user_ns; or, if you want
to back up/restore in a sub user ns, you must know ahead of time that you
will not encounter ids outside your mapping in the fs/backup media.
> But if we have to allow both configurations - which I think we have to
> because both cases are valid choices for a hosting provider to give
> users - then how are we supposed to implement/handle this?
> The same goes for the file handle interfaces - it's perfectly valid
> for a user to run a userspace NFS server (e.g. ganesha) inside a
> namespace container, but that will allow that process to provide
> unchecked remote access to the entire underlying filesystem, not
> just the namespace being exported. i.e. fs/export.c needs to be made
> namespace aware in some way so that there is a kernel wide policy
> for handle to file translations in namespace containers....
So I looked at open_by_handle_at(2) and it looks to me like it is no
different from regular open(2) wrt the user namespace checks. The code
paths become common at path_openat(). Note that there is no check for
kuid_has_mapping(current_user_ns(), inode->i_uid), so an inode doesn't
have to be covered by the current user ns in order to be opened; the
caller just has to pass the inode_permission() checks. I guess I
don't see the "unchecked" part, could you elaborate on that?