On Thu, Jul 25, 2013 at 11:49:46AM -0400, Dwight Engen wrote:
> We need to check that userspace callers can only truncate preallocated
> blocks from files they have write access to to prevent them from prematurley
> reclaiming blocks from another user. The internal reclaimer will not specify
> the internal XFS_KEOF_FLAGS_PERM_CHECK flag, but callers from userspace will
> have it forced on.
> Add check for read-only filesystem to free eofblocks ioctl.
OK, so let me preface this by saying I understand a lot more about
user namespaces that I did a week ago...
Firstly, this ioctls allows any user to call this function with
arbitrary uid/gid/prid values here. Regardless of user namespaces, I
think that the only UID/GID that should be allowed to be passed for
a user context is their own UID/GID. There's no point in allowing a
scan if the inode_permission()/uid/gid matching checks are always
going to fail.
Secondly, because a user can issue an arbitrary number of concurrent
scans, I'd suggest we serialise all user scans through a per-mount
mutex. i.e. there can only be one user scan in progress at a time.
For users with capable(CAP_SYS_ADMIN), we don't need to serialise
them as we let root in the init userns shoot themselves in the foot
all they want.
Thirdly, project quotas are not really a user controlled resource
and so I don't think we want users calling this ioctl to trim
project quotas. Indeed, the files in the project that need trimming
may not even belong to the user running the scan, and hence the
user will never be allowed to trim the EOF blocks.
Further, project quotas might underly the namespace container and be
used for limiting the disk space the namespace container uses. In
which case, we don't even want access to the project ID scanning
from within the namespace.
Because of this, I'd suggest that project ID scans need a
"capable(CAP_SYS_ADMIN)" check on them. This means a project ID scan
can only be run from the init_userns - it cannot be done from within
the userns container at all.
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index ed35584..a80f38c 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1247,6 +1247,10 @@ xfs_inode_free_eofblocks(
> if (!xfs_inode_match_id(ip, eofb))
> return 0;
> + if ((eofb->eof_flags & XFS_KEOF_FLAGS_PERM_CHECK) &&
> + inode_permission(VFS_I(ip), MAY_WRITE))
> + return 0;
Lastly, what happens if this inode_permission() call occurs in the
context of a kworker thread. What userns does an anonymous kworker
thread run in? It's the init userns, right? so:
ns = current_user_ns();
return ns_capable(ns, cap) &&
If we are running in the init_userns as a kernel thread, the
ns_capable() check will return true, and there's always a mapping in
the init namespace so kuid_has_mapping() will return true. Hence
we'll always have write permission to every inode we check,
regardless of the XFS_KEOF_FLAGS_PERM_CHECK flag.
IOWs, this permission check is useless if we run the scan via a
workqueue. Hence we need to be *very careful* with the way we
execute scans and so this needs big, loud comments around it to
remind us to be careful in the future.
> +#define XFS_KEOF_FLAGS_PERM_CHECK (1 << 31) /* check can write inode */
> +#if XFS_KEOF_FLAGS_PERM_CHECK & XFS_EOF_FLAGS_VALID
> +#error "Internal XFS_KEOF_FLAGS_PERM_CHECK duplicated bit from
BUILD_BUG_ON() is the preferred method of doing this. Say, in