On Thu, Jun 27, 2013 at 03:57:58PM -0500, Ben Myers wrote:
> On Thu, Jun 27, 2013 at 08:44:10AM +1000, Dave Chinner wrote:
> > On Wed, Jun 26, 2013 at 05:30:17PM -0400, Dwight Engen wrote:
> > > On Wed, 26 Jun 2013 12:09:24 +1000
> > > Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > On Mon, Jun 24, 2013 at 09:10:35AM -0400, Dwight Engen wrote:
> > > > > Should we just require that callers of bulkstat
> > > > > be in init_user_ns? Thoughts?
> > > >
> > > > This is one of the reasons why I want Eric to give us some idea of
> > > > how this is supposed to work - exactly how is backup and restore
> > > > supposed to be managed on a shared filesystem that is segmented up
> > > > into multiple namespace containers? We can talk about the
> > > > implementation all we like, but none of us have a clue to the policy
> > > > decisions that users will make that we need to support. Until we
> > > > have a clear idea on what policies we are supposed to be supporting,
> > > > the implementation will be ambiguous and compromised.
> > > >
> > > > e.g. If users are responsible for it, then bulkstat needs to filter
> > > > based on the current namespace. If management is responsible (i.e.
> > > > init_user_ns does backup/restore of ns-specific subtrees), then
> > > > bulkstat cannot filter and needs to reject calls from outside the
> > > > init_user_ns().
> > >
> > > Maybe we can have bulkstat always filter based on if the caller
> > > kuid_has_mapping(current_user_ns(), inode->i_uid)? That way a caller
> > > from init_user_ns can see them all, but callers from inside a userns
> > > will get a subset of inodes returned?
> >
> > We could do that, though it means bulkstat is going to be a *lot
> > slower* when called from within a user namespace environment. A
> > namespace might only have a few thousand files for backup, yet the
> > underlying filesystem might have tens of millions of inodes in it.
> > The bulkstat call now has to walk all of the inodes just to find the
> > few thousand that match the filter. And multiply that by the number
> > of namespaces all doing backups at 3am in the morning and you start
> > to get an idea of the scope of the problem....
>
> Ugh. That really doesn't map well onto bulkstat. If we wanted bulkstat to
> work well with namespaces, we might have to teach the filesystem a bit more
> about them in order to create the required indices per namespace. While a
> filter might get the job done in a pinch, wouldn't you really rather have an
> inobt? ;)

Absolutely not. :/

Filesystems can be bind mounted into multiple namespaces, you can
hard link across namespace boundaries, you can do all sorts of
things that result in inodes being shared between namespaces. You
can't have a per-namespace inobt when you can do this sort of thing
that the underlying filesystem may not even be aware of. Hell, you
can have the init namespace manipulate files for the user namespace,
and those manipulations aren't even aware they are happening inside

That doesn't even begin to touch on the major problems it introduces
into the on-disk format. e.g. how do you find, manage and validate
arbitrarily rooted allocated inode btrees? What AG do you put them
in? What happens when you have inodes in multiple AGs in a single
namespace? One tree per AG per namespace? What happens when you have
10000 namespaces and 1000 AGs? How do we find the right inobt(s)
when we do an allocation - they aren't in the AGI anymore? How do we
walk them on mount after an unclean shutdown? How do we allocate and
remove trees? What the hell is repair supposed to do with
corrupt/lost inode btrees?

It's a rat's nest, and it doesn't solve the basic problem of how
utilities that use bulkstat are supposed to behave.
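
To put a number on the "one tree per AG per namespace" question above,
here is a back-of-the-envelope sketch. It is plain arithmetic, not XFS
code: the function names are made up for illustration, and it assumes
the worst case where every namespace has inodes in every AG.

```python
# Hypothetical model only; none of these names exist in XFS.

def per_namespace_inobt_count(namespaces: int, ags: int) -> int:
    """Worst case: every namespace owns inodes in every AG, so each
    (namespace, AG) pair needs its own inode btree root."""
    return namespaces * ags


def per_ag_inobt_count(ags: int) -> int:
    """Existing layout: one inode btree per AG, found via that AG's AGI."""
    return ags


if __name__ == "__main__":
    namespaces, ags = 10_000, 1_000
    print("per-AG inobts today:       ", per_ag_inobt_count(ags))
    print("per-namespace inobt roots: ",
          per_namespace_inobt_count(namespaces, ags))
```

With the figures from the paragraph above (10000 namespaces, 1000 AGs)
that is ten million btree roots to locate, validate and repair, none of
which can hang off the AGI.
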
> To build that inobt you'd have to know whether a given directory was the root
> of a new namespace. Maybe implementable as some kind of flag, 'everything
> below this dir is part of its own namespace, put it in this inobt'. And then
> you'd have to have a way for bulkstat to know to look there, e.g. if the caller
> is not in init_user_ns and if the initial inode had the flag, use the inobt on
> that initial inode for bulkstat instead of the regular inobts. Crazy. Could
> be done.

And I could fly to the moon, too. But like per-namespace inode
btrees, I don't see that ever happening either...
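
For reference, the cost of the filter approach discussed earlier in the
thread can be sketched in userspace. This is a toy model, not kernel
code: Inode, UserNamespace and filtered_bulkstat() are hypothetical
names, and has_mapping() only loosely stands in for what
kuid_has_mapping() would decide, assuming each namespace maps a single
contiguous uid range.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Inode:
    ino: int
    uid: int  # owner as seen from the initial namespace (stand-in for a kuid)


@dataclass(frozen=True)
class UserNamespace:
    """Hypothetical uid map: one contiguous range of host uids."""
    first_uid: int
    count: int

    def has_mapping(self, uid: int) -> bool:
        # Assumption: the uid is visible iff it falls inside the mapped range.
        return self.first_uid <= uid < self.first_uid + self.count


def filtered_bulkstat(inodes: Iterable[Inode],
                      ns: UserNamespace) -> Tuple[List[Inode], int]:
    """Return (inodes visible to ns, number of inodes examined).

    There is no per-namespace index, so every inode has to be examined
    just to decide whether the caller is allowed to see it.
    """
    visible: List[Inode] = []
    examined = 0
    for inode in inodes:
        examined += 1
        if ns.has_mapping(inode.uid):
            visible.append(inode)
    return visible, examined


if __name__ == "__main__":
    # 100,000 inodes, of which every 100th belongs to the container.
    fs = [Inode(ino=i, uid=200_000 if i % 100 == 0 else 1_000)
          for i in range(100_000)]
    container = UserNamespace(first_uid=200_000, count=65_536)
    init_ns = UserNamespace(first_uid=0, count=2**32)

    for name, ns in (("container", container), ("init", init_ns)):
        visible, examined = filtered_bulkstat(fs, ns)
        print(f"{name}: returned {len(visible)} of {examined} examined")
```

The container gets back only its own inodes, but the walk still touches
every inode in the filesystem, and that work is repeated for each
namespace that runs a backup.
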