xfs
[Top] [All Lists]

Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for k

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
From: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx>
Date: Wed, 6 Jul 2016 01:09:11 -0700
Cc: linux-fsdevel@xxxxxxxxxxxxxxx, vishal.l.verma@xxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20160706045941.GC12670@dastard>
References: <146612627129.12839.3827886950949809165.stgit@xxxxxxxxxxxxxxxx> <146612635526.12839.13865365567940815077.stgit@xxxxxxxxxxxxxxxx> <20160706045941.GC12670@dastard>
User-agent: Mutt/1.5.24 (2015-08-30)
On Wed, Jul 06, 2016 at 02:59:41PM +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > On a filesystem with both reflink and reverse mapping enabled, it's
> > possible to have multiple rmap records referring to the same blocks on
> > disk.  When overlapping intervals are possible, querying a classic
> > btree to find all records intersecting a given interval is inefficient
> > because we cannot use the left side of the search interval to filter
> > out non-matching records the same way that we can use the existing
> > btree key to filter out records coming after the right side of the
> > search interval.  This will become important once we want to use the
> > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> 
> I thought I didn't hav emuch to say about this, but then I started
> writing down all my questions.....

I'd have been surprised if you didn't have much to say--

*I* certainly had plenty to say about this code when I dug back into it
last week to make XFS_BTREE_ROOT_IN_INODE work for level == 0 roots.
Most of it was unprintable. :P

> > @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct 
> > xfs_btree_cur *cur)
> >     return XFS_BTREE_SBLOCK_LEN;
> >  }
> >  
> > +/* Return size of btree block keys for this btree instance. */
> > +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
> > +{
> > +   size_t                  len;
> > +
> > +   len = cur->bc_ops->key_len;
> > +   if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +           len *= 2;
> > +   return len;
> > +}
> 
> So there's magic here. Why can't the cur->bc_ops->key_len be set
> appropriately when it isi initialised?
> 
> >  /*
> >   * Return size of btree block pointers for this btree instance.
> >   */
> > @@ -475,7 +515,19 @@ xfs_btree_key_offset(
> >     int                     n)
> >  {
> >     return xfs_btree_block_len(cur) +
> > -           (n - 1) * cur->bc_ops->key_len;
> > +           (n - 1) * xfs_btree_key_len(cur);
> > +}
> 
> because this effectively means the key length and offsets for
> a btree with the XFS_BTREE_OPS_OVERLAPPING flag set is *always*
> cur->bc_ops->key_len * 2.

I designed the code around the idea that in going from a regular btree
to an overlapped btree, the key_len stays the same but the number of
keys doubles.  I can change everything such that key_len doubles but
the number of keys stays the same.  For the few places where we
actually update the low and high keys separately (basically updkeys)
we'll have to be a little careful with key_len.

> > +
> > +/*
> > + * Calculate offset of the n-th high key in a btree block.
> > + */
> > +STATIC size_t
> > +xfs_btree_high_key_offset(
> > +   struct xfs_btree_cur    *cur,
> > +   int                     n)
> > +{
> > +   return xfs_btree_block_len(cur) +
> > +           (n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
> >  }
> 
> And this is the only case where we use a "half key" length to pull
> the offset of the high key. Wouldn't it be better to be explicit
> about the high key offset rather than encode magic numbers to infer
> that the "overlapping key is really two key lengths with the high
> key at plus one key len". IMO, this is better:
> 
> xfs_btree_high_key_offset(
>       struct xfs_btree_cur    *cur,
>       int                     n)
> {
>       ASSERT(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING);
>       return xfs_btree_block_len(cur) +
>                (n - 1) * cur->bc_ops->key_len +
>                offset_of(struct xfs_btree_double_key, high);
> }
> 
> It means there are much fewer code changes needed for supporting
> the XFS_BTREE_OPS_OVERLAPPING flag, too.

Sure.

> > +STATIC void
> > +xfs_btree_find_leaf_keys(
> > +   struct xfs_btree_cur    *cur,
> > +   struct xfs_btree_block  *block,
> > +   union xfs_btree_key     *low,
> > +   union xfs_btree_key     *high)
> > +{
> > +   int                     n;
> > +   union xfs_btree_rec     *rec;
> > +   union xfs_btree_key     max_hkey;
> > +   union xfs_btree_key     hkey;
> > +
> > +   rec = xfs_btree_rec_addr(cur, 1, block);
> > +   cur->bc_ops->init_key_from_rec(low, rec);
> > +
> > +   if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +           return;
> 
> When I see conditionals like this, it makes me want to add
> a btree specific method. i.e.
> 
>       bc_ops->find_leaf_keys()
>       bc_ops->find_node_keys()
> 
> and we hook them up to generic functions that don't require
> checks against feature flags.
> 
> i.e:
> 
> xfs_btree_find_leaf_low_key()
> {
>       rec = xfs_btree_rec_addr(cur, 1, block);
>       cur->bc_ops->init_key_from_rec(low, rec);
> }
> 
> xfs_btree_find_leaf_low_high_keys()
> {
>       xfs_btree_find_leaf_low_key();
> 
>       /*
>        * high key finding code here, which is the same function
>        * for both keys and pointers
>        */
> }

The thing is, there's nothing in xfs_btree_find_*_keys that's specifc
to a btree.  I rather like only having to set one thing in the
btree_ops to get the overlapped mode, rather than having to remember
to make sure that such-and-such-functions are paired with
such-and-such flags.

Well, maybe it wouldn't be so bad.  I think there's only three
functions that need this treatment.

> .....
> 
> > +/*
> > + * Update parental low & high keys from some block all the way back to the
> > + * root of the btree.
> > + */
> > +STATIC int
> > +__xfs_btree_updkeys(
> 
> I kept getting confused by xfs_btree_updkey() and
> xfs_btree_updkeys(). Can we chose a better name for this parent key
> update?

I /think/ I want to collapse them into a single ->updkeys() function.

And maybe rename to update_parent_keys() ?

> > +   struct xfs_btree_cur    *cur,
> > +   int                     level,
> > +   struct xfs_btree_block  *block,
> > +   struct xfs_buf          *bp0,
> > +   bool                    force_all)
> > +{
> > +   union xfs_btree_key     lkey;   /* keys from current level */
> > +   union xfs_btree_key     hkey;
> > +   union xfs_btree_key     *nlkey; /* keys from the next level up */
> > +   union xfs_btree_key     *nhkey;
> > +   struct xfs_buf          *bp;
> > +   int                     ptr = -1;
> > +
> > +   if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +           return 0;
> 
> And, again, it's a probably better to use a btree op callout for
> this, especially when you've added this to xfs_btree_updkey():
> 
> > @@ -1893,6 +2132,9 @@ xfs_btree_updkey(
> >     union xfs_btree_key     *kp;
> >     int                     ptr;
> >  
> > +   if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +           return 0;
> > +
> >     XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> >     XFS_BTREE_TRACE_ARGIK(cur, level, keyp);
> 
> i.e. one or the other "updkey" does something, but not both.
> Extremely confusing to see both called but then only one do
> anything.
> 
> [back to __xfs_btree_updkeys()]
> 
> > +
> > +   if (level + 1 >= cur->bc_nlevels)
> > +           return 0;
> > +
> > +   trace_xfs_btree_updkeys(cur, level, bp0);
> > +
> > +   if (level == 0)
> > +           xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
> > +   else
> > +           xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
> 
> And this code fragment is repeated in many places, so i think a
> helper is warranted for this. That also reminds me - the "find" in
> the name is confusing - it's not "finding" as much as it is
> "getting" the low and high key values from the current block.
> 
> It's especially confusing when you do this:
> 
> > @@ -1970,7 +2212,8 @@ xfs_btree_update(
> >                                         ptr, LASTREC_UPDATE);
> >     }
> >  
> > -   /* Updating first rec in leaf. Pass new key value up to our parent. */
> > +   /* Pass new key value up to our parent. */
> > +   xfs_btree_updkeys(cur, 0);
> >     if (ptr == 1) {
> >             union xfs_btree_key     key;
> 
> You're throwing away the error from xfs_btree_updkeys() at, AFAICT,
> all call sites. This update can fail, so I suspect this needs to
> check and handle the update.

Yes, that's a bug, albeit a theoretical one since updkeys can't fail
at the moment.

(I fixed this one already in my djwong-wtf tree.)

> > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> >             rkp = &key;
> >     }
> >  
> > -   /* Update the parent key values of right. */
> > +   /* Update the parent key values of left and right. */
> > +   xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > +   xfs_btree_updkeys(cur, level);
> >     error = xfs_btree_updkey(cur, rkp, level + 1);
> >     if (error)
> >             goto error0;
> 
> Remember what I said above about xfs_btree_updkeys/xfs_btree_updkey
> being confusing? Here we have 3 different key update functions, all
> doing different stuff, taking different parameters. None of the code
> is consistent in how these updates are done - they are all different
> combinations of these functions, so I'm not sure how we are supposed
> to verify the correct updates are being done now or in the future.
> 
> How can we hide this complexity from the generic btree code?

I refactored this mess after bfoster complained, but even after that
there's still a conditional.  We need to updkeys the right block
regardless, but we only need to updkeys the left block if it's an
overlapped tree, which leads to this:

/*
 * Using a temporary cursor, update the parent key values of
 * the
 * block on the left.
 */
error = xfs_btree_dup_cursor(cur, &tcur);
if (error)
        goto error0;
i = xfs_btree_firstrec(tcur, level);
XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, error0);

error = xfs_btree_decrement(tcur, level, &i);
if (error)
        goto error1;

/* Update left and right parent pointers */
error = xfs_btree_updkeys(cur, level);
if (error)
        goto error1;
error = xfs_btree_updkeys(tcur, level);
if (error)
        goto error1;
error = xfs_btree_updkey(cur, rkp, level + 1);
if (error)
        goto error0;

xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);

Still yucky, will have to meditate on this further...

> > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> >     if (error)
> >             goto error1;
> >  
> > +   /* Update left and right parent pointers */
> > +   xfs_btree_updkeys(cur, level);
> > +   xfs_btree_updkeys(tcur, level);
> >     error = xfs_btree_updkey(tcur, rkp, level + 1);
> >     if (error)
> >             goto error1;
> 
> Different.
> 
> > @@ -2499,6 +2746,10 @@ __xfs_btree_split(
> >             xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
> >             xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
> >     }
> > +
> > +   /* Update the left block's keys... */
> > +   xfs_btree_updkeys(cur, level);
> 
> different...
> 
> > @@ -2806,27 +3057,27 @@ xfs_btree_new_root(
> >             bp = lbp;
> >             nptr = 2;
> >     }
> > +
> >     /* Fill in the new block's btree header and log it. */
> >     xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
> >     xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
> >     ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
> >                     !xfs_btree_ptr_is_null(cur, &rptr));
> > -
> >     /* Fill in the key data in the new root. */
> >     if (xfs_btree_get_level(left) > 0) {
> > -           xfs_btree_copy_keys(cur,
> > +           xfs_btree_find_node_keys(cur, left,
> >                             xfs_btree_key_addr(cur, 1, new),
> > -                           xfs_btree_key_addr(cur, 1, left), 1);
> > -           xfs_btree_copy_keys(cur,
> > +                           xfs_btree_high_key_addr(cur, 1, new));
> > +           xfs_btree_find_node_keys(cur, right,
> >                             xfs_btree_key_addr(cur, 2, new),
> > -                           xfs_btree_key_addr(cur, 1, right), 1);
> > +                           xfs_btree_high_key_addr(cur, 2, new));
> 
> And this took me ages to work out - you replaced
> xfs_btree_copy_keys() with xfs_btree_find_node_keys() which means
> the fact that we are copying a key from one block to antoher has
> been lost.

That's because we're not strictly copying keys from left and right
into the root anymore.  Yes, the low part of the key is a straight
copy, but we have to iterate left and right, respectively, to
calculate the high keys that go in keys 1 & 2 in the root block.
The high key of a given tree node is the maximum of all the keys or
records in that node, or put another way, it's the highest key
reachable in that subtree...

> It wasn't until I realised that
> xfs_btree_find_node_keys() was writing directly into the new block
> record that it was an equivalent operation to a copy.
> 
> This is why I don't like the name xfs_btree_find_*_keys() - when it
> is used like this it badly obfuscates what operation is being
> performed - it's most definitely not a find operation being
> performed. i.e. xfs_btree_copy_keys() documents the operation in
> an obvious and straight forward manner, the new code takes time and
> thought to decipher.
> 
> Perhaps you could move it all to inside xfs_btree_copy_keys(), so
> the complexity is hidden from the higher level  btree manipulation
> functions...

...so it's not a strict copy.

> > +/* Copy a double key into a btree block. */
> > +static void
> > +xfs_btree_copy_double_keys(
> > +   struct xfs_btree_cur    *cur,
> > +   int                     ptr,
> > +   struct xfs_btree_block  *block,
> > +   struct xfs_btree_double_key     *key)
> > +{
> > +   memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
> > +                   cur->bc_ops->key_len);
> > +
> > +   if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +           memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
> > +                           cur->bc_ops->key_len);
> > +}
> 
> This should be located next to xfs_btree_copy_keys().
> 
> >     /* If we inserted at the start of a block, update the parents' keys. */

BTW, I replaced the above comment with:

/*
 * If we just inserted into a new tree block, we have to
 * recalculate nkey here because nkey is out of date.
 *
 * Otherwise we're just updating an existing block (having
 * shoved some records into the new tree block), so use the
 * regular key update mechanism.
 */

> > +   if (ncur && bp->b_bn != old_bn) {
> > +           /*
> > +            * We just inserted into a new tree block, which means that
> > +            * the key for the block is in nkey, not the tree.
> > +            */
> > +           if (level == 0)
> > +                   xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > +                                   &nkey.high);
> > +           else
> > +                   xfs_btree_find_node_keys(cur, block, &nkey.low,
> > +                                   &nkey.high);
> > +   } else {
> > +           /* Updating the left block, do it the standard way. */
> > +           xfs_btree_updkeys(cur, level);
> > +   }
> > +
> >     if (optr == 1) {
> > -           error = xfs_btree_updkey(cur, key, level + 1);
> > +           error = xfs_btree_updkey(cur, &key->low, level + 1);
> >             if (error)
> >                     goto error0;
> >     }
> 
> This is another of those "huh, what" moments I had with all the
> different _updkey functions....

Ditto.  It took me a long time to figure out what the original code
was doing here, and therefore what was the correct thing to do for the
overlapped btree.

> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b99c018..a5ec6c7 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> >     size_t  key_len;
> >     size_t  rec_len;
> >  
> > +   /* flags */
> > +   uint    flags;
> > +
> .....
> > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> >  #endif
> >  };
> >  
> > +/* btree ops flags */
> > +#define XFS_BTREE_OPS_OVERLAPPING  (1<<0)  /* overlapping intervals */
> > +
> 
> why did you put this in the struct btree_ops  and not in the
> btree cursor ->bc_flags field like all the other btree specific
> customisations like:
> 
> /* cursor flags */
> #define XFS_BTREE_LONG_PTRS             (1<<0)  /* pointers are 64bits long */
> #define XFS_BTREE_ROOT_IN_INODE         (1<<1)  /* root may be variable size 
> */
> #define XFS_BTREE_LASTREC_UPDATE        (1<<2)  /* track last rec externally 
> */
> #define XFS_BTREE_CRC_BLOCKS            (1<<3)  /* uses extended btree blocks 
> */
> 
> i.e. we should have all the structural/behavioural flags in the one
> place, not split across different structures....

At the time I thought that it would be a good idea in the long run to
move the btree flags that can't be changed without changes to the
btree_ops into a btree_ops specific flags field.  At the time I didn't
know that I'd end up adding only one flag or that the only btree ops
change I'd need was init_high_key_from_rec, so when I took a second
look last week I put eliminating the flags field on the todo list.

Ok, enough for one night. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>