On Wed, Jul 06, 2016 at 02:59:41PM +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > On a filesystem with both reflink and reverse mapping enabled, it's
> > possible to have multiple rmap records referring to the same blocks on
> > disk. When overlapping intervals are possible, querying a classic
> > btree to find all records intersecting a given interval is inefficient
> > because we cannot use the left side of the search interval to filter
> > out non-matching records the same way that we can use the existing
> > btree key to filter out records coming after the right side of the
> > search interval. This will become important once we want to use the
> > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
>
> I thought I didn't hav emuch to say about this, but then I started
> writing down all my questions.....
I'd have been surprised if you didn't have much to say--
*I* certainly had plenty to say about this code when I dug back into it
last week to make XFS_BTREE_ROOT_IN_INODE work for level == 0 roots.
Most of it was unprintable. :P
> > @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct
> > xfs_btree_cur *cur)
> > return XFS_BTREE_SBLOCK_LEN;
> > }
> >
> > +/* Return size of btree block keys for this btree instance. */
> > +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
> > +{
> > + size_t len;
> > +
> > + len = cur->bc_ops->key_len;
> > + if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > + len *= 2;
> > + return len;
> > +}
>
> So there's magic here. Why can't the cur->bc_ops->key_len be set
> appropriately when it isi initialised?
>
> > /*
> > * Return size of btree block pointers for this btree instance.
> > */
> > @@ -475,7 +515,19 @@ xfs_btree_key_offset(
> > int n)
> > {
> > return xfs_btree_block_len(cur) +
> > - (n - 1) * cur->bc_ops->key_len;
> > + (n - 1) * xfs_btree_key_len(cur);
> > +}
>
> because this effectively means the key length and offsets for
> a btree with the XFS_BTREE_OPS_OVERLAPPING flag set is *always*
> cur->bc_ops->key_len * 2.
I designed the code around the idea that in going from a regular btree
to an overlapped btree, the key_len stays the same but the number of
keys doubles. I can change everything such that key_len doubles but
the number of keys stays the same. For the few places where we
actually update the low and high keys separately (basically updkeys)
we'll have to be a little careful with key_len.
> > +
> > +/*
> > + * Calculate offset of the n-th high key in a btree block.
> > + */
> > +STATIC size_t
> > +xfs_btree_high_key_offset(
> > + struct xfs_btree_cur *cur,
> > + int n)
> > +{
> > + return xfs_btree_block_len(cur) +
> > + (n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
> > }
>
> And this is the only case where we use a "half key" length to pull
> the offset of the high key. Wouldn't it be better to be explicit
> about the high key offset rather than encode magic numbers to infer
> that the "overlapping key is really two key lengths with the high
> key at plus one key len". IMO, this is better:
>
> xfs_btree_high_key_offset(
> struct xfs_btree_cur *cur,
> int n)
> {
> ASSERT(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING);
> return xfs_btree_block_len(cur) +
> (n - 1) * cur->bc_ops->key_len +
> offset_of(struct xfs_btree_double_key, high);
> }
>
> It means there are much fewer code changes needed for supporting
> the XFS_BTREE_OPS_OVERLAPPING flag, too.
Sure.
> > +STATIC void
> > +xfs_btree_find_leaf_keys(
> > + struct xfs_btree_cur *cur,
> > + struct xfs_btree_block *block,
> > + union xfs_btree_key *low,
> > + union xfs_btree_key *high)
> > +{
> > + int n;
> > + union xfs_btree_rec *rec;
> > + union xfs_btree_key max_hkey;
> > + union xfs_btree_key hkey;
> > +
> > + rec = xfs_btree_rec_addr(cur, 1, block);
> > + cur->bc_ops->init_key_from_rec(low, rec);
> > +
> > + if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > + return;
>
> When I see conditionals like this, it makes me want to add
> a btree specific method. i.e.
>
> bc_ops->find_leaf_keys()
> bc_ops->find_node_keys()
>
> and we hook them up to generic functions that don't require
> checks against feature flags.
>
> i.e:
>
> xfs_btree_find_leaf_low_key()
> {
> rec = xfs_btree_rec_addr(cur, 1, block);
> cur->bc_ops->init_key_from_rec(low, rec);
> }
>
> xfs_btree_find_leaf_low_high_keys()
> {
> xfs_btree_find_leaf_low_key();
>
> /*
> * high key finding code here, which is the same function
> * for both keys and pointers
> */
> }
The thing is, there's nothing in xfs_btree_find_*_keys that's specifc
to a btree. I rather like only having to set one thing in the
btree_ops to get the overlapped mode, rather than having to remember
to make sure that such-and-such-functions are paired with
such-and-such flags.
Well, maybe it wouldn't be so bad. I think there's only three
functions that need this treatment.
> .....
>
> > +/*
> > + * Update parental low & high keys from some block all the way back to the
> > + * root of the btree.
> > + */
> > +STATIC int
> > +__xfs_btree_updkeys(
>
> I kept getting confused by xfs_btree_updkey() and
> xfs_btree_updkeys(). Can we chose a better name for this parent key
> update?
I /think/ I want to collapse them into a single ->updkeys() function.
And maybe rename to update_parent_keys() ?
> > + struct xfs_btree_cur *cur,
> > + int level,
> > + struct xfs_btree_block *block,
> > + struct xfs_buf *bp0,
> > + bool force_all)
> > +{
> > + union xfs_btree_key lkey; /* keys from current level */
> > + union xfs_btree_key hkey;
> > + union xfs_btree_key *nlkey; /* keys from the next level up */
> > + union xfs_btree_key *nhkey;
> > + struct xfs_buf *bp;
> > + int ptr = -1;
> > +
> > + if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > + return 0;
>
> And, again, it's a probably better to use a btree op callout for
> this, especially when you've added this to xfs_btree_updkey():
>
> > @@ -1893,6 +2132,9 @@ xfs_btree_updkey(
> > union xfs_btree_key *kp;
> > int ptr;
> >
> > + if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > + return 0;
> > +
> > XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > XFS_BTREE_TRACE_ARGIK(cur, level, keyp);
>
> i.e. one or the other "updkey" does something, but not both.
> Extremely confusing to see both called but then only one do
> anything.
>
> [back to __xfs_btree_updkeys()]
>
> > +
> > + if (level + 1 >= cur->bc_nlevels)
> > + return 0;
> > +
> > + trace_xfs_btree_updkeys(cur, level, bp0);
> > +
> > + if (level == 0)
> > + xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
> > + else
> > + xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
>
> And this code fragment is repeated in many places, so i think a
> helper is warranted for this. That also reminds me - the "find" in
> the name is confusing - it's not "finding" as much as it is
> "getting" the low and high key values from the current block.
>
> It's especially confusing when you do this:
>
> > @@ -1970,7 +2212,8 @@ xfs_btree_update(
> > ptr, LASTREC_UPDATE);
> > }
> >
> > - /* Updating first rec in leaf. Pass new key value up to our parent. */
> > + /* Pass new key value up to our parent. */
> > + xfs_btree_updkeys(cur, 0);
> > if (ptr == 1) {
> > union xfs_btree_key key;
>
> You're throwing away the error from xfs_btree_updkeys() at, AFAICT,
> all call sites. This update can fail, so I suspect this needs to
> check and handle the update.
Yes, that's a bug, albeit a theoretical one since updkeys can't fail
at the moment.
(I fixed this one already in my djwong-wtf tree.)
> > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> > rkp = &key;
> > }
> >
> > - /* Update the parent key values of right. */
> > + /* Update the parent key values of left and right. */
> > + xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > + xfs_btree_updkeys(cur, level);
> > error = xfs_btree_updkey(cur, rkp, level + 1);
> > if (error)
> > goto error0;
>
> Remember what I said above about xfs_btree_updkeys/xfs_btree_updkey
> being confusing? Here we have 3 different key update functions, all
> doing different stuff, taking different parameters. None of the code
> is consistent in how these updates are done - they are all different
> combinations of these functions, so I'm not sure how we are supposed
> to verify the correct updates are being done now or in the future.
>
> How can we hide this complexity from the generic btree code?
I refactored this mess after bfoster complained, but even after that
there's still a conditional. We need to updkeys the right block
regardless, but we only need to updkeys the left block if it's an
overlapped tree, which leads to this:
/*
* Using a temporary cursor, update the parent key values of
* the
* block on the left.
*/
error = xfs_btree_dup_cursor(cur, &tcur);
if (error)
goto error0;
i = xfs_btree_firstrec(tcur, level);
XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, error0);
error = xfs_btree_decrement(tcur, level, &i);
if (error)
goto error1;
/* Update left and right parent pointers */
error = xfs_btree_updkeys(cur, level);
if (error)
goto error1;
error = xfs_btree_updkeys(tcur, level);
if (error)
goto error1;
error = xfs_btree_updkey(cur, rkp, level + 1);
if (error)
goto error0;
xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
Still yucky, will have to meditate on this further...
> > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> > if (error)
> > goto error1;
> >
> > + /* Update left and right parent pointers */
> > + xfs_btree_updkeys(cur, level);
> > + xfs_btree_updkeys(tcur, level);
> > error = xfs_btree_updkey(tcur, rkp, level + 1);
> > if (error)
> > goto error1;
>
> Different.
>
> > @@ -2499,6 +2746,10 @@ __xfs_btree_split(
> > xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
> > xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
> > }
> > +
> > + /* Update the left block's keys... */
> > + xfs_btree_updkeys(cur, level);
>
> different...
>
> > @@ -2806,27 +3057,27 @@ xfs_btree_new_root(
> > bp = lbp;
> > nptr = 2;
> > }
> > +
> > /* Fill in the new block's btree header and log it. */
> > xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
> > xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
> > ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
> > !xfs_btree_ptr_is_null(cur, &rptr));
> > -
> > /* Fill in the key data in the new root. */
> > if (xfs_btree_get_level(left) > 0) {
> > - xfs_btree_copy_keys(cur,
> > + xfs_btree_find_node_keys(cur, left,
> > xfs_btree_key_addr(cur, 1, new),
> > - xfs_btree_key_addr(cur, 1, left), 1);
> > - xfs_btree_copy_keys(cur,
> > + xfs_btree_high_key_addr(cur, 1, new));
> > + xfs_btree_find_node_keys(cur, right,
> > xfs_btree_key_addr(cur, 2, new),
> > - xfs_btree_key_addr(cur, 1, right), 1);
> > + xfs_btree_high_key_addr(cur, 2, new));
>
> And this took me ages to work out - you replaced
> xfs_btree_copy_keys() with xfs_btree_find_node_keys() which means
> the fact that we are copying a key from one block to antoher has
> been lost.
That's because we're not strictly copying keys from left and right
into the root anymore. Yes, the low part of the key is a straight
copy, but we have to iterate left and right, respectively, to
calculate the high keys that go in keys 1 & 2 in the root block.
The high key of a given tree node is the maximum of all the keys or
records in that node, or put another way, it's the highest key
reachable in that subtree...
> It wasn't until I realised that
> xfs_btree_find_node_keys() was writing directly into the new block
> record that it was an equivalent operation to a copy.
>
> This is why I don't like the name xfs_btree_find_*_keys() - when it
> is used like this it badly obfuscates what operation is being
> performed - it's most definitely not a find operation being
> performed. i.e. xfs_btree_copy_keys() documents the operation in
> an obvious and straight forward manner, the new code takes time and
> thought to decipher.
>
> Perhaps you could move it all to inside xfs_btree_copy_keys(), so
> the complexity is hidden from the higher level btree manipulation
> functions...
...so it's not a strict copy.
> > +/* Copy a double key into a btree block. */
> > +static void
> > +xfs_btree_copy_double_keys(
> > + struct xfs_btree_cur *cur,
> > + int ptr,
> > + struct xfs_btree_block *block,
> > + struct xfs_btree_double_key *key)
> > +{
> > + memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
> > + cur->bc_ops->key_len);
> > +
> > + if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > + memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
> > + cur->bc_ops->key_len);
> > +}
>
> This should be located next to xfs_btree_copy_keys().
>
> > /* If we inserted at the start of a block, update the parents' keys. */
BTW, I replaced the above comment with:
/*
* If we just inserted into a new tree block, we have to
* recalculate nkey here because nkey is out of date.
*
* Otherwise we're just updating an existing block (having
* shoved some records into the new tree block), so use the
* regular key update mechanism.
*/
> > + if (ncur && bp->b_bn != old_bn) {
> > + /*
> > + * We just inserted into a new tree block, which means that
> > + * the key for the block is in nkey, not the tree.
> > + */
> > + if (level == 0)
> > + xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > + &nkey.high);
> > + else
> > + xfs_btree_find_node_keys(cur, block, &nkey.low,
> > + &nkey.high);
> > + } else {
> > + /* Updating the left block, do it the standard way. */
> > + xfs_btree_updkeys(cur, level);
> > + }
> > +
> > if (optr == 1) {
> > - error = xfs_btree_updkey(cur, key, level + 1);
> > + error = xfs_btree_updkey(cur, &key->low, level + 1);
> > if (error)
> > goto error0;
> > }
>
> This is another of those "huh, what" moments I had with all the
> different _updkey functions....
Ditto. It took me a long time to figure out what the original code
was doing here, and therefore what was the correct thing to do for the
overlapped btree.
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b99c018..a5ec6c7 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> > size_t key_len;
> > size_t rec_len;
> >
> > + /* flags */
> > + uint flags;
> > +
> .....
> > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> > #endif
> > };
> >
> > +/* btree ops flags */
> > +#define XFS_BTREE_OPS_OVERLAPPING (1<<0) /* overlapping intervals */
> > +
>
> why did you put this in the struct btree_ops and not in the
> btree cursor ->bc_flags field like all the other btree specific
> customisations like:
>
> /* cursor flags */
> #define XFS_BTREE_LONG_PTRS (1<<0) /* pointers are 64bits long */
> #define XFS_BTREE_ROOT_IN_INODE (1<<1) /* root may be variable size
> */
> #define XFS_BTREE_LASTREC_UPDATE (1<<2) /* track last rec externally
> */
> #define XFS_BTREE_CRC_BLOCKS (1<<3) /* uses extended btree blocks
> */
>
> i.e. we should have all the structural/behavioural flags in the one
> place, not split across different structures....
At the time I thought that it would be a good idea in the long run to
move the btree flags that can't be changed without changes to the
btree_ops into a btree_ops specific flags field. At the time I didn't
know that I'd end up adding only one flag or that the only btree ops
change I'd need was init_high_key_from_rec, so when I took a second
look last week I put eliminating the flags field on the todo list.
Ok, enough for one night. :)
--D
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
|