
Re: [PATCH] xfs: add a shrinker to background inode reclaim

To: Alex Elder <aelder@xxxxxxx>
Subject: Re: [PATCH] xfs: add a shrinker to background inode reclaim
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Fri, 30 Apr 2010 12:02:24 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <1272539253.3221.64.camel@doink>
References: <1272429248-5269-1-git-send-email-david@xxxxxxxxxxxxx> <1272539253.3221.64.camel@doink>
User-agent: Mutt/1.5.20 (2009-06-14)
On Thu, Apr 29, 2010 at 06:07:33AM -0500, Alex Elder wrote:
> On Wed, 2010-04-28 at 14:34 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > On low memory boxes or those with highmem, the kernel can OOM before
> > background inode reclaim via xfssyncd kicks in. Add a shrinker to run
> > inode reclaim so that inode reclaim is expedited when memory is low.
> > 
> > This is more complex than it needs to be because the VM folk don't
> > want a context added to the shrinker infrastructure. Hence we need
> > to add a global list of XFS mount structures so the shrinker can
> > traverse them.
> 
> I have some comments; one of them suggests a fairness
> change.  But I know this fix is very important and
> we're short on time so I'm going to pull this in as-is
> and ask Linus to pull it as well.  

I tried being fair, but it didn't work, and I have bigger things to spend
my time on (like getting context-based shrinkers accepted ;).
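
For illustration, a context-aware shrinker along those lines might embed the
shrinker in the per-mount structure and recover the mount with container_of(),
so no global mount list is needed. This is only a sketch of the idea; the extra
"struct shrinker *" argument, the m_inode_shrink field and the count helper are
assumptions, not the API or code that existed at the time.

/*
 * Hypothetical context-based shrinker: the shrinker is embedded in the
 * xfs_mount, so the callback can recover its mount directly instead of
 * walking a global list of mounts.
 */
static int
xfs_reclaim_inode_shrink(
	struct shrinker	*shrink,
	int		nr_to_scan,
	gfp_t		gfp_mask)
{
	struct xfs_mount	*mp = container_of(shrink, struct xfs_mount,
						   m_inode_shrink);

	if (nr_to_scan) {
		if (!(gfp_mask & __GFP_FS))
			return -1;
		xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
				XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
	}

	/* report how many reclaimable inodes remain on this mount */
	return xfs_reclaimable_inode_count(mp);	/* hypothetical helper */
}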

> . . .
> > @@ -134,7 +135,7 @@ restart:
> >             if (error == EFSCORRUPTED)
> >                     break;
> >  
> > -   } while (1);
> > +   } while ((*nr_to_scan)--);
> 
> Do you ever plan to return a negative value in *nr_to_scan?

No.

> > @@ -165,14 +169,18 @@ xfs_inode_ag_iterator(
> >                     continue;
> >             }
> >             error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
> > -                                           exclusive);
> > +                                           exclusive, &nr);
> >             xfs_perag_put(pag);
> >             if (error) {
> >                     last_error = error;
> >                     if (error == EFSCORRUPTED)
> >                             break;
> >             }
> > +           if (nr <= 0)
> > +                   break;
> 
> Why check for negative here?  It will never go negative
> as currently used.  (Defensive coding is a reasonable
> explanation.)

A negative value is invalid - it indicates an overflow - so abort on
the termination case (0) or an invalid value (< 0). So yes, I guess you
could say "defensive coding". ;)

> > +/*
> > + * Shrinker infrastructure.
> > + *
> > + * This is all far more complex than it needs to be. It adds a global list of
> > + * mounts because the shrinkers can only call a global context. We need to make
> > + * the shrinkers pass a context to avoid the need for global state.
> > + */
> > +static LIST_HEAD(xfs_mount_list);
> > +static struct rw_semaphore xfs_mount_list_lock;
> > +
> > +static int
> > +xfs_reclaim_inode_shrink(
> > +   int             nr_to_scan,
> > +   gfp_t           gfp_mask)
> > +{
> > +   struct xfs_mount *mp;
> > +   struct xfs_perag *pag;
> > +   xfs_agnumber_t  ag;
> > +   int             reclaimable = 0;
> > +
> > +   if (nr_to_scan) {
> > +           if (!(gfp_mask & __GFP_FS))
> > +                   return -1;
> > +
> > +           down_read(&xfs_mount_list_lock);
> > +           list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
> > +                   xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
> > +                                   XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
> > +                   if (nr_to_scan <= 0)
> > +                           break;
> 
> So we limit our scan using nr_to_scan, and we'll break out as
> soon as we have exhausted that many scans

Yup, that's exactly how shrinkers are supposed to work. If there
is more to do, then the shrinker will be called again with a new
nr_to_scan value.
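
From the VM side that protocol looks roughly like the sketch below: a
nr_to_scan of zero is a query for the object count, and non-zero calls hand
the shrinker one batch of the total scan budget. This is a simplified
illustration, not the actual shrink_slab() code in mm/vmscan.c.

/*
 * Simplified sketch of how the VM drives a shrinker (2.6.34-era
 * two-argument callback).  Batch sizing, proportioning against the
 * number of pages scanned and most error handling are omitted.
 */
#define SHRINK_BATCH	128

static void
drive_shrinker(
	struct shrinker	*s,
	gfp_t		gfp_mask,
	long		to_scan)
{
	/* nr_to_scan == 0 asks "how many objects could you free?" */
	long		total = s->shrink(0, gfp_mask);

	while (to_scan > 0 && total > 0) {
		int	batch = min_t(long, to_scan, SHRINK_BATCH);

		/* a non-zero nr_to_scan asks the shrinker to scan that many */
		if (s->shrink(batch, gfp_mask) < 0)
			break;	/* shrinker declined, e.g. !(gfp_mask & __GFP_FS) */
		to_scan -= batch;
		total -= batch;
	}
}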

> (i.e., reclaimed inodes). 

A scan count does not mean we actually freed/reclaimed an object; it's
the number of objects we've looked at with the intent to reclaim. We may
not reclaim any by the time the scan count is exhausted, and that's why
proportioning across filesystems is not working in this case.

The reason for this behaviour is that it prevents scanning thousands
(even 10s or 100s of thousands) of objects before finding enough to
reclaim.  If there are that many unreclaimable objects on the end of
the LRU, then there's no point trying to reclaim from it right now
and the time is better spent on reclaiming from somewhere else.
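
Concretely, the walk charges every inode it examines against the scan budget,
whether or not it is actually reclaimed - something like the loop below,
simplified from the xfs_inode_ag_walk hunk above (the lookup helper is
hypothetical):

	do {
		struct xfs_inode	*ip;

		ip = next_reclaim_tagged_inode(pag);	/* hypothetical lookup */
		if (!ip)
			break;			/* nothing left in this AG */

		error = execute(ip, pag, flags);	/* e.g. xfs_reclaim_inode() */
		if (error == EFSCORRUPTED)
			break;

		/* budget is spent per inode scanned, not per inode reclaimed */
	} while ((*nr_to_scan)--);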

> We're enforcing this limit across the global list of
> all mounts.  Since we don't update the list, and we traverse
> the list from the beginning each time, the first mount always
> gets hit, the ones later in the list may never be reached.

Yes, that may happen. However, if the first filesystem has more
inodes to reclaim, we don't lose anything by reclaiming them first,
as the xfssyncd for the other filesystems will come by soon and do a
complete reclaim pass on them.

The issue with the shrinker is that if we split the batch up over
all the filesystems, then the shrinker makes much slower progress
because the batches are not large enough to get past the first
cluster of reclaimable inodes on each filesystem that is currently
under IO....

> I think--given the restricted environment we're working in
> here--we could make the scans more fair by rotating the starting
> point for scans each time through.  I.e., record the last mount
> point we scanned and pick up there next time (or at the front
> of the list if that mount went away).

I agree with you that if the aim were to keep this code, then we'd
need to do this. However, it's not worth the effort of designing and
testing a fancy scheme given that I'm going to do my best to remove
this list traversal as quickly as possible. The current stuff works,
but trying to improve it is like polishing a turd - no matter how
shiny you make it, in the end it's still just a smelly pile of crap.
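
For the record, a crude version of the rotation Alex suggests could look like
the sketch below - rotate the head of the global list to the tail after each
shrinker call so the next call starts with a different filesystem. The write
lock and the rotation are illustration only; nothing like this is in the patch.

	down_write(&xfs_mount_list_lock);
	list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
		xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
				XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
		if (nr_to_scan <= 0)
			break;
	}
	/* crude round-robin: push the first mount to the tail so the
	 * next call starts the walk somewhere else */
	if (!list_empty(&xfs_mount_list))
		list_move_tail(xfs_mount_list.next, &xfs_mount_list);
	up_write(&xfs_mount_list_lock);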

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
