xfs
[Top] [All Lists]

Re: [PATCH 5/7] xfs: add configuration of error failure speed

To: xfs@xxxxxxxxxxx
Subject: Re: [PATCH 5/7] xfs: add configuration of error failure speed
From: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
Date: Wed, 4 May 2016 11:57:29 +0200
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20160503232419.GV26977@dastard>
Mail-followup-to: xfs@xxxxxxxxxxx
References: <1462298140-12411-1-git-send-email-cmaiolino@xxxxxxxxxx> <1462298140-12411-6-git-send-email-cmaiolino@xxxxxxxxxx> <20160503232419.GV26977@dastard>
User-agent: Mutt/1.5.24 (2015-08-30)
On Wed, May 04, 2016 at 09:24:19AM +1000, Dave Chinner wrote:
> On Tue, May 03, 2016 at 07:55:38PM +0200, Carlos Maiolino wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > On reception of an error, we can fail immediately, perform some
> > bound amount of retries or retry indefinitely. The current behaviour
> > we have is to retry forever.
> > 
> > However, we'd like the ability to choose what behaviour we have, and
> > that requires the ability to configure the behaviour through the new
> > sysfs interfaces.
> > 
> > max_retries is used to set the number of retries to do until permanently 
> > fail,
> > the only special case is when it is set to -1, which, will cause the system 
> > to
> > retry indefinitely.
> 
> This doesn't read particularly well, and its more a description of
> the mechanism (which shoul dbe in the code) rather than the reason
> for implementing it.. The original commit message talked about
> different failure characteristics, which still need to be mentioned
> here. e.g:
> 
>       We'd like to be able to configure errors to either fail
>       immediately (fail fast), retry for some time before
>       failing (fail slow) or retry forever (fail never). This is
>       implemented by using specific values for the number of
>       retries we allow.

Fair enough, I will re-work the patch descriptions in the next version.


> > 
> > Add both a maximum retry count and a retry timeout so that we can
> > bound by time and/or physical IO attempts.
> > 
> > Finally, plumb these into xfs_buf_iodone error processing so that
> > the error behaviour follows the selected configuration.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Signed-off-by:
> > Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> 
> You also need to document what changes you've made to the original
> patches. I can't tell what you've changed in each patch just by
> looking at them - this is the first one where I noticed something
> while reading it. i.e. the commit message should have:
> 
> [ cmaiolino: removed fail-fast/slow/never and instead use retry count
> to determine behaviour ]
>

Ok
 
> > ---
> >  fs/xfs/xfs_buf.h      | 23 ++++++++++++++-
> >  fs/xfs/xfs_buf_item.c | 13 +++++++--
> >  fs/xfs/xfs_mount.h    |  3 +-
> >  fs/xfs/xfs_sysfs.c    | 81 
> > ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  4 files changed, 112 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index adef116..bd978f0 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -183,7 +183,28 @@ typedef struct xfs_buf {
> >     unsigned int            b_page_count;   /* size of page array */
> >     unsigned int            b_offset;       /* page offset in first page */
> >     int                     b_error;        /* error code on I/O */
> > -   int                     b_last_error;   /* previous async I/O error */
> > +
> > +   /*
> > +    * async write failure retry count. Initialised to zero on the first
> > +    * failure, then when it exceeds the maximum configured without a
> > +    * success the write is considered to be failed permanently and the
> > +    * iodone handler will take appropriate action.
> > +    *
> > +    * For retry timeouts, we record the jiffie of the first failure. This
> > +    * means that we can change the retry timeout and it on the next error
> > +    * it will be checked against the newly configured timeout. This
> > +    * prevents buffers getting stuck in retry loops with a really long
> > +    * timeout.
> > +    *
> > +    * last_error is used to ensure that we are getting repeated errors, not
> > +    * different errors. e.g. a block device might change ENOSPC to EIO when
> > +    * a failure timeout occurs, so we want to re-initialise the error
> > +    * retry behaviour appropriately when that happens.
> > +    */
> > +   int                     b_retries;
> > +   unsigned long           b_first_retry_time; /* in jiffies */
> > +   int                     b_last_error;
> > +
> >     const struct xfs_buf_ops        *b_ops;
> >  
> >  #ifdef XFS_BUF_LOCK_TRACKING
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 3a30a88..68ed2a0 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -1086,6 +1086,9 @@ xfs_buf_iodone_callback_error(
> >             bp->b_flags |= (XBF_WRITE | XBF_ASYNC |
> >                             XBF_DONE | XBF_WRITE_FAIL);
> >             bp->b_last_error = bp->b_error;
> > +           bp->b_retries = 0;
> > +           bp->b_first_retry_time = jiffies;
> > +
> >             xfs_buf_ioerror(bp, 0);
> >             xfs_buf_submit(bp);
> >             return true;
> > @@ -1095,8 +1098,13 @@ xfs_buf_iodone_callback_error(
> >      * Repeated failure on an async write. Take action according to the
> >      * error configuration we have been set up to use.
> >      */
> > -   if (!cfg->max_retries)
> > -           goto permanent_error;
> > +   if (cfg->max_retries >= 0)
> > +           if (++bp->b_retries > cfg->max_retries)
> > +                   goto permanent_error;
> > +   if (cfg->retry_timeout)
> > +           if (time_after(jiffies,
> > +                          cfg->retry_timeout + bp->b_first_retry_time))
> > +                   goto permanent_error;
> 
> This is hard to read and the pattern is prone to future maintenance
> problems. i.e. nested single line if statements are prone to people
> making mistakes with indentation when adding more conditions to
> them.
> 
> So:
> 
>       if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
>           ++bp->b_retries > cfg->max_retries)
>               goto permanent_error;
>       if (cfg->retry_timeout &&
>           time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
>               goto permanent_error;
>

Fair enough, will change this in the next revision.

Thanks for the review
 
> >     /* still a transient error, higher layers will retry */
> >     xfs_buf_ioerror(bp, 0);
> > @@ -1139,6 +1147,7 @@ xfs_buf_iodone_callbacks(
> >      * retry state here in preparation for the next error that may occur.
> >      */
> >     bp->b_last_error = 0;
> > +   bp->b_retries = 0;
> >  
> >     xfs_buf_do_callbacks(bp);
> >     bp->b_fspriv = NULL;
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 0c5a976..0382140 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -54,7 +54,8 @@ enum {
> >  
> >  struct xfs_error_cfg {
> >     struct xfs_kobj kobj;
> > -   int             max_retries;
> > +   int             max_retries;
> 
> #define XFS_ERR_RETRY_FOREVER (-1)
> 
> > +   unsigned long   retry_timeout;  /* in jiffies, 0 = no timeout */
> >  };
> ....
> > +static ssize_t
> > +max_retries_store(
> > +   struct kobject  *kobject,
> > +   const char      *buf,
> > +   size_t          count)
> > +{
> > +   struct xfs_error_cfg *cfg = to_error_cfg(kobject);
> > +   int             ret;
> > +   int             val;
> > +
> > +   ret = kstrtoint(buf, 0, &val);
> > +   if (ret)
> > +           return ret;
> > +
> > +   if (val < -1 || val > INT_MAX)
> > +           return -EINVAL;
> 
> Should special case XFS_ERR_RETRY_FOREVER here. i.e. range is
> 0-INT_MAX || XFS_ERR_RETRY_FOREVER.
> 
>       if ((val < -1 || val > INT_MAX) &&
>           val != XFS_ERR_RETRY_FOREVER)
>               return -EINVAL;
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

-- 
Carlos

<Prev in Thread] Current Thread [Next in Thread>