Re: [PATCH V4] xfs: Document error handlers behavior

To: Dave Chinner <david@xxxxxxxxxxxxx>, Carlos Maiolino <cmaiolino@xxxxxxxxxx>
Subject: Re: [PATCH V4] xfs: Document error handlers behavior
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Wed, 14 Sep 2016 10:10:05 -0500
Cc: linux-xfs@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20160914012334.GK30497@dastard>
References: <1473757385-81633-1-git-send-email-cmaiolino@xxxxxxxxxx> <20160914012334.GK30497@dastard>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.3.0

On 9/13/16 8:23 PM, Dave Chinner wrote:

> Ok, I had to update this for the change in retry timeout values from
> Eric, so I went and fixed all the other things I thought needed
> fixing, too. New patch below....
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> xfs: Document error handlers behavior
> 
> From: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> 
> Document the implementation of error handlers into sysfs.
> 
> [dchinner: significant update:
>       - removed examples from concept descriptions, placed them in
>         appropriate detailed descriptions instead
>       - added explanations for <dev>, <class> and <error> strings
>         in sysfs layout description
>       - added specific definition of "global" per-filesystem error
>         configuration parameters.
>       - reformatted to remove multiple indents
>       - added more information about fail_at_unmount behaviour and
>         constraints
>       - added comment that there is a "default" handler to
>         configure behaviour for all errors that don't have
>         specific handlers defined.
>       - added specific handler value explanations
>       - added note about handlers having context specific
>         defaults with example. ]
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> ---
>  Documentation/filesystems/xfs.txt | 125 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 125 insertions(+)
> 
> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> index 8146e9f..705d064 100644
> --- a/Documentation/filesystems/xfs.txt
> +++ b/Documentation/filesystems/xfs.txt
> @@ -348,3 +348,128 @@ Removed Sysctls
>    ----                               -------
>    fs.xfs.xfsbufd_centisec    v4.0
>    fs.xfs.age_buffer_centisecs        v4.0
> +
> +
> +Error handling
> +==============
> +
> +XFS can act differently according to the type of error found during its
> +operation. The implementation introduces the following concepts to the error
> +handler:
> +
> + -failure speed:
> +     Defines how fast XFS should propagate an error upwards when a specific
> +     error is found during the filesystem operation. It can propagate
> +     immediately, after a defined number of retries, after a set time period,
> +     or simply retry forever.
> +
> + -error classes:
> +     Specifies the subsystem the error configuration will apply to, such as
> +     metadata IO or memory allocation. Different subsystems will have
> +     different error handlers for which behaviour can be configured.
> +
> + -error handlers:
> +     Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files, Each

files.  Each error ...

> +error handler works independently, the first condition met by and error handler

works independently - the first condition met by /an/ error handler ...

> +for a specific class will cause the error to be propagated rather than reset and
> +retried.
> +
> +The action taken by the filesystem when the error is propagated is context
> +dependent - it may cause a shut down in the case of an unrecoverable error,
> +it may be reported back to userspace, or it may even be ignored because
> +there's nothing useful we can with the error or anyone we can report it to (e.g.
> +during unmount).
> +
> +The configuration files are organized into the following per-mounted filesystem
> +hierarchy:

... into the following hierarchy for each mounted filesystem:

> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +Where:
> +  <dev>
> +     The short device name of the mounted filesystem. This is the same device
> +     name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
> +
> +  <class>
> +     The subsystem the error configuration belongs to. As of 4.9, the defined
> +     classes are:
> +
> +             - "metadata": applies metadata buffer write IO
> +
> +  <error>
> +     The individual error handler configurations.
> +
> +
> +Each filesystem has "global" error configuration options defined in their top
> +level directory:
> +
> +  /sys/fs/xfs/<dev>/error/
> +
> +  fail_at_unmount            (Min:  0  Default:  1  Max: 1)
> +     Defines the filesystem error behavior at unmount time.
> +
> +     If set to a value of 1, XFS will override all other error configurations
> +     during unmount and replace them with "immediate fail" characteristics.
> +     i.e. no retries, no retry timeout. This will always allow unmount to
> +     succeed when there are persistent errors present.
> +
> +     If set to 0, the configured retry behaviour will continue until all
> +     retries and/or timeouts have been exhausted. This will delay unmount
> +     completion when there are persistent errors, and it may prevent the
> +     filesystem from ever unmounting fully in the case of "retry forever"
> +     handler configurations.
> +
> +     Note: there is no guarantee that fail_at_unmount can be set whilst an
> +     unmount is in progress. It is possible that the sysfs entries are
> +     removed by the unmounting filesystem before a "retry forever" error
> +     handler configuration causes unmount to hang, and hence the filesystem
> +     must be configured appropriately before unmount begins to prevent
> +     unmount hangs.
> +
> +Each filesystem has specific error class handlers that define the error
> +propagation behaviour for specific errors. There is also a "default" error
> +handler defined, which defines the behaviour for all errors that don't have
> +specific handlers defined. The handler configurations are found in the
> +directory:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +  max_retries                        (Min: -1  Default: Varies  Max: INTMAX)
> +     Defines the allowed number of retries of a specific error before
> +     the filesystem will propagate the error. The retry count for a given
> +     error context (e.g. a specific metadata buffer) is reset ever time there

every time there ...

> +     is a successful completion of the operation.
> +
> +     Setting the value to "-1" will cause XFS to retry forever for this
> +     specific error.
> +
> +     Setting the value to "0" will cause XFS to fail immediately when the
> +     specific error is reported.
> +
> +     Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> +     operation "N" times before propagating the error.
> +
> +  retry_timeout_seconds              (Min:  -1  Default:  Varies  Max: 1 day)
> +     Define the amount of time (in seconds) that the filesystem is
> +     allowed to retry its operations when the specific error is
> +     found.
> +
> +     Setting the value to "-1" will set an infinite timeout, causing
> +     error propagation behaviour to be determined solely by the "max_retries"
> +     parameter.

This is asymmetric; if you want this, then max_retries should probably say that
-1 will cause the behavior to be determined solely by retry_timeout_seconds...

Could also say "removing any time limits on retries."  (and above, "removing any
count limits on retries.")

But that's already covered by "the first condition met by ..., " really.

> +
> +     Setting the value to "0" will cause XFS to fail immediately when the
> +     specific error is reported.
> +
> +     Setting the value to  "N" (where 0 < N < Max) will propagate the error
> +     on the first retry that fails at least "N" seconds after the first error
> +     was detected, unless the number of retries defined by max_retries
> +     expires first.

Same issue here, really; they are symmetric, right?  First condition met for
propagation propagates the error, period.  This sounds overly complex, unless
I'm missing something. Seems like:

+       Setting the value to "N" (where 0 < N < Max) will make XFS retry the
+       operation for "N" seconds before propagating the error.

would suffice, no?
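
To spell out the symmetric reading I have in mind, here is a minimal Python
sketch of the rule as the text describes it. It is only a model of the
documented behaviour, not the kernel implementation; the function name and
its arguments are made up for illustration.

```python
# Illustrative model only; not the kernel code. All names are hypothetical.
RETRY_FOREVER = -1  # "-1" in either sysfs file means "no limit"


def should_propagate(failed_retries, elapsed_seconds,
                     max_retries, retry_timeout_seconds):
    """Return True once either configured limit has been reached."""
    count_hit = (max_retries != RETRY_FOREVER
                 and failed_retries >= max_retries)
    time_hit = (retry_timeout_seconds != RETRY_FOREVER
                and elapsed_seconds >= retry_timeout_seconds)
    # Whichever condition is met first propagates the error.
    return count_hit or time_hit
```

Under this model, setting both values to -1 never propagates (retry forever),
and setting either one to 0 propagates on the first failure, so neither
parameter needs to be described in terms of the other.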

> +
> +Note: The default behaviour for a specific error handler is dependent on both
> +the class and error context. For example, the default values for
> +"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
> +to "fail immediately" behaviour. This is done because ENODEV is a fatal,
> +unrecoverable error no matter how many times the metadata IO is retried.
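
For anyone who wants to poke at the layout described in the patch, a minimal
Python sketch of building a handler path and writing a value follows. The
helper names, the `sysfs_root` parameter (added so the code can be pointed at
a scratch directory), and the "sda1" device name in the comment are
assumptions for illustration; writing to the real files needs root and a
mounted XFS filesystem.

```python
from pathlib import Path

# File names taken from the patch text; helper names are hypothetical.
HANDLER_FILES = ("max_retries", "retry_timeout_seconds")


def handler_path(dev, error_class, error, sysfs_root="/sys/fs/xfs"):
    """Build <root>/<dev>/error/<class>/<error> as laid out in the patch."""
    return Path(sysfs_root) / dev / "error" / error_class / error


def set_handler_value(dev, error_class, error, name, value,
                      sysfs_root="/sys/fs/xfs"):
    """Write one handler setting and return the path that was written."""
    if name not in HANDLER_FILES:
        raise ValueError(f"unknown handler file: {name}")
    if value < -1:
        raise ValueError("minimum value is -1 (retry forever / no timeout)")
    target = handler_path(dev, error_class, error, sysfs_root) / name
    target.write_text(f"{value}\n")
    return target


# Example (hypothetical device name):
#   set_handler_value("sda1", "metadata", "default", "max_retries", 5)
```

The upper bounds are deliberately not checked here; the sketch leaves that to
whatever the kernel rejects on write.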
