On 9/14/16 5:22 PM, Dave Chinner wrote:
>> Same issue here, really; they are symmetric, right? First condition met for
>> propagation propagates the error, period. This sounds overly complex, unless
>> I'm missing something. Seems like:
>>
>> + Setting the value to "N" (where 0 < N < Max) will make XFS retry the
>> + operation for "N" seconds before propagating the error.
>>
>> would suffice, no?
> No, because that's not what the implementation does:
>
> 	if (retries expired)
> 		fail
> 	if (retry timer expired)
> 		fail
>
> IOWs, the retry count has precedence over the retry timer. If you
> set both retry_timeout and max_retries, the timeout only takes
> effect if max retries is set high enough that they aren't exhausted
> before the timeout fires.
Then:
+ Setting the value to "N" (where 0 < N < Max) will /allow/ XFS to retry
+ the operation for /up to/ "N" seconds before propagating the error.
? i.e. it could, but only if the retries don't expire first :)
> This is for the case where a failure might take a variable time to
> report. (Think interactions with errors that TLER would address).
> Normally you might say 10 retries, but if it is taking 5 minutes to
> then fail when this specific error condition is hit, you might set a retry
> timeout of 1 minute. In that case, we might get an immediate IO
> error and retry several times before failing. However, if we hit the
> "slow to report" error, we still get failure in the same time frame
> as the immediate failures that have been retried many times before
> giving up.
Either way, it's still the /first/ condition satisfied which will
bubble it up.
So "first condition satisfied will propagate up," plus
one condition is "retry up to N times," plus
one condition is "retry for up to N seconds"
seems to cover it all, no?
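
To make that concrete, here is a rough sketch of the rule as I read it
from the pseudocode above. The struct, field, and function names are
made up for illustration and are not the actual XFS code, and it
ignores any special values (such as "retry forever") and only covers
the 0 < N < Max case:

```c
#include <stdbool.h>

/* Hypothetical configuration; names are illustrative, not XFS's. */
struct retry_cfg {
	int  max_retries;        /* give up once this many retries are used */
	long retry_timeout_secs; /* give up once this much time has passed */
};

/*
 * Return true if the error should be propagated now.
 * Whichever limit is reached first ends the retrying; the retry
 * count is checked before the timer, as in the pseudocode above.
 */
bool should_fail(const struct retry_cfg *cfg, int retries_done,
		 long elapsed_secs)
{
	if (retries_done >= cfg->max_retries)
		return true;	/* retries expired */
	if (elapsed_secs >= cfg->retry_timeout_secs)
		return true;	/* retry timer expired */
	return false;		/* neither limit hit: keep retrying */
}
```

With 10 retries and a 1 minute timeout, an error that reports
immediately fails once the 10 retries are used up, while a "slow to
report" error that takes 5 minutes fails on the timer after a single
attempt.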
> It's hard to explain complex stuff like this with a simple, concise
> description. I'll try again....
>
> Cheers,
>
> Dave.