Re: [PATCH V4] xfs: Document error handlers behavior

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH V4] xfs: Document error handlers behavior
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Wed, 14 Sep 2016 17:31:43 -0500
Cc: Carlos Maiolino <cmaiolino@xxxxxxxxxx>, linux-xfs@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20160914222255.GQ30497@dastard>
References: <1473757385-81633-1-git-send-email-cmaiolino@xxxxxxxxxx> <20160914012334.GK30497@dastard> <77fcc1da-9c74-da3e-5bcd-3df420a3bfbb@xxxxxxxxxxx> <20160914222255.GQ30497@dastard>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.3.0

On 9/14/16 5:22 PM, Dave Chinner wrote:
>> Same issue here, really; they are symmetric, right?  First condition met for
>> propagation propagates the error, period.  This sounds overly complex, unless
>> I'm missing something. Seems like:
>> 
>> +  Setting the value to "N" (where 0 < N < Max) will make XFS retry the
>> +  operation for "N" seconds before propagating the error.
>> 
>> would suffice, no?
> No, because that's not what the implementation does:
> 
>       if (retries expired)
>               fail
>       if (retry timer expired)
>               fail
> 
> IOWs, the retry count has precedence over the retry timer. If you
> set both retry_timeout and max_retries, the timeout only takes
> effect if max retries is set high enough that they aren't exhausted
> before the timeout fires.

Then:

+       Setting the value to "N" (where 0 < N < Max) will /allow/ XFS to retry
+       the operation for /up to/ "N" seconds before propagating the error.

?  i.e. it could, but only if the retries don't expire first :)

> This is for the case where a failure might take a variable time to
> report.  (Think interactions with errors that TLER would address).
> Normally you might say 10 retries, but if it is taking 5 minutes to
> then fail when this specific error condition is hit, you might set a retry
> timeout of 1 minute. In that case, we might get an immediate IO
> error and retry several times before failing. However, if we hit the
> "slow to report" error, we still get failure in the same time frame
> as the immediate failures that have been retried many times before
> giving up.

Either way, it's still the /first/ condition satisfied which will
bubble it up.

So "first condition satisfied will propagate up," plus
one condition is "retry up to N times," plus
one condition is "retry for up to N seconds"

seems to cover it all, no?
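
To make the "first condition satisfied" reading concrete, here is a small
model of the check order from the pseudocode quoted above (retry count
first, then retry timer). It is only a sketch, not the kernel code: the
names should_fail and simulate are made up for illustration, None stands in
for "no limit", and elapsed time is assumed to be counted from when the
operation was first issued.

```python
from typing import Optional, Tuple


def should_fail(retries_done: int, elapsed_s: float,
                max_retries: Optional[int],
                retry_timeout_s: Optional[float]) -> bool:
    """Return True once the error should be propagated instead of retried."""
    # Same order as the quoted pseudocode: retry count is checked first.
    if max_retries is not None and retries_done >= max_retries:
        return True
    # The timer only matters if the retries have not run out yet.
    if retry_timeout_s is not None and elapsed_s >= retry_timeout_s:
        return True
    return False


def simulate(seconds_per_failure: float, max_retries: Optional[int],
             retry_timeout_s: Optional[float]) -> Tuple[int, float]:
    """Retry an always-failing operation; return (retries done, elapsed seconds)."""
    if max_retries is None and retry_timeout_s is None:
        raise ValueError("need at least one limit, or this would retry forever")
    elapsed = seconds_per_failure  # the first failure has just been reported
    retries_done = 0
    while not should_fail(retries_done, elapsed, max_retries, retry_timeout_s):
        retries_done += 1
        elapsed += seconds_per_failure
    return retries_done, elapsed


# Fast-failing error, 10 retries allowed, 60 second timeout:
# the retry count is the first condition met.
fast = simulate(0.5, 10, 60.0)

# Slow-to-report error (5 minutes per failure), same settings:
# the timer is the first condition met, before any retry is used.
slow = simulate(300.0, 10, 60.0)
```

In this model the fast case gives up after its 10 retries, long before the
timer matters, and the slow case gives up on the timer without using a
single retry. Either way it is whichever limit is hit first.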

> It's hard to explain complex stuff like this with a simple, concise
> description. I'll try again....
> 
> Cheers,
> 
> Dave.
