
Re: [PATCH] xfs: introduce object readahead to log recovery

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH] xfs: introduce object readahead to log recovery
From: Zhi Yong Wu <zwu.kernel@xxxxxxxxx>
Date: Mon, 29 Jul 2013 09:38:11 +0800
Cc: xfstests <xfs@xxxxxxxxxxx>, "linux-fsdevel@xxxxxxxxxxxxxxx" <linux-fsdevel@xxxxxxxxxxxxxxx>, linux-kernel mlist <linux-kernel@xxxxxxxxxxxxxxx>, Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
In-reply-to: <20130726113521.GM13468@dastard>
References: <1374740619-29797-1-git-send-email-zwu.kernel@xxxxxxxxx> <20130726025009.GE21982@dastard> <CAEH94Lh-UCCEs7hQi_t5v+X+ER1DH9dCtjr6e9GVNX5KJ-f1hQ@xxxxxxxxxxxxxx> <20130726113521.GM13468@dastard>
On Fri, Jul 26, 2013 at 7:35 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote:
>> Dave,
>>
>> All of your comments look good to me and will be addressed in the next
>> version, thanks a lot.
>>
>> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@xxxxxxxxx wrote:
>> >> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
>> >>
>> >>   Log recovery can take a long time because it is single threaded and
>> >> bound by read latency. Most of that time is spent waiting for read IO
>> >> to complete, so introducing object readahead to log recovery
>> >> noticeably reduces the recovery time.
>> >>
>> >>   For a dirty log such as the one below:
>> >>     data device: 0xfd10
>> >>     log device: 0xfd10 daddr: 20480032 length: 20480
>> >>
>> >>     log tail: 7941 head: 11077 state: <DIRTY>
>> >
>> > That's only a small log (10MB). As I've said on irc, readahead won't
>> Yeah, it is a 10MB log, but how did you calculate that from the above
>> info?
>
> length = 20480 blocks. 20480 * 512 = 10MB....
Thanks.
>
>> > And the recovery time for this is between 15 and 17s:
>> >
>> > ....
>> >     log device: 0xfd20 daddr: 107374182032 length: 4173824
>> >                                                    ^^^^^^^ almost 2GB
>> >         log tail: 19288 head: 264809 state: <DIRTY>
>> > ....
>> > real    0m17.913s
>> > user    0m0.000s
>> > sys     0m2.381s
>> >
>> > And runs at 3,000-4,000 read IOPS for most of that time. It's largely IO
>> > bound, even on SSDs.
>> >
>> > With your patch:
>> >
>> > log tail: 35871 head: 308393 state: <DIRTY>
>> > real    0m12.715s
>> > user    0m0.000s
>> > sys     0m2.247s
>> >
>> > And it peaked at ~5000 read IOPS.
>> How do you know the read IOPS peaked at ~5000?
>
> Other monitoring. iostat can tell you this, though I use PCP...
Thanks.
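
For context, iostat's extended statistics (iostat -dx) report read IOPS in
the r/s column; the same figure can be derived by sampling /proc/diskstats,
where the first counter after the device name is the number of reads
completed. A minimal sketch in C, assuming the device under test is "sdb"
(a hypothetical name for illustration):

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        /* Return the "reads completed" counter for @dev from /proc/diskstats. */
        static unsigned long long reads_completed(const char *dev)
        {
                unsigned long long reads = 0;
                unsigned int major, minor;
                char name[32], line[256];
                FILE *f = fopen("/proc/diskstats", "r");

                if (!f)
                        return 0;
                while (fgets(line, sizeof(line), f)) {
                        unsigned long long r;

                        if (sscanf(line, "%u %u %31s %llu",
                                   &major, &minor, name, &r) == 4 &&
                            strcmp(name, dev) == 0) {
                                reads = r;
                                break;
                        }
                }
                fclose(f);
                return reads;
        }

        int main(void)
        {
                unsigned long long before, after;

                before = reads_completed("sdb");  /* hypothetical device */
                sleep(1);
                after = reads_completed("sdb");

                /* Reads completed over a 1s window == read IOPS. */
                printf("read IOPS: %llu\n", after - before);
                return 0;
        }
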
>
>> > Ok, so you've based the readahead on the transaction item list
>> > having a next pointer. What I think you should do is turn this into
>> > a readahead queue by moving objects to a new list. i.e.
>> >
>> >         list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
>> >                 switch (pass) {
>> >                 case XLOG_RECOVER_PASS2:
>> >                         if (ra_qdepth++ >= MAX_QDEPTH) {
>> >                                 recover_items(log, trans, &buffer_list,
>> >                                               &ra_item_list);
>> >                                 ra_qdepth = 0;
>> >                         } else {
>> >                                 xlog_recover_item_readahead(log, item);
>> >                                 list_move_tail(&item->ri_list,
>> >                                                &ra_item_list);
>> >                         }
>> >                         break;
>> >                 ...
>> >                 }
>> >         }
>> >         if (!list_empty(&ra_item_list))
>> >                 recover_items(log, trans, &buffer_list, &ra_item_list);
>> >
>> > I'd suggest that a queue depth somewhere between 10 and 100 will
>> > be necessary to keep enough IO in flight to keep the pipeline full
>> > and prevent recovery from having to wait on IO...
>> Good suggestion, I will apply it to the next version, thanks.
>
> FWIW, I hacked a quick test of this into your patch here and a depth
> of 100 brought the recovery time down to under 8s. For other
> workloads which have nothing but dirty inodes (like fsmark), a depth
> of 100 drops the recovery time from ~100s to ~25s, and the read rate
> peaks at well over 15,000 IOPS. So we definitely want to queue
> up more than a single readahead...
Great, I will try it.
By the way, how do you generate a workload that has nothing but dirty
dquot objects?
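
To make the queueing approach above concrete, here is a minimal
self-contained sketch of what such a pass-2 loop could look like.
recover_items(), xlog_recover_item_readahead(), MAX_QDEPTH and the xlog
type names follow Dave's pseudocode and are assumptions, not the final
patch; recover_items() is assumed to replay and drain the batch it is
given. Unlike the pseudocode above, this variant queues every item for
readahead before flushing, so the item that fills the queue is not
skipped:

        #define MAX_QDEPTH      100

        static int
        xlog_recover_items_pass2_queued(
                struct xlog             *log,
                struct xlog_recover     *trans,
                struct list_head        *buffer_list)
        {
                struct xlog_recover_item        *item, *next;
                LIST_HEAD(ra_item_list);
                int                     ra_qdepth = 0;
                int                     error = 0;

                list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
                        /* Issue readahead for the object and defer its replay. */
                        xlog_recover_item_readahead(log, item);
                        list_move_tail(&item->ri_list, &ra_item_list);

                        /* Enough IO in flight: replay the queued batch. */
                        if (++ra_qdepth >= MAX_QDEPTH) {
                                error = recover_items(log, trans, buffer_list,
                                                      &ra_item_list);
                                if (error)
                                        return error;
                                ra_qdepth = 0;
                        }
                }

                /* Replay whatever is still queued at the end of the item list. */
                if (!list_empty(&ra_item_list))
                        error = recover_items(log, trans, buffer_list,
                                              &ra_item_list);
                return error;
        }

The batch flush bounds memory usage while keeping up to MAX_QDEPTH reads
in flight, which is what drives the IOPS improvement discussed above.
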

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx



-- 
Regards,

Zhi Yong Wu
