xfs
[Top] [All Lists]

Re: [PATCH] xfs: introduce object readahead to log recovery

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH] xfs: introduce object readahead to log recovery
From: Zhi Yong Wu <zwu.kernel@xxxxxxxxx>
Date: Fri, 26 Jul 2013 14:36:15 +0800
Cc: xfstests <xfs@xxxxxxxxxxx>, "linux-fsdevel@xxxxxxxxxxxxxxx" <linux-fsdevel@xxxxxxxxxxxxxxx>, linux-kernel mlist <linux-kernel@xxxxxxxxxxxxxxx>, Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=AAiVwY67ZL/azce62fZ31T8At8O2t3Lk30BehiaWtoY=; b=NFDBcaD1nnKAjZfiBFy+P1JdccjADv79na7vdflk/8Jv83a2uwxnOVyhgVfb4e4+3J YnvqKxegd/7zKMvCvNjYN6E3lZ0YbCJ+kGOnFgd2AMqmZOOnxiipuplWEDq3CF/HqxC8 CNg4CUBuxttWQcyTBJl9bEOyR/uaROyYa/S/LPCJZFZ5j7/2uVN5z2xklaXXb8N3OqFe EoIt3ihKQhpuLYWgOJz1N51khm+FHWJDx7yTpZrdSeiWxHgudfY2kjhyH5/O1H1BO1FL 2DCAotRH0MsGcuulxfgk8NSErxeRMMttTXCIsMYStoOHJZGfDjPZrrrRb2qiBfP9L7W8 cPOg==
In-reply-to: <20130726025009.GE21982@dastard>
References: <1374740619-29797-1-git-send-email-zwu.kernel@xxxxxxxxx> <20130726025009.GE21982@dastard>
Dave,

All comments are good to me, and will be applied to next version, thanks a lot.

On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@xxxxxxxxx wrote:
>> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
>>
>>   It can take a long time to run log recovery operation because it is
>> single threaded and is bound by read latency. We can find that it took
>> most of the time to wait for the read IO to occur, so if one object
>> readahead is introduced to log recovery, it will obviously reduce the
>> log recovery time.
>>
>>   In dirty log case as below:
>>     data device: 0xfd10
>>     log device: 0xfd10 daddr: 20480032 length: 20480
>>
>>     log tail: 7941 head: 11077 state: <DIRTY>
>
> That's only a small log (10MB). As I've said on irc, readahead won't
Yeah, it is one 10MB log, but how do you calculate it based on the above info?

> show any real difference on this and you need to be testing with
> large logs.
>
>>
>> This dirty ratio is about 15%. I am trying to do tests in larger scale
>> and dirtier filesystem environment.
>>
>> Log recovery time stat:
>>
>>             w/o this patch        w/ this patch
>>   real         0m1.051s             0m0.965s
>>   sys          0m0.033s             0m0.035s
>>
>>   iowait       0m1.018s             0m0.930s
>
> Well, it's not made much of a difference there. That's probably
> within the noise of repeated log recovery runs.
>
> My simple test is:
>
> $ cat recovery_time.sh
> #!/bin/bash
>
> cd /home/dave/src/compilebench-0.6/
>
> mkfs.xfs -f /dev/vdc;
> mount /dev/vdc /mnt/scratch
> chmod 777 /mnt/scratch;
> ./compilebench --no-sync -D /mnt/scratch &
> sleep 55
> /home/dave/src/xfstests-dev/src/godown /mnt/scratch
> umount /mnt/scratch
> xfs_logprint -t /dev/vdc |head -20
> time mount /dev/vdc /mnt/scratch
> umount /mnt/scratch
> $
>
> And the recovery time from this is between 15-17s:
>
> ....
>     log device: 0xfd20 daddr: 107374182032 length: 4173824
>                                                    ^^^^^^^ almost 2GB
>         log tail: 19288 head: 264809 state: <DIRTY>
> ....
> real    0m17.913s
> user    0m0.000s
> sys     0m2.381s
>
> And runs at 3-4000 read IOPs for most of that time. It's largely IO
> bound, even on SSDs.
>
> With your patch:
>
> log tail: 35871 head: 308393 state: <DIRTY>
> real    0m12.715s
> user    0m0.000s
> sys     0m2.247s
>
> And it peaked at ~5000 read IOPS.
How do you know its READ IOPS is ~5000?

>
> It's definitely an improvement, but there's a lot of dead time still
> spent waiting for IO that we should be able to remove. Let's have a
> look at the code...
>
>
>> +STATIC void
>> +xlog_recover_inode_ra_pass2(
>> +     struct xlog                     *log,
>> +     struct xlog_recover_item        *item)
>> +{
>> +     xfs_inode_log_format_t  *in_f;
>> +     xfs_mount_t             *mp = log->l_mp;
>> +     int                     error;
>> +     int                     need_free = 0;
>> +
>> +     if (item->ri_buf[0].i_len == sizeof(xfs_inode_log_format_t)) {
>> +             in_f = item->ri_buf[0].i_addr;
>> +     } else {
>> +             in_f = kmem_alloc(sizeof(xfs_inode_log_format_t), KM_SLEEP);
>> +             need_free = 1;
>> +             error = xfs_inode_item_format_convert(&item->ri_buf[0], in_f);
>> +             if (error)
>> +                     goto error;
>> +     }
>
> I'd just put the conversion buffer on stack and avoid the need to
> alloc/free memory here. There's plenty of stack space available
> during recovery here, so let's use it to keep the overhead of
> readahead down.
OK, will use the buffer on stack.
>
>> +STATIC void
>> +xlog_recover_dquot_ra_pass2(
>> +     struct xlog                     *log,
>> +     struct xlog_recover_item        *item)
>> +{
>> +     xfs_mount_t             *mp = log->l_mp;
>> +     xfs_buf_t               *bp;
>> +     struct xfs_disk_dquot   *recddq;
>> +     int                     error;
>> +     xfs_dq_logformat_t      *dq_f;
>> +     uint                    type;
>> +
>> +
>> +     if (mp->m_qflags == 0)
>> +             return;
>> +
>> +     recddq = item->ri_buf[1].i_addr;
>> +     if (recddq == NULL)
>> +             return;
>> +     if (item->ri_buf[1].i_len < sizeof(xfs_disk_dquot_t))
>> +             return;
>> +
>> +     type = recddq->d_flags & (XFS_DQ_USER | XFS_DQ_PROJ | XFS_DQ_GROUP);
>> +     ASSERT(type);
>> +     if (log->l_quotaoffs_flag & type)
>> +             return;
>> +
>> +     dq_f = item->ri_buf[0].i_addr;
>> +     ASSERT(dq_f);
>> +     error = xfs_qm_dqcheck(mp, recddq, dq_f->qlf_id, 0, XFS_QMOPT_DOWARN,
>> +                        "xlog_recover_dquot_ra_pass2 (log copy)");
>> +     if (error)
>> +             return;
>> +     ASSERT(dq_f->qlf_len == 1);
>> +
>> +     error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dq_f->qlf_blkno,
>> +                                XFS_FSB_TO_BB(mp, dq_f->qlf_len), 0, &bp,
>> +                                NULL);
>> +     if (!error)
>> +             xfs_buf_relse(bp);
>> +}
>
> That's not doing readahead - that's an integrity check followed by a
> blocking read.  Shouldn't it just be calling xfs_buf_readahead() on
> dq_f->qlf_blkno/dq_f->qlf_len, after checking whether it had been
> cancelled?
ah, i miss tha it is one blocking read. will fix it
>
>>  STATIC int
>>  xlog_recover_commit_pass1(
>>       struct xlog                     *log,
>> @@ -3140,10 +3255,14 @@ xlog_recover_commit_pass2(
>>       struct xlog                     *log,
>>       struct xlog_recover             *trans,
>>       struct list_head                *buffer_list,
>> -     struct xlog_recover_item        *item)
>> +     struct xlog_recover_item        *item,
>> +     struct xlog_recover_item        *next_item)
>>  {
>>       trace_xfs_log_recover_item_recover(log, trans, item, 
>> XLOG_RECOVER_PASS2);
>>
>> +     if (next_item)
>> +             xlog_recover_ra_pass2(log, next_item);
>> +
>
> Ah, that explains the limited improvement - it's only doing
> readahead of a single item at a time. IOWs, we are completing
> recovery of the current object before IO completion of the next
> object has occurred. This generally means the read-ahead queue is
Yes.
> not deep enough...
>
>>       switch (ITEM_TYPE(item)) {
>>       case XFS_LI_BUF:
>>               return xlog_recover_buffer_pass2(log, buffer_list, item);
>> @@ -3181,7 +3300,7 @@ xlog_recover_commit_trans(
>>       int                     pass)
>>  {
>>       int                     error = 0, error2;
>> -     xlog_recover_item_t     *item;
>> +     xlog_recover_item_t     *item, *next_item, *temp_item;
>>       LIST_HEAD               (buffer_list);
>>
>>       hlist_del(&trans->r_list);
>> @@ -3190,14 +3309,18 @@ xlog_recover_commit_trans(
>>       if (error)
>>               return error;
>>
>> -     list_for_each_entry(item, &trans->r_itemq, ri_list) {
>> +     list_for_each_entry_safe(item, temp_item, &trans->r_itemq, ri_list) {
>>               switch (pass) {
>>               case XLOG_RECOVER_PASS1:
>>                       error = xlog_recover_commit_pass1(log, trans, item);
>>                       break;
>>               case XLOG_RECOVER_PASS2:
>> +                     if (&temp_item->ri_list != &trans->r_itemq)
>> +                             next_item = temp_item;
>> +                     else
>> +                             next_item = NULL;
>>                       error = xlog_recover_commit_pass2(log, trans,
>> -                                                       &buffer_list, item);
>> +                                       &buffer_list, item, next_item);
>
> Ok, so you've based the readahead on the transaction item list
> having a next pointer. What I think you should do is turn this into
> a readahead queue by moving objects to a new list. i.e.
>
>         list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
>
>                 case XLOG_RECOVER_PASS2:
>                         if (ra_qdepth++ >= MAX_QDEPTH) {
>                                 recover_items(log, trans, &buffer_list, 
> &ra_item_list);
>                                 ra_qdepth = 0;
>                         } else {
>                                 xlog_recover_item_readahead(log, item);
>                                 list_move_tail(&item->ri_list, &ra_item_list);
>                         }
>                         break;
>                 ...
>                 }
>         }
>         if (!list_empty(&ra_item_list))
>                 recover_items(log, trans, &buffer_list, &ra_item_list);
>
> I'd suggest that a queue depth somewhere between 10 and 100 will
> be necessary to keep enough IO in flight to keep the pipeline full
> and prevent recovery from having to wait on IO...
Good suggestion, will apply it to next version, thanks.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx


-- 
Regards,

Zhi Yong Wu

<Prev in Thread] Current Thread [Next in Thread>