
Re: raid5 resync aborted under heavy XFS use

To: Chris Bednar <cjb@xxxxxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: raid5 resync aborted under heavy XFS use
From: Simon Matter <simon.matter@xxxxxxxxxxxxxxxx>
Date: Mon, 30 Jul 2001 08:55:19 +0200
Cc: Linux XFS Mailing List <linux-xfs@xxxxxxxxxxx>
Organization: Sauter AG, Basel
References: <Pine.LNX.4.10.10107291152120.11436-100000@xxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
Chris Bednar wrote:
> 
>     Hi. I still haven't determined whether this is an XFS problem or a
> raid5 problem, but here's a situation I see if raid5 is trying to resync
> an XFS volume while people are thrashing it:

Are you sure the resync is aborted because of heavy load? I had a
similar situation where I thought the resync was aborting because of
stress, but I later found out that a broken disk was the culprit.

> 
>   Jul 26 14:55:38 gigem101 kernel: raid5: in raid5_sync_request, bufsize=512 redone=6 rtrn=-5
>   Jul 26 14:55:38 gigem101 kernel: md: sync_request returned sectors=-5 (j=296588438) ... exiting
>   Jul 26 14:55:38 gigem101 kernel: raid5: resync aborted (err=-5)!
> 
> I'm running a 2.4.5-SGI_XFS_1.0.1 kernel, with md patched up to print
> errors (patch at end). The default behaviour was just to print
> `raid5: resync aborted' with no other indication (which definitely
> needs to be fixed in the md driver anyway). As you can see, the
> chunk-size on the volume in question is 512k. We have another one
> (identical hw, 128k chunk) that synced without trouble, since I kept
> users off it. This has only happened during periods of heavy
> read/write activity.
> 
>     I'm starting this on the XFS list only, since I know there have
> been issues with chunk sizes >256k elsewhere. It doesn't look to me
> like this kernel has the I/O optimization problem that's been discussed
> of late, since that seems to be turned off for ALL md devices here.
> 
>     The basic setup is 8 u160-scsi 180GB Seagate disks on an adaptec
> 29160 controller, 512k chunk, left-sym parity, 1.2 TB XFS filesystem.
> 
> ----
> Chris J. Bednar   <http://optics.tamu.edu/~bednar/>
> Director, Distributed Computing Product Group
> http://AdvancedDataSolutions.com/
> 
>   -kernel-2.4.5-raid5.printerr.patch---------
> 
> --- linux/drivers/md/md.c.printerr      Sat Jul 21 05:25:52 2001
> +++ linux/drivers/md/md.c       Tue Jul 24 23:04:29 2001
> @@ -3248,7 +3248,10 @@
> 
>         err = down_interruptible(&mddev->resync_sem);
>         if (err)
> +       {
> +               printk ("md: down_interruptible err=%d ... exiting\n", err);
>                 goto out_nolock;
> +       }
> 
>  recheck:
>         serialize = 0;
> @@ -3309,6 +3312,7 @@
>                 sectors = mddev->pers->sync_request(mddev, j);
> 
>                 if (sectors < 0) {
> +                       printk ("md: sync_request returned sectors=%d (j=%d) ... exiting\n", sectors, j);
>                         err = sectors;
>                         goto out;
>                 }
> --- linux/drivers/md/raid5.c.printerr   Sat Jul 21 05:25:52 2001
> +++ linux/drivers/md/raid5.c    Tue Jul 24 23:06:42 2001
> @@ -1167,6 +1167,7 @@
>         int data_disks = raid_disks-1;
>         int redone = 0;
>         int bufsize;
> +       int rtrn;
> 
>         sh = get_active_stripe(conf, sector_nr, 0, 0);
>         bufsize = sh->size;
> @@ -1183,7 +1184,11 @@
>         handle_stripe(sh);
>         release_stripe(sh);
> 
> -       return (bufsize>>9)-redone;
> +       if ((rtrn = (bufsize>>9) - redone) < 0)
> +       {
> +         printk ("raid5: in raid5_sync_request, bufsize=%d redone=%d rtrn=%d\n", bufsize, redone, rtrn);
> +       }
> +       return rtrn;
>  }
> 
>  /*
> @@ -1241,15 +1246,16 @@
>  {
>         raid5_conf_t *conf = data;
>         mddev_t *mddev = conf->mddev;
> +       int errval = 0;
> 
>         if (!conf->resync_parity)
>                 return;
>         if (conf->resync_parity == 2)
>                 return;
>         down(&mddev->recovery_sem);
> -       if (md_do_sync(mddev,NULL)) {
> +       if ((errval = md_do_sync(mddev,NULL))) {
>                 up(&mddev->recovery_sem);
> -               printk("raid5: resync aborted!\n");
> +               printk("raid5: resync aborted (err=%d)!\n", errval);
>                 return;
>         }
>         conf->resync_parity = 0;
